Chapter 5 — Link Aggregation, LACP, and VSX Multi-Chassis LAG

A single uplink between two switches is a single point of failure and a single bottleneck. Modern campus and data center designs eliminate both problems by bonding multiple physical links into one logical pipe and by pairing two physical chassis so they appear as one logical forwarding entity. On Aruba AOS-CX, the building blocks for that design are Link Aggregation Groups (LAGs), the Link Aggregation Control Protocol (LACP), and Virtual Switching Extension (VSX).

Section 1 — LAG Fundamentals

Pre-Reading Check — LAG Fundamentals

1. Which AOS-CX command converts a static LAG to a dynamic (LACP) LAG?

interface lag 1 dynamic
lacp mode active
enable lacp
protocol lacp 802.3ad

2. What happens if both ends of a link are configured with LACP passive mode?

The LAG comes up but with reduced bandwidth
The LAG comes up only after a 30-second hold timer
The LAG never comes up because neither side initiates LACPDUs
The LAG falls back to static mode automatically

3. Two 10G members in a LAG. How much throughput can a single TCP flow achieve?

20G — bandwidth aggregates per flow
10G — a single flow rides one member due to consistent hashing
15G — frames load-balance approximately evenly
5G — half of aggregate to allow headroom

4. What problem does lacp fallback-static solve?

It allows mismatched MTU between members
It bypasses hashing for high-priority traffic
It lets one member forward traffic when the LACP partner is silent (e.g., a server during PXE boot)
It forces the LAG to use static mode permanently

Think of a LAG like a multi-lane highway built between two cities. One lane (a single physical link) might carry the load most days, but during rush hour, or when a lane is closed for maintenance, the highway needs more capacity and redundancy. A LAG bundles multiple physical interfaces into one logical highway: bandwidth scales (close to) linearly, and a single lane closure doesn't shut down the road.

Static vs LACP

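AOS-CX supports two LAG flavors: static, where members are bundled by configuration alone with no negotiation, and dynamic (LACP), where the two ends exchange LACPDUs to negotiate the bundle and keep verifying it.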

Analogy: A static LAG is like a handshake agreement between two contractors — fast, but if either side forgets, the work doesn't line up. LACP is the same handshake plus a signed contract that's re-checked every second.

[Animation: LACP Negotiation Handshake]
Active-Active LACP: Switch A and Switch B both send LACPDUs, verify each other's System IDs, then converge to the forwarding state (in sync, collecting, distributing). With lacp rate fast, heartbeats are sent every 1 s.

In AOS-CX, every LAG starts as static. You opt in to LACP by adding lacp mode active (or passive) under the LAG interface.

Static LAG configuration

switch# configure terminal
switch(config)# interface lag 1
switch(config-lag-if)# description "Static LAG to Peer"
switch(config-lag-if)# no shutdown
switch(config-lag-if)# exit
switch(config)# interface 1/1/1
switch(config-if)# lag 1
switch(config-if)# no shutdown
switch(config)# interface 1/1/2
switch(config-if)# lag 1
switch(config-if)# no shutdown

Convert to LACP

switch(config)# interface lag 20
switch(config-lag-if)# description "LACP LAG to Peer"
switch(config-lag-if)# lacp mode active
switch(config-lag-if)# lacp rate fast
switch(config-lag-if)# no shutdown

lacp rate fast shortens the LACPDU interval from 30 s to 1 s. A partner is declared down after three missed LACPDUs, so failure detection drops from roughly 90 s to roughly 3 s.

LACP Active vs Passive

| Mode | Behavior |
| --- | --- |
| Active | Sends LACPDUs continuously. Initiates negotiation. |
| Passive | Only responds to LACPDUs. Will not bring up the LAG without an active partner. |

The rule: at least one side must be active. Passive-on-passive is the silent-treatment failure mode.
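
The rule can be written as a one-line check. The following is a minimal sketch, not switch code; the function name lag_negotiates is hypothetical and only models who initiates LACPDUs.

```python
from itertools import product

def lag_negotiates(local_mode: str, partner_mode: str) -> bool:
    """Return True if at least one end is 'active' and therefore initiates LACPDUs."""
    valid = {"active", "passive"}
    if local_mode not in valid or partner_mode not in valid:
        raise ValueError("mode must be 'active' or 'passive'")
    return "active" in (local_mode, partner_mode)

# Walk all four mode combinations.
for local, partner in product(("active", "passive"), repeat=2):
    outcome = "negotiates" if lag_negotiates(local, partner) else "stays down"
    print(f"{local:8} + {partner:8} -> {outcome}")
```

Only the passive/passive pair stays down, which is why configuring active on both ends is the safe default.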

LAG Hash Algorithms

The switch decides which member port carries each frame using a hash over packet fields (typically src/dst MAC, IP, and L4 ports). Consistency matters because reordering frames within a flow breaks TCP performance. AOS-CX uses an L2/L3/L4 hash by default. Two 10G members yield 20G aggregate but still 10G max per flow.
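
To see why a single flow never exceeds one member's speed, here is a minimal sketch of hash-based member selection. It uses CRC32 purely for illustration; the real hash runs in the switch ASIC and is not CRC32, and the function name pick_member, the port names, and the sample flow are assumptions.

```python
import zlib

def pick_member(flow: tuple, members: list[str]) -> str:
    """Map a flow's header fields to one LAG member, deterministically."""
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

members = ["1/1/1", "1/1/2"]   # two 10G member ports (hypothetical names)
member_gbps = 10

# One TCP flow: src IP, dst IP, protocol, src port, dst port.
flow = ("10.0.0.5", "10.0.1.9", 6, 49152, 443)

# Every frame of the flow hashes to the same member, so ordering is preserved.
chosen = {pick_member(flow, members) for _ in range(1000)}
print("members used by one flow:", len(chosen))
print("per-flow ceiling (Gbps):", member_gbps)
print("aggregate capacity (Gbps):", member_gbps * len(members))
```

Because the mapping depends only on the header fields, the aggregate grows with the number of members while each individual flow stays pinned to one link.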

Member Port Consistency

Always configure VLAN/L3 settings on the LAG interface, never on individual members.

LACP Fallback-Static

From AOS-CX 10.02, lacp fallback-static lets one member port forward traffic if the LACP partner is unresponsive — the classic PXE-boot fix.

switch(config)# interface lag 30
switch(config-lag-if)# lacp mode active
switch(config-lag-if)# lacp fallback-static

Verification

switch# show interface lag 1
switch# show lacp interfaces

A healthy member shows the flags ALFNCD: Active, Long-timeout, Aggregable, In-sync, Collecting, Distributing. With lacp rate fast configured, the timeout flag reads S (Short-timeout) instead of L.

Section 2 — VSX Architecture

Pre-Reading Check — VSX Architecture

1. How many control planes does a VSX pair operate with?

One shared control plane elected from both peers
Two independent control planes that synchronize state
A single primary control plane; the secondary is passive
No control plane — VSX is purely data plane

2. What is the role of the VSX keepalive?

Carry user data when the ISL is down
Synchronize MAC tables between peers
Detect peer aliveness independently of the ISL to prevent split-brain
Provide management access to the secondary peer

3. Why does VSX active-gateway eliminate FHRP failover delay?

It elects a single primary that pre-warms its ARP cache
Both peers always answer for the same virtual IP/MAC simultaneously
It uses GARP every 10 ms
It is faster VRRP — sub-second hello timers

4. The ISL is configured as a multi-chassis LAG with which keyword?

vsx-isl
multi-chassis
peer-link
cluster-link

A LAG handles redundancy within one chassis. But what if the chassis fails? Stacking shares a single control plane (one bug, everyone falls down). VSX takes a different approach.

VSX (Virtual Switching Extension) clusters exactly two AOS-CX switches with independent control planes that synchronize state over a dedicated link. From a downstream device's perspective, the pair acts as one switch — but each peer runs its own copy of OSPF, BGP, STP, and management processes. If one peer reboots or hits a bug, the other keeps forwarding.

Analogy: Stacking is one brain controlling two bodies — efficient until the brain has a stroke. VSX is two pilots in a cockpit, each fully qualified, sharing notes constantly. If one passes out, the other already knows the plan.

[Animation: VSX Architecture (ISL, Keepalive, MC-LAG)]
VSX-1 (primary) and VSX-2 (secondary) each run an independent control plane (OSPF, BGP, STP, management). The ISL carries L2 and MAC synchronization and acts as a fallback path; the keepalive (L3, VRF ka) confirms peer aliveness and prevents split-brain. A downstream server, ToR, or router connects one VSX LAG member to each peer and sees one logical switch. Both peers forward simultaneously (active-active).

ISL — Inter-Switch Link

The ISL is a multi-chassis LAG between the two VSX peers. It carries state synchronization between them (L2 and MAC table sync) and serves as a fallback data path.

Sizing rule: ISL bandwidth should equal the largest single VSX LAG capacity, or higher.
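
The sizing rule reduces to simple arithmetic. The sketch below assumes that a VSX LAG's capacity is the sum of its member speeds across both peers; the LAG names, speeds, and the helper min_isl_gbps are hypothetical.

```python
def min_isl_gbps(vsx_lags: dict[str, list[int]]) -> int:
    """Return the capacity of the largest VSX LAG, in Gbps.

    Assumption: a LAG's capacity is the sum of its member speeds.
    """
    if not vsx_lags:
        raise ValueError("at least one VSX LAG is required")
    return max(sum(speeds) for speeds in vsx_lags.values())

# Hypothetical design: member speeds in Gbps, one member per peer.
vsx_lags = {"lag 10": [25, 25], "lag 11": [10, 10], "lag 12": [40, 40]}
isl_members_gbps = [100, 100]

required = min_isl_gbps(vsx_lags)
print("minimum ISL bandwidth (Gbps):", required)
print("ISL is large enough:", sum(isl_members_gbps) >= required)
```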

interface lag 100 multi-chassis
   description ISL
   no shutdown
   no routing
   vlan trunk native none
   vlan trunk allowed all
   lacp mode active

interface 1/1/4
   lag 100
interface 1/1/5
   lag 100

vsx
   inter-switch-link lag 100

Keepalive

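The keepalive is a dedicated L3 link (not the ISL) used to detect peer aliveness. If the ISL goes down, both peers consult the keepalive. If the peer is still alive but the ISL is down, the secondary disables its VSX LAG member ports and the primary keeps forwarding. If both the ISL and the keepalive are down, each peer assumes its partner is dead and both keep forwarding independently, which is the split-brain condition.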

vrf ka

interface 1/1/6
   vrf attach ka
   ip address 192.168.0.1/30
   no shutdown

vsx
   keepalive peer 192.168.0.2 source 192.168.0.1 vrf ka

Primary vs Secondary Role

vsx
   system-mac 02:01:00:00:01:00
   inter-switch-link lag 100
   role primary
   keepalive peer ...

Both peers actively forward data. The role matters for: configuration sync direction, tie-breaking, and LSU orchestration. A shared system-mac is required so downstream LACP partners see one System ID across both chassis.

Active-Gateway and Active-Forwarding

Both peers respond to the same virtual IP and MAC on a VLAN. Any host can route locally through whichever peer it reaches first — no failover delay because there is no failover.

vlan 10
   name employee

interface vlan 10
   description employee
   vsx-sync active-gateways
   ip address 172.17.0.2/24
   active-gateway ip 172.17.0.1 mac 12:00:00:00:00:01

vsx-sync Configuration

| Common vsx-sync targets | Why |
| --- | --- |
| VLANs | Both peers must know all VLANs traversing the ISL or VSX LAGs |
| Active-gateways | Both peers must answer the same gateway IP/MAC |
| ACLs | Symmetric forwarding requires symmetric policy |
| QoS classifiers and queues | Asymmetric QoS produces asymmetric latency |
| Route maps and prefix lists | Routing policy must match on both peers |
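
The reason these features must match can be shown with a small drift check. This is an illustrative sketch only, not how vsx-sync is implemented; the function config_drift, the feature keys, and the sample values are hypothetical.

```python
def config_drift(primary: dict[str, set[str]],
                 secondary: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per feature, the items present on only one of the two peers."""
    drift = {}
    for feature in sorted(primary.keys() | secondary.keys()):
        only_one_side = primary.get(feature, set()) ^ secondary.get(feature, set())
        if only_one_side:
            drift[feature] = only_one_side
    return drift

# Hypothetical peer configurations: VLAN 30 is missing on the secondary.
primary = {"vlans": {"10", "20", "30"}, "acls": {"EMPLOYEE-IN"}}
secondary = {"vlans": {"10", "20"}, "acls": {"EMPLOYEE-IN"}}

print(config_drift(primary, secondary))
```

Any non-empty result marks a feature where traffic could be treated differently depending on which peer it reaches.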

Section 3 — VSX LAG Configuration

Pre-Reading Check — VSX LAG Configuration

1. When the ISL fails but the keepalive remains up, what happens?

Both peers shut down their VSX LAG members
The secondary disables its VSX LAG member ports; the primary continues forwarding alone
The primary reboots to recover the ISL
Traffic continues normally on both peers

2. What does linkup-delay-timer protect against?

Ports flapping under storm-control
A rebooted peer attracting traffic before it has synced state from its partner — black-holing it
The primary becoming secondary by mistake
LACP partners timing out during link bring-up

3. To configure a VSX LAG that spans both peers, you must use:

Different LAG IDs on each peer
The same LAG ID and the multi-chassis keyword on both peers
Static LAG only — LACP is incompatible with VSX
A reserved LAG number above 256

4. Why must the keepalive ride a different physical path than the ISL?

For QoS prioritization
To reduce LACPDU collisions
If they share fiber, a single fiber cut produces split-brain — defeating the design
VSX licensing requires it

A VSX LAG (also called MC-LAG, multi-chassis LAG) is a single logical LAG whose member ports are split across the two VSX peers. The downstream device believes it is talking to one switch with two links.

Analogy: Two phone lines from two different carriers, but a magic phone that lets you publish one number that rings on both.

Defining VSX Peers and ISL — Putting It All Together

A complete minimal VSX bring-up on the primary:

! 1. Define the ISL LAG
interface lag 100 multi-chassis
   no shutdown
   no routing
   vlan trunk native none
   vlan trunk allowed all
   lacp mode active

interface 1/1/49
   lag 100
interface 1/1/50
   lag 100

! 2. Keepalive in its own VRF
vrf ka
interface 1/1/48
   vrf attach ka
   ip address 192.168.0.1/30
   no shutdown

! 3. Bind under VSX
vsx
   system-mac 02:01:00:00:01:00
   inter-switch-link lag 100
   role primary
   keepalive peer 192.168.0.2 source 192.168.0.1 vrf ka
   linkup-delay-timer 180

The linkup-delay-timer prevents a rebooted peer from forwarding on its VSX LAG members until it has fully synced state. Without it, the rebooted peer might attract traffic and black-hole it for 30+ seconds.

Multi-Chassis LAGs — Configuring a VSX LAG

Identical config on both peers, with one member port from each peer:

interface lag 10 multi-chassis
   description "VSX LAG to ToR-1"
   no shutdown
   no routing
   vlan trunk native 1
   vlan trunk allowed 10,20,30
   lacp mode active

Then, on each peer, add a local member:

! Primary
interface 1/1/1
   description "ToR-1 link 1"
   lag 10

! Secondary
interface 1/1/1
   description "ToR-1 link 2"
   lag 10

Split-Brain Prevention

| ISL State | Keepalive State | Peer Behavior |
| --- | --- | --- |
| Up | Up | Normal active-active forwarding |
| Down | Up | Secondary disables its VSX LAG member ports. Primary continues forwarding alone. |
| Up | Down | Warning logged; forwarding continues; admin should fix keepalive immediately |
| Down | Down | Each peer assumes partner is dead; both forward independently |

Crucial: the keepalive must ride a different physical path from the ISL.
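
The four ISL/keepalive combinations above map directly onto a small decision function. This is a sketch of the logic as the table states it, not of the switch's internal implementation; the function name and return strings are hypothetical.

```python
def vsx_peer_behavior(isl_up: bool, keepalive_up: bool) -> str:
    """Return the expected pair behavior for one ISL/keepalive combination."""
    if isl_up and keepalive_up:
        return "normal active-active forwarding"
    if not isl_up and keepalive_up:
        return "secondary disables VSX LAG members; primary forwards alone"
    if isl_up and not keepalive_up:
        return "forwarding continues; warning logged"
    # Neither link is up: no peer can tell a dead partner from a cut cable.
    return "split-brain: both peers forward independently"

for isl in (True, False):
    for ka in (True, False):
        print(f"ISL up={isl!s:5} keepalive up={ka!s:5} -> {vsx_peer_behavior(isl, ka)}")
```

Only the last case is dangerous, and it requires both links to fail together, which is why they should not share a physical path.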

VSX State Machine

stateDiagram-v2
    [*] --> Booting
    Booting --> LinkupDelay: chassis powers on
    LinkupDelay --> ActiveActive: timer expires; ISL up; KA up
    ActiveActive --> ISLDown_KAUp: ISL fails
    ActiveActive --> ISLUp_KADown: keepalive fails
    ISLDown_KAUp --> SecondaryShutdownVSXLAGs: secondary disables VSX LAG members
    SecondaryShutdownVSXLAGs --> ActiveActive: ISL restored
    ISLUp_KADown --> ActiveActive: keepalive restored
    ActiveActive --> SplitBrain: ISL and KA both fail
    SplitBrain --> ActiveActive: both restored; renegotiate
    ActiveActive --> [*]: graceful shutdown
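
The same diagram can be expressed as a transition table, which makes it easy to replay a failure scenario. This is a minimal sketch: the state names follow the diagram, while the event names and the terminal state "Stopped" are hypothetical labels introduced for illustration.

```python
# (current state, event) -> next state, mirroring the diagram's arrows.
TRANSITIONS = {
    ("Booting", "power_on"): "LinkupDelay",
    ("LinkupDelay", "timer_expired"): "ActiveActive",
    ("ActiveActive", "isl_fail"): "ISLDown_KAUp",
    ("ActiveActive", "keepalive_fail"): "ISLUp_KADown",
    ("ISLDown_KAUp", "secondary_disables_lags"): "SecondaryShutdownVSXLAGs",
    ("SecondaryShutdownVSXLAGs", "isl_restored"): "ActiveActive",
    ("ISLUp_KADown", "keepalive_restored"): "ActiveActive",
    ("ActiveActive", "both_fail"): "SplitBrain",
    ("SplitBrain", "both_restored"): "ActiveActive",
    ("ActiveActive", "graceful_shutdown"): "Stopped",
}

def run(events: list[str], state: str = "Booting") -> str:
    """Apply events in order and return the final state."""
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"no transition from {state} on {event}") from None
    return state

# Replay: boot, ISL failure, secondary isolates itself, ISL comes back.
scenario = ["power_on", "timer_expired", "isl_fail",
            "secondary_disables_lags", "isl_restored"]
print(run(scenario))
```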

Verification

switch# show vsx status
switch# show vsx brief
switch# show vsx config
switch# show vsx config keepalive
switch# show interface lag 10
switch# show lacp interfaces
switch# show lacp aggregates

A healthy show vsx status confirms ISL up, keepalive established, roles primary/secondary, configuration in-sync, and linkup-delay timer expired.

Section 4 — VSX Lifecycle Operations

Pre-Reading Check — VSX Lifecycle Operations

1. During an LSU, which peer is upgraded first?

Primary, then secondary
Secondary, then primary
Both simultaneously, with traffic re-routing externally
Whichever has the most uptime

2. Which platform supports VSF but NOT VSX?

AOS-CX 8325
AOS-CX 8320
AOS-CX 6300
AOS-CX 6400

3. What is the maximum number of switches in a VSX cluster?

2
4
8
10

4. When replacing a failed VSX peer chassis, which is NOT required of the replacement?

Identical model
Same software version as the surviving peer
Same system-mac in VSX config
Identical serial number to the failed unit

Standing up a VSX pair is the easy part. The real test is upgrading it without taking the network down.

Live Software Upgrades (LSU / ISSU)

LSU performs a rolling upgrade of a VSX pair with a single command, run on the primary:

switch# vsx update-software tftp://10.1.1.5/CX-10.07.swi secondary vrf mgmt

The flow:

  1. Image staging — secondary downloads to its secondary partition; primary's running code is untouched.
  2. Secondary reboots to the new image. Primary handles all traffic via VSX LAG redistribution and ISL (~1–3 min window).
  3. Secondary rejoins with the new code, takes over forwarding.
  4. Primary upgrades and reboots. Newly upgraded secondary now handles traffic.
  5. Primary rejoins with the new code. Roles return to original.

[Animation: Live Software Upgrade Sequence (No Traffic Loss)]
Rolling upgrade: secondary first, then primary. The peer not being upgraded keeps every flow forwarding, and show vsx status reports in-sync once both peers run the new image and roles are restored.
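
The five-step flow can be modeled to check its key property: at every step at least one peer is forwarding. This is a toy model of the sequence as described, not of the upgrade mechanism itself; the step labels and the generator name lsu_sequence are hypothetical.

```python
def lsu_sequence():
    """Yield (step, set of peers currently forwarding) for the rolling upgrade."""
    yield "1. image staged", {"primary", "secondary"}
    yield "2. secondary reboots", {"primary"}
    yield "3. secondary rejoins", {"primary", "secondary"}
    yield "4. primary reboots", {"secondary"}
    yield "5. primary rejoins", {"primary", "secondary"}

for step, forwarding in lsu_sequence():
    # The pair never drops to zero forwarding peers.
    assert forwarding, f"no peer forwarding during: {step}"
    print(f"{step:22} forwarding: {', '.join(sorted(forwarding))}")
```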

Pre-flight Checklist

| Check | Command | Why |
| --- | --- | --- |
| Both peers in sync | show vsx status | LSU refuses to start on an unhealthy pair |
| Sufficient disk space | show images | Secondary partition must have room |
| Image accessible | Test TFTP/SCP from both peers | Both peers must reach the image server |
| Linkup-delay-timer set | show vsx config | Prevents black-holing on rejoin |
| Backup config | copy running-config tftp://... | Standard change control |
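
One way to use the checklist is as a go/no-go gate in a change runbook. The sketch below assumes an operator has already recorded each check's result by hand; the check names and the function preflight are hypothetical, and nothing here talks to a switch.

```python
REQUIRED_CHECKS = (
    "peers_in_sync",
    "disk_space_ok",
    "image_reachable_from_both_peers",
    "linkup_delay_timer_set",
    "config_backed_up",
)

def preflight(results: dict[str, bool]) -> list[str]:
    """Return the checks that failed or were never recorded."""
    return [name for name in REQUIRED_CHECKS if not results.get(name, False)]

# Hypothetical results: the config backup has not been taken yet.
results = {
    "peers_in_sync": True,
    "disk_space_ok": True,
    "image_reachable_from_both_peers": True,
    "linkup_delay_timer_set": True,
}

blockers = preflight(results)
print("go" if not blockers else f"no-go, fix first: {blockers}")
```

Treating a missing result as a failure keeps the gate conservative: an unchecked item blocks the upgrade just like a failed one.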

LSU is hardware-specific. Virtual AOS-CX (OVA distribution) cannot be live-upgraded — it requires deploying a new VM and migrating config.

VSX Restore and Recovery

  1. Replace the chassis (identical model required).
  2. Restore configuration from backup.
  3. Reconnect ISL and keepalive cables before powering on, if possible.
  4. The healthy peer detects the new chassis via keepalive and ISL.
  5. vsx-sync pushes synchronized configuration to the new peer automatically.
  6. Verify with show vsx status and confirm both peers are in-sync before declaring complete.

VSX vs Stacking vs VSF — Comparison

| Aspect | Traditional Stacking | VSF | VSX |
| --- | --- | --- | --- |
| Control planes | One shared | One shared (master elected) | Two independent, synchronized |
| Member count | Up to 8 | 2–10 | Exactly 2 |
| Failure domain | Stack master fault risks all | Stack master fault risks all | Peer fault contained |
| Topology | Chain or ring | Chain or ring | Point-to-point (ISL + KA) |
| Use case | Access layer | Access/aggregation | Core, aggregation, DC |
| MC-LAG | LAG within stack | LAG within stack | True MC-LAG, active-active |
| Upgrade impact | Full reboot | Rolling on some platforms | LSU, minimal impact |
| 8320 / 8325 | N/A | Not supported | Supported (only option) |
| 6300 | N/A | Supported (only option) | Not supported |
| 6400 | N/A | Supported | Supported |
| Mgmt | Single IP | Single IP | Two IPs (one per peer) |

The trade-off: VSF is simpler to manage; VSX is more resilient.
