We are all too familiar with the devastating impact a talented layer 2 loop can have on a data center that lacks sufficient controls and processes. If you are using Cisco Nexus switches in your data center, you will be happy to know that NX-OS offers an interesting new tool you should add to your loop detection list. The somewhat undocumented feature is known as (for lack of a better name) FWM-Loop Detection. FWM refers to the NX-OS Forwarding Manager. In Syslog it is seen as:
%FWM-2-STM_LOOP_DETECT
How does it work? When NX-OS detects a series of MAC flap events that exceeds a Cisco-defined limit, NX-OS classifies that as a potential layer 2 loop. To protect the CPU from continuously updating the CAM table with the same MAC address(es) flapping between interfaces, dynamic MAC address learning is disabled for the entire switch for 3 minutes. This safeguards the switch during that time, in case the problem was a one-off issue. After 3 minutes dynamic MAC learning is enabled again as normal. If the problem persists, dynamic MAC learning is disabled again.
The FWM-Loop Detection feature is enabled by default and has no configurable parameters.
Let’s take a look at this in action. For the output below I have induced a physical loop between two Fabric Extender interfaces and disabled all loop prevention mechanisms. The Syslog events are as follows (with the times omitted):
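To make the behaviour described above a bit more concrete, here is a rough Python sketch of the idea: count MAC moves in a sliding window and, once a limit is exceeded, disable learning switch-wide for 180 seconds. This is only a toy model and not how NX-OS actually implements it. The 180 second hold-down comes from the syslog message, but `MOVE_THRESHOLD`, `WINDOW_SECONDS` and the `LoopDetector` class are my own assumptions, since the real Cisco limit is not documented.

```python
from collections import defaultdict, deque

HOLD_DOWN_SECONDS = 180  # taken from the STM_LOOP_DETECT syslog message
MOVE_THRESHOLD = 5       # ASSUMPTION: the real Cisco-defined limit is not documented
WINDOW_SECONDS = 10      # ASSUMPTION: illustrative sliding window only


class LoopDetector:
    """Toy model of a switch-wide MAC learning hold-down (not NX-OS code)."""

    def __init__(self):
        # (mac, vlan) -> timestamps of recent moves
        self.moves = defaultdict(deque)
        self.learning_disabled_until = 0.0

    def learning_enabled(self, now):
        """Return True if dynamic MAC learning is active at time `now`."""
        return now >= self.learning_disabled_until

    def record_move(self, mac, vlan, now):
        """Record one MAC move; return True if it triggers the hold-down."""
        if not self.learning_enabled(now):
            return False  # nothing is learned while learning is disabled
        recent = self.moves[(mac, vlan)]
        recent.append(now)
        # Forget moves that fell out of the sliding window.
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        if len(recent) > MOVE_THRESHOLD:
            # Penalize the whole switch, not only the offending VLAN.
            self.learning_disabled_until = now + HOLD_DOWN_SECONDS
            recent.clear()
            return True
        return False
```

Feeding the detector one move per second for the same MAC trips the hold-down on the sixth move with these assumed values, and learning comes back 180 seconds later. If the flapping continues after that, the cycle simply repeats.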
%FWM-6-MAC_MOVE_NOTIFICATION: Host 4055.3927.30c2 in vlan 200 is flapping between port Eth119/1/3 and port Po750
%FWM-6-MAC_MOVE_NOTIFICATION: Host 4055.3927.30c2 in vlan 200 is flapping between port Po750 and port Eth119/1/3
%FWM-6-MAC_MOVE_NOTIFICATION: Host 4055.3927.30c2 in vlan 200 is flapping between port Eth119/1/3 and port Eth119/1/4
%FWM-6-MAC_MOVE_NOTIFICATION: Host 4055.3927.30c2 in vlan 200 is flapping between port Eth119/1/4 and port Eth119/1/3
%FWM-6-MAC_MOVE_NOTIFICATION: Host 4055.3927.30c2 in vlan 200 is flapping between port Eth119/1/3 and port Eth119/1/4
%FWM-6-MAC_MOVE_NOTIFICATION: Host 4055.3927.30c2 in vlan 200 is flapping between port Eth119/1/4 and port Eth119/1/3
%FWM-6-MAC_MOVE_NOTIFICATION: Host 4055.3927.30c2 in vlan 200 is flapping between port Eth119/1/3 and port Eth119/1/4
%FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac 4055.3927.30c2 among ports Eth119/1/4 and Eth119/1/3 vlan 200 - Disabling dynamic learn notifications for 180 seconds
As seen above, once the loop is detected, FWM Loop Detection kicks in and disables MAC learning. After 180 seconds MAC learning is enabled again and the following Syslog event is generated:
%FWM-2-STM_LEARNING_RE_ENABLE: Re enabling dynamic learning on all interfaces
Although this feature (as it is today) will not prevent a layer 2 meltdown on its own, it certainly does delay the race condition that ultimately ends in total downtime. I would recommend adding the FWM-2-STM_LOOP_DETECT event to your syslog monitoring events, since it is an effective early warning system that could help in identifying a potential problem in your data center.
On the negative side, the FWM Loop Detect feature currently disables MAC learning for all VLANs, not just the offending VLAN. An enhancement request was logged with vendor-C to only penalize the offending VLAN. Hopefully this will be available in a future code release.
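If your syslog monitoring can run a script, something along these lines could pick the event out of the log stream. It is a minimal sketch that assumes the message format is exactly as shown in the output earlier in this post; the regular expression and the function name `find_loop_events` are my own and not part of any Cisco tooling.

```python
import re

# Pattern built from the STM_LOOP_DETECT message shown in this post.
# ASSUMPTION: other NX-OS releases may word the message differently.
LOOP_RE = re.compile(
    r"%FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac "
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}) among ports "
    r"(?P<port_a>\S+) and (?P<port_b>\S+) vlan (?P<vlan>\d+)"
)


def find_loop_events(lines):
    """Return one dict per STM_LOOP_DETECT message found in `lines`."""
    events = []
    for line in lines:
        match = LOOP_RE.search(line)
        if match:
            events.append(
                {
                    "mac": match.group("mac"),
                    "ports": (match.group("port_a"), match.group("port_b")),
                    "vlan": int(match.group("vlan")),
                }
            )
    return events


if __name__ == "__main__":
    sample = [
        "%FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac "
        "4055.3927.30c2 among ports Eth119/1/4 and Eth119/1/3 vlan 200 - "
        "Disabling dynamic learn notifications for 180 seconds",
        "%FWM-2-STM_LEARNING_RE_ENABLE: Re enabling dynamic learning on all interfaces",
    ]
    for event in find_loop_events(sample):
        print(event)
```

The MAC, the two ports and the VLAN are usually all you need to start tracing the loop, so the sketch extracts only those fields.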
The problem I see with more aggressive configurations is that sometimes the systems guys connect a server to two different switches and then configure bonding/NIC teaming in aggregation mode instead of active-backup mode. In this case the switch would detect a false loop, as the system’s MAC would flap due to a server misconfiguration.
True, but with configurable parameters it could be set to be less restrictive. Also, misconfigured NIC teaming does generate MAC flaps, but not at a high rate in my experience. Even with a NIC teaming problem, the ports that would be error-disabled are the NIC teaming ports, which brings the fault to the server guys.
Another interesting post. Nice one!
So I have just been reading about VMware's load balancing algorithm that will rebalance VM guest traffic based on uplink load. The load is evaluated every 30 seconds and hosts are moved in order to rebalance traffic as required. I am familiar with the effects caused in NX-OS by MAC flapping, so I was wondering if this load rebalancing could trigger the loop detection mechanisms?
Does anyone have any experience using this setup?
That is only triggered if X number of MAC flap events are detected in a set period of time.
So moving a couple of guests should be fine. Moving 1000 might be a different story.
But testing this would be the best way to confirm. :)
STM_LOOP_DETECT is caused by “move backs”. This message indicates that the switch receives
frames with the same source MAC address on these two interfaces and that the switch learns
the same MAC address on these interfaces at a very high speed. The switch detects this as a loop.
The switch disables MAC address learning to protect its control plane.
This is implemented on all VLANs even if the loop occurred on only one VLAN.
In Release 5.2(1)N1(1) and later, this behavior was changed to disable learning on only the VLAN where the loop occurred.
An interesting command that helps with troubleshooting L2 loops in the future is the ‘mac-address-table notification’ command.
Adding this command ensures that an FWM syslog message is displayed whenever there is a MAC address move.
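Once the MAC move messages are showing up in syslog, a small script can tally them per MAC and port pair, which makes the looped ports stand out. This is just a sketch that assumes the FWM-6-MAC_MOVE_NOTIFICATION format shown in the original post; `count_moves` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern built from the MAC_MOVE_NOTIFICATION lines in the original post.
# ASSUMPTION: the wording may differ between platforms and releases.
MOVE_RE = re.compile(
    r"%FWM-6-MAC_MOVE_NOTIFICATION: Host (?P<mac>\S+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<src>\S+) and port (?P<dst>\S+)"
)


def count_moves(text):
    """Count MAC moves per (mac, vlan, unordered port pair) in raw log text."""
    counts = Counter()
    for match in MOVE_RE.finditer(text):
        # Sort the ports so A->B and B->A land in the same bucket.
        pair = tuple(sorted((match.group("src"), match.group("dst"))))
        counts[(match.group("mac"), int(match.group("vlan")), pair)] += 1
    return counts
```

Calling `count_moves(log_text).most_common(5)` then gives the five busiest MAC and port pair combinations, which is normally where I would start looking.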
If the loop is still active, you can also look on the N5k for a port with very high utilization.
‘Show hardware internal carmel rate’ can be useful in this case.
I would recommend that you monitor the logs on the N7k as well.
On the N7k we can find more of the flapping addresses and isolate them from the network by increasing the logging level for “l2fm”.
The flapping MAC addresses are then reported by using:
‘logging level l2fm 5’