Troubleshooting MAC-Flushes on NX-OS
January 21, 2013
An interesting client problem in one of our multi-tenant data centers came to my attention the other day. A delay-sensitive client noticed a slight increase in latency (20 ms) at very intermittent intervals from his servers in our data center to specific off-net destinations. The increase in latency was localized to the pair of Nexus 7000s functioning as the core switch layer (CSW) and the Layer 3 edge for this particular data center. Beyond that, all appeared normal on the N7K CSWs.
A TCP dump from a normal trunk interface attached to the N7Ks showed unicast traffic on the N7K-2 device while the N7K-1 device was set up to receive Internet traffic inbound and forward it into the data center client VLANs. The N7Ks are set up using Cisco vPC (virtual Port Channel).
When I investigated what appeared to be legitimate unicast traffic, the IP ARP tables showed the relevant destination MAC addresses, with timers that did not indicate any recent problems. The host MAC addresses for these ARP entries, however, were absent from the CAM table. After forcing a refresh of both tables, it was obvious that the MAC address entries were not refreshing as they should.
n7k-2# clear ip arp vlan 600
n7k-2# clear mac address-table dynamic vlan 600
n7k-2# show ip arp vlan 600 | be ARP
IP ARP Table
Total number of entries: 138
Address         Age       MAC Address     Interface
xx.xx.xx.4      00:00:07  0025.9003.855e  Vlan600
xx.xx.xx.5      00:00:07  0025.9003.859a  Vlan600
xx.xx.xx.6      00:00:07  0025.9003.252c  Vlan600
xx.xx.xx.7      00:00:07  0025.9003.2548  Vlan600
xx.xx.xx.8      00:00:07  0025.9003.855e  Vlan600
xx.xx.xx.9      00:00:07  0025.9003.859a  Vlan600
--snip--
n7k-2# show mac add vlan 600 | b VLAN
   VLAN     MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
G 600      0026.980c.dbc2    static       -       F    F  sup-eth1(R)
n7k-2#
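The comparison above (ARP entries present, matching CAM entries missing) can be automated when many VLANs need checking. The following is a minimal sketch, assuming the two command outputs have already been captured as text; the function names are hypothetical and the sample rows are taken from the output above.

```python
import re

# Matches Cisco dotted-hex MAC addresses such as 0025.9003.855e
MAC_RE = re.compile(r"\b[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}\b", re.IGNORECASE)


def macs_in(text):
    """Return the set of MAC addresses found anywhere in CLI output."""
    return {m.lower() for m in MAC_RE.findall(text)}


def arp_without_cam(arp_output, mac_output):
    """MACs that have an ARP entry but no entry in the MAC address table."""
    return sorted(macs_in(arp_output) - macs_in(mac_output))


# Sample rows copied from the "show ip arp" and "show mac address-table" output
arp_output = """\
xx.xx.xx.4  00:00:07  0025.9003.855e  Vlan600
xx.xx.xx.5  00:00:07  0025.9003.859a  Vlan600
xx.xx.xx.6  00:00:07  0025.9003.252c  Vlan600
xx.xx.xx.7  00:00:07  0025.9003.2548  Vlan600
"""
mac_output = "G 600  0026.980c.dbc2  static  -  F  F  sup-eth1(R)\n"

missing = arp_without_cam(arp_output, mac_output)
for mac in missing:
    print("ARP entry without CAM entry:", mac)
```

Any address this reports is a candidate for unknown unicast flooding, since the router can resolve the destination but the switch has no port to forward it to.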
By this stage I had my suspicions about the problem but not yet the exact cause.
NX-OS has a range of very useful (yet poorly documented) internal system commands that offer a great deal more information than the usual show commands. Inspecting the L2FM (Layer 2 Feature Manager) state for a given MAC address could verify my suspicions.
The command below showed a brief historical event log of the Layer 2 MAC Database.
n7k-2# show system internal l2fm l2dbg macdb address 0025.9003.2548 vlan 600
Legend
------
Db: 0-MACDB, 1-GWMACDB, 2-SMACDB, 3-RMDB, 4-SECMACDB
Src: 0-UNKNOWN, 1-L2FM, 2-PEER, 3-LC, 4-HSRP
     5-GLBP, 6-VRRP, 7-STP, 8-DOTX, 9-PSEC
     10-CLI 11-PVLAN 12-ETHPM, 13-ALW_LRN, 14-Non_PI_MOD,
     15-MCT_DOWN, 16 - SDB 17-OTV
Slot:0 based for LCS 19-MCEC 20-OTV/ORIB

VLAN: 600 MAC: 0025.9003.2548

Time                      If/swid     Db  Op      Src  Slot
Wed Jan  9 11:56:54 2013  0x160002ef  0   INSERT  3    19    0
Wed Jan  9 11:56:54 2013  0x160002ef  0   INSERT  2    0     15
Wed Jan  9 11:56:56 2013  0x160002ef  0   FLUSH   0    0     15
Wed Jan  9 11:56:56 2013  0x160002ef  0   DELETE  0    0     15
Wed Jan  9 11:56:56 2013  0x160002ef  0   INSERT  3    19    0
Wed Jan  9 11:56:56 2013  0x160002ef  0   INSERT  2    0     15
Wed Jan  9 11:56:56 2013  0x160002ef  0   FLUSH   2    0     15
Wed Jan  9 11:56:56 2013  0x160002ef  0   DELETE  0    0     15
Wed Jan  9 11:56:56 2013  0x160002ef  0   INSERT  3    19    0
Wed Jan  9 11:56:56 2013  0x160002ef  0   INSERT  2    0     15
The output above indicated why the MAC addresses were not seen in the CAM table. They were continually flushed.
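For longer event logs it can help to tally the operations rather than read them row by row. The sketch below is an illustration only: it assumes the event rows have the same layout as the output above (timestamp, If/swid, Db, Op), and the function name is hypothetical.

```python
import re
from collections import Counter

# One event row: "<weekday> <month> <day> <hh:mm:ss> <year> <if/swid> <db> <op> ..."
EVENT_RE = re.compile(
    r"^\w{3}\s+\w{3}\s+\d+\s+(\d\d:\d\d:\d\d)\s+\d{4}\s+0x[0-9a-f]+\s+\d+\s+([A-Z]+)\b",
    re.IGNORECASE | re.MULTILINE,
)


def summarize_events(log_text):
    """Return (operations counter, flushes-per-timestamp counter)."""
    events = EVENT_RE.findall(log_text)  # list of (time, op) tuples
    ops = Counter(op.upper() for _, op in events)
    flushes = Counter(t for t, op in events if op.upper() == "FLUSH")
    return ops, flushes


# Event rows copied from the l2dbg macdb output above
log_text = """\
Wed Jan 9 11:56:54 2013 0x160002ef 0 INSERT 3 19 0
Wed Jan 9 11:56:54 2013 0x160002ef 0 INSERT 2 0 15
Wed Jan 9 11:56:56 2013 0x160002ef 0 FLUSH 0 0 15
Wed Jan 9 11:56:56 2013 0x160002ef 0 DELETE 0 0 15
Wed Jan 9 11:56:56 2013 0x160002ef 0 INSERT 3 19 0
Wed Jan 9 11:56:56 2013 0x160002ef 0 INSERT 2 0 15
Wed Jan 9 11:56:56 2013 0x160002ef 0 FLUSH 2 0 15
Wed Jan 9 11:56:56 2013 0x160002ef 0 DELETE 0 0 15
Wed Jan 9 11:56:56 2013 0x160002ef 0 INSERT 3 19 0
Wed Jan 9 11:56:56 2013 0x160002ef 0 INSERT 2 0 15
"""

ops, flushes = summarize_events(log_text)
print(dict(ops))
print(dict(flushes))
```

Several FLUSH events for the same MAC address within the same second, as in this sample, are a strong hint that something other than normal aging is clearing the entry.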
This explained the rogue unicast traffic. What happens to unicast traffic with valid IP ARP entries when no usable MAC addresses are available for forwarding? It is flooded using a mechanism known as unknown unicast flooding.
With the problem described originally, the MAC flushes also explained the latency spikes, as one of the VLANs in question belonged to a content provider carrying large amounts of traffic. Every time the CDN hit a specific volume of traffic, the unicast flooding increased the queue depths on certain N7K trunk links to customers. Given the large volumes of traffic, this was enough to increase the latency for some customers.
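The forwarding decision behind unknown unicast flooding can be sketched in a few lines. This is a simplified model and not how NX-OS implements it; the port names (Po1, Po10, and so on) and the function name are hypothetical.

```python
def egress_ports(mac_table, vlan_ports, vlan, dst_mac, ingress_port):
    """Return the ports a unicast frame is sent out of.

    mac_table maps (vlan, mac) -> port; vlan_ports maps vlan -> list of ports.
    """
    port = mac_table.get((vlan, dst_mac))
    if port is not None:
        # Known unicast: forward out the single learned port
        # (nothing is sent if that is the port the frame arrived on).
        return [] if port == ingress_port else [port]
    # Unknown unicast: flood to every other port in the VLAN.
    return [p for p in vlan_ports[vlan] if p != ingress_port]


# Hypothetical topology: one uplink (Po1) and three customer trunks in VLAN 600
vlan_ports = {600: ["Po1", "Po10", "Po20", "Po30"]}
mac_table = {(600, "0025.9003.2548"): "Po10"}

known = egress_ports(mac_table, vlan_ports, 600, "0025.9003.2548", "Po1")
print("entry present:", known)

mac_table.clear()  # models the MAC entry being flushed
flooded = egress_ports(mac_table, vlan_ports, 600, "0025.9003.2548", "Po1")
print("entry flushed:", flooded)
```

With the entry present the frame leaves on one port; once it is flushed, the same frame is copied to every other trunk in the VLAN, which is why the traffic showed up on links that should never have carried it.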
To isolate the cause of the MAC flushes, either the following system internal command:
show spanning-tree internal event-history all brief
or the normal “show spanning-tree detail” command could be used. The latter showed the cause of the MAC flushes:
n7k-2# show spann detail | inc exec|from|occurr
 MST0000 is executing the mstp compatible Spanning Tree protocol
  Number of topology changes 18167 last change occurred 0:00:06 ago
          from port-channel753
 MST0001 is executing the mstp compatible Spanning Tree protocol
  Number of topology changes 17726 last change occurred 0:00:06 ago
          from port-channel753
 MST0002 is executing the mstp compatible Spanning Tree protocol
  Number of topology changes 18390 last change occurred 0:01:04 ago
          from port-channel750
--snip--
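With many MST instances, the same filtered output can be parsed to rank instances by topology change count and to see which ports the changes come from. This is a rough sketch that assumes the line wording shown above; the function name is hypothetical.

```python
import re
from collections import Counter

# Captures instance name, change count, age of last change, and source port.
ENTRY_RE = re.compile(
    r"(MST\d+) is executing.*?"
    r"Number of topology changes (\d+) last change occurred (\S+) ago\s+"
    r"from (\S+)",
    re.DOTALL,
)


def topology_changes(output):
    """Return (instance, changes, last_change_age, port) tuples, busiest first."""
    rows = [(m[1], int(m[2]), m[3], m[4]) for m in ENTRY_RE.finditer(output)]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Sample copied from the "show spanning-tree detail" output above
output = """\
MST0000 is executing the mstp compatible Spanning Tree protocol
  Number of topology changes 18167 last change occurred 0:00:06 ago
          from port-channel753
MST0001 is executing the mstp compatible Spanning Tree protocol
  Number of topology changes 17726 last change occurred 0:00:06 ago
          from port-channel753
MST0002 is executing the mstp compatible Spanning Tree protocol
  Number of topology changes 18390 last change occurred 0:01:04 ago
          from port-channel750
"""

rows = topology_changes(output)
sources = Counter(port for _, _, _, port in rows)
for instance, changes, age, port in rows:
    print(instance, changes, age, port)
print(sources.most_common())
```

The port that appears most often as the source of the last change is the first place to look for the device generating the topology changes.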
When an MST switch receives a TCN (Topology Change Notification), the associated MAC addresses in the CAM table are flushed. This is done to allow quicker convergence than the traditional STP implementation, but on the flip side, continual TCNs have a negative effect, as seen here. In this case the TCNs were generated by an incorrectly configured switch.
– – – –
For more information about how STP and MST operate, be sure to go through the switching chapter in the Routing-Bits RS Handbook.