In November last year, a pair of Cisco Nexus 5010 switches suddenly started rebooting at random, without any user intervention. Since these boxes fronted a VM environment, stability was an urgent concern. But to stabilize the environment, the root cause of the reboots had to be isolated, and quickly.
The Cisco Nexus platform might not be as mature as many would like, but it is quickly becoming a much-needed switch in next-generation datacenters. Among the things I like most about the Nexus boxes are the readily available local reporting and the intuitive system checks. Obviously there are many other features that make the platform so popular. I’ll cover some of these in time.
Coming back to the rebooting issue: unlike IOS devices, which lose all local logging information on a reboot unless a crash dump was saved to NVRAM, the Nexus writes most of its log information to disk. Thus, even after a reboot, you still have all the information.
The following command proved to be more valuable than its description on CCO.
#show processes log [detail]
The output of this command pointed directly at the protocol that caused the reboots. Some kind of CDP communication somehow managed to upset the two N5K boxes enough to force the reboots.
# sh proc log
Process          PID     Normal-exit  Stack  Core   Log-create-time
---------------  ------  -----------  -----  -----  ---------------
cdp              2823              N      Y      N  Thu Nov 4 11:55:06 2010
cdp              2847              N      Y      N  Thu Nov 4 12:39:06 2010
cdp              2861              N      Y      Y  Thu Nov 4 13:22:07 2010
cdp              2862              N      Y      N  Thu Nov 4 12:52:30 2010
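When the process log grows long, a listing like this can be summarized by script rather than by eye. The following is only a minimal sketch in Python, not a Cisco tool: it assumes the output has been captured as text with one process per line, and the helper names (`parse_process_log`, `abnormal_exit_counts`, `pids_with_core`) are made up for this illustration.

```python
import re
from collections import Counter

# Sample rows copied from the `sh proc log` output in this post.
# The one-row-per-line layout is an assumption about the captured text.
SAMPLE = """\
Process          PID     Normal-exit  Stack  Core   Log-create-time
---------------  ------  -----------  -----  -----  ---------------
cdp              2823              N      Y      N  Thu Nov 4 11:55:06 2010
cdp              2847              N      Y      N  Thu Nov 4 12:39:06 2010
cdp              2861              N      Y      Y  Thu Nov 4 13:22:07 2010
cdp              2862              N      Y      N  Thu Nov 4 12:52:30 2010
"""

# name, PID, three Y/N flags, then the timestamp as free text.
# The header and separator lines do not match and are skipped.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+([YN])\s+([YN])\s+([YN])\s+(.+)$")


def parse_process_log(text):
    """Return one dict per data row of the process log table."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, pid, normal, stack, core, created = match.groups()
            rows.append({
                "process": name,
                "pid": int(pid),
                "normal_exit": normal == "Y",
                "stack": stack == "Y",
                "core": core == "Y",
                "created": created,
            })
    return rows


def abnormal_exit_counts(rows):
    """Count abnormal exits per process name."""
    return Counter(r["process"] for r in rows if not r["normal_exit"])


def pids_with_core(rows):
    """List the PIDs for which a core file was saved."""
    return [r["pid"] for r in rows if r["core"]]


if __name__ == "__main__":
    rows = parse_process_log(SAMPLE)
    print(abnormal_exit_counts(rows))
    print(pids_with_core(rows))
```

A process that shows up repeatedly with Normal-exit set to N, as cdp does here, is the first candidate to inspect with `sh proc log pid`.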
Obviously, disabling CDP globally was an immediate fix. But to actually solve the problem, I still needed to find out where this CDP traffic was coming from, that is, which host had the ability to reboot two Nexus switches.
A closer look at the detailed log of one of these crashes made it evident that the suspect traffic was coming from a host/device connected to Ethernet1/13.
# sh proc log pid 2861
======================================================
Service: cdp
Description: CDP Daemon
Started at Thu Nov 4 13:18:55 2010 (764050 us)
Stopped at Thu Nov 4 13:22:07 2010 (179867 us)
Uptime: 3 minutes 12 seconds
Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Last heartbeat 4.91 secs ago
RLIMIT_AS: 129772032
System image name: n5000-uk9.4.1.3.N2.1a.bin
System image version: 4.1(3)N2(1a) S0
PID: 2861
Exit code: signal 11 (core dumped)
CWD: /var/sysmgr/work
Virtual Memory:
CODE 08048000 - 08082654
DATA 08083654 - 080845E8
BRK 08098000 - 080FB000
STACK BFFFFA60
TOTAL 125016 KB
Register Set:
EBX 00000000 ECX 00000019 EDX 00000000
ESI 00000000 EDI 00000000 EBP BFFFF0A8
EAX 00000000 XDS 0000007B XES 0000007B
EAX FFFFFFFF (orig) EIP 0806106F XCS 00000073
EFL 00010206 ESP BFFFEE30 XSS 0000007B
Stack: 3120 bytes. ESP BFFFEE30, TOP BFFFFA60
0xBFFFEE30: 080A92E4 080878D1 00000004 B7E6B4B3 .....x..........
0xBFFFEE40: 00000065 BFFFEE50 20640000 00000000 e...P.....d ....
0xBFFFEE50: 00000000 00000000 00000000 00000010 ................
0xBFFFEE60: 00000000 00000000 00000000 080A92E0 ................
0xBFFFEE70: 00010000 0808776F 000000CC B7E9C171 ....ow......q...
0xBFFFEE80: 00000001 BFFFF344 00000000 00000001 ....D...........
0xBFFFEE90: 00000002 BFFFF48C BFFFF340 00000000 ........@.......
0xBFFFEEA0: B7FB9AB3 0000000B B7275DCA B7321FF4 .........]'...2.
0xBFFFEEB0: BFFFEF10 BFFFEEDC B7276F4A BFFFEF10 ........Jo'.....
0xBFFFEEC0: BFFFF310 BFFFF32F 00000000 BFFFF32F ..../......./...
0xBFFFEED0: B7321FF4 00000020 BFFFEF10 3331EFFC ..2. .........13
0xBFFFEEE0: B7271E76 BFFFEF10 B7321FF4 BFFFF02C v.'.......2.,...
0xBFFFEEF0: BFFFEF10 BFFFEFFC B7271E8D BFFFEF10 ..........'.....
.....
0xBFFFF2F0: 0809FFD4 B70DCCB0 BFFFF358 B70DB4F9 ........X.......
0xBFFFF300: 080D41CC BFFFF330 BFFFF32C B7DB02E0 .A..0...,.......
0xBFFFF310: 65687445 74656E72 33312F31 00000000 Ethernet1/13....
0xBFFFF320: 00000000 00000000 00000000 00000000 ................
0xBFFFF330: 00000000 00000000 00000000 00000000 ................
0xBFFFF340: 080E0000 00000000 BFFFF368 0807055D ........h...]...
0xBFFFF350: 00000001 00000000 00000000 00 ...
-full output omitted-
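The interface name sits in the ASCII column of the 0xBFFFF310 row, and the hex words on that row decode to the same text. Here is a small Python sketch of that decoding. It assumes the words are 32-bit little-endian values, which is my reading of the x86-style register names in the dump rather than anything documented, and `words_to_ascii` is a made-up helper name.

```python
import struct


def words_to_ascii(hex_words):
    """Decode 32-bit hex words as little-endian bytes.

    Non-printable bytes are shown as '.', like the ASCII column of the dump.
    Little-endian byte order is an assumption based on the x86 registers.
    """
    raw = b"".join(struct.pack("<I", int(word, 16)) for word in hex_words)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Row copied from the stack dump in this post.
    line = "0xBFFFF310: 65687445 74656E72 33312F31 00000000"
    words = line.split()[1:]      # drop the address column
    print(words_to_ascii(words))  # prints "Ethernet1/13...."
```

The same helper can be run over other rows of a dump when the ASCII column is missing or has been mangled by copy and paste.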
In this case it was an ESX host on Eth1/13 sending malformed CDP packets which caused the problem.
I reported this to Cisco TAC. Luckily for everyone else, this was resolved in the NX-OS release that came out in December.
I hope this post helps others track down similar or new problems.
Thanks Ruhan
very useful info and TIP
Very informative, thanks a lot.