Troubleshooting random Nexus reboots

In November last year, a pair of Cisco Nexus 5010 switches suddenly started rebooting at random, without user intervention. Since these boxes fronted a VM environment, stability was an urgent concern. But to stabilize the environment, the root cause of the reboots had to be isolated, and quickly.

The Cisco Nexus platform might not be as mature as many would like, but it is quickly becoming a much-needed switch in next-generation datacenters. Among the things I like most about the Nexus boxes are the readily available local reporting and the intuitive system checks. Obviously there are many other features that make the platform so popular. I’ll cover some of these in time.

Coming back to the rebooting issue: unlike IOS devices, which lose all local logging information unless a crash dump was saved to NVRAM, the Nexus writes most of its log information to disk. So even after a reboot, you still have all the information.
The following command proved to be more valuable than its description on CCO suggests.

# show processes log [detail]

The output of this command pointed directly at the protocol behind the reboots: the CDP process kept exiting abnormally. Some kind of CDP communication had managed to upset the two N5K boxes enough to force the reboots.

# sh proc log
Process          PID     Normal-exit  Stack  Core   Log-create-time
---------------  ------  -----------  -----  -----  ---------------
cdp              2823              N      Y      N  Thu Nov  4 11:55:06 2010
cdp              2847              N      Y      N  Thu Nov  4 12:39:06 2010
cdp              2861              N      Y      Y  Thu Nov  4 13:22:07 2010
cdp              2862              N      Y      N  Thu Nov  4 12:52:30 2010
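
On a box with a longer crash history, the same table can be summarised with a few lines of script. The following is only a minimal sketch in Python, assuming the output has been captured to text in exactly the column layout shown above; the names parse_process_log and abnormal_exits are my own and not part of any Cisco tool.

```python
import re
from collections import Counter

# Sample copied from the `sh proc log` output above.
SAMPLE = """\
Process          PID     Normal-exit  Stack  Core   Log-create-time
---------------  ------  -----------  -----  -----  ---------------
cdp              2823              N      Y      N  Thu Nov  4 11:55:06 2010
cdp              2847              N      Y      N  Thu Nov  4 12:39:06 2010
cdp              2861              N      Y      Y  Thu Nov  4 13:22:07 2010
cdp              2862              N      Y      N  Thu Nov  4 12:52:30 2010
"""

# One data row: process name, numeric PID, three Y/N flags, timestamp.
# The header and separator rows do not match because their second
# column is not numeric.
ROW = re.compile(
    r"^(?P<process>\S+)\s+(?P<pid>\d+)\s+(?P<normal>[YN])\s+"
    r"(?P<stack>[YN])\s+(?P<core>[YN])\s+(?P<created>.+?)\s*$"
)


def parse_process_log(text):
    """Return one dict per data row of the captured table."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


def abnormal_exits(rows):
    """Count entries per process where Normal-exit is N."""
    return Counter(r["process"] for r in rows if r["normal"] == "N")


rows = parse_process_log(SAMPLE)
print(abnormal_exits(rows))
# PIDs that left a core file behind, worth a closer look.
print([r["pid"] for r in rows if r["core"] == "Y"])
```

A process that dominates the abnormal-exit count, as cdp does here, is the first candidate to inspect with `sh proc log pid`.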

Disabling CDP globally was the obvious immediate fix. But to really solve the problem, it was worth finding out where this CDP traffic was coming from, in other words which host had the ability to reboot two Nexus switches.

A closer look at the log of one of these crashes made it evident that the suspect traffic was coming from a host or device connected to Ethernet1/13: the interface name shows up as an ASCII string in the stack dump, at address 0xBFFFF310.

# sh proc log pid 2861
======================================================
Service: cdp
Description: CDP Daemon

Started at Thu Nov  4 13:18:55 2010 (764050 us)
Stopped at Thu Nov  4 13:22:07 2010 (179867 us)
Uptime: 3 minutes 12 seconds

Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Last heartbeat 4.91 secs ago
RLIMIT_AS: 129772032
System image name: n5000-uk9.4.1.3.N2.1a.bin
System image version: 4.1(3)N2(1a) S0

PID: 2861
Exit code: signal 11 (core dumped)

CWD: /var/sysmgr/work

Virtual Memory:

CODE      08048000 - 08082654
DATA      08083654 - 080845E8
BRK       08098000 - 080FB000
STACK     BFFFFA60
TOTAL     125016 KB

Register Set:

EBX 00000000         ECX 00000019         EDX 00000000
ESI 00000000         EDI 00000000         EBP BFFFF0A8
EAX 00000000         XDS 0000007B         XES 0000007B
EAX FFFFFFFF (orig)  EIP 0806106F         XCS 00000073
EFL 00010206         ESP BFFFEE30         XSS 0000007B

Stack: 3120 bytes. ESP BFFFEE30, TOP BFFFFA60

0xBFFFEE30: 080A92E4 080878D1 00000004 B7E6B4B3 .....x..........
0xBFFFEE40: 00000065 BFFFEE50 20640000 00000000 e...P.....d ....
0xBFFFEE50: 00000000 00000000 00000000 00000010 ................
0xBFFFEE60: 00000000 00000000 00000000 080A92E0 ................
0xBFFFEE70: 00010000 0808776F 000000CC B7E9C171 ....ow......q...
0xBFFFEE80: 00000001 BFFFF344 00000000 00000001 ....D...........
0xBFFFEE90: 00000002 BFFFF48C BFFFF340 00000000 ........@.......
0xBFFFEEA0: B7FB9AB3 0000000B B7275DCA B7321FF4 .........]'...2.
0xBFFFEEB0: BFFFEF10 BFFFEEDC B7276F4A BFFFEF10 ........Jo'.....
0xBFFFEEC0: BFFFF310 BFFFF32F 00000000 BFFFF32F ..../......./...
0xBFFFEED0: B7321FF4 00000020 BFFFEF10 3331EFFC ..2. .........13
0xBFFFEEE0: B7271E76 BFFFEF10 B7321FF4 BFFFF02C v.'.......2.,...
0xBFFFEEF0: BFFFEF10 BFFFEFFC B7271E8D BFFFEF10 ..........'.....
.....
0xBFFFF2F0: 0809FFD4 B70DCCB0 BFFFF358 B70DB4F9 ........X.......
0xBFFFF300: 080D41CC BFFFF330 BFFFF32C B7DB02E0 .A..0...,.......
0xBFFFF310: 65687445 74656E72 33312F31 00000000 Ethernet1/13....
0xBFFFF320: 00000000 00000000 00000000 00000000 ................
0xBFFFF330: 00000000 00000000 00000000 00000000 ................
0xBFFFF340: 080E0000 00000000 BFFFF368 0807055D ........h...]...
0xBFFFF350: 00000001 00000000 00000000 00
...
-full output omitted-

In this case, an ESX host on Eth1/13 was sending malformed CDP packets, which caused the problem.
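
The interface name can also be pulled out of the hex words themselves, which helps when the ASCII column is cut off. Below is a rough sketch in Python, assuming each word in the dump is a 32-bit little-endian value (the x86 register names and the ASCII column both suggest this); stack_bytes and printable_strings are illustrative names of my own.

```python
import re
import struct

# Three lines copied from the stack dump above.
DUMP = """\
0xBFFFF300: 080D41CC BFFFF330 BFFFF32C B7DB02E0 .A..0...,.......
0xBFFFF310: 65687445 74656E72 33312F31 00000000 Ethernet1/13....
0xBFFFF320: 00000000 00000000 00000000 00000000 ................
"""

# An address followed by up to four 8-digit hex words; the trailing
# ASCII column is ignored.
LINE = re.compile(r"^0x[0-9A-Fa-f]{8}:((?:\s+[0-9A-Fa-f]{8}){1,4})")

# Runs of at least six printable ASCII characters.
PRINTABLE = re.compile(rb"[\x20-\x7e]{6,}")


def stack_bytes(dump):
    """Rebuild the raw memory bytes from the hex words of the dump."""
    out = bytearray()
    for line in dump.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        for word in match.group(1).split():
            # Assumption: words are printed as little-endian 32-bit values.
            out += struct.pack("<I", int(word, 16))
    return bytes(out)


def printable_strings(data):
    """Return the printable ASCII runs found in the raw bytes."""
    return [s.decode("ascii") for s in PRINTABLE.findall(data)]


print(printable_strings(stack_bytes(DUMP)))
```

For example, the word 65687445 unpacks to the bytes 45 74 68 65, which is "Ethe", the first four characters of the interface name.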

I reported this to Cisco TAC. Luckily for everyone else, this was resolved in the NX-OS release that came out in December.

I hope this post helps others track down similar or new problems.
