Troubleshooting a Cisco 6500 crash

I was asked recently to share some knowledge about the support of the Cisco 6500 switches as the information available on the DOC-CD could be fairly overwhelming.

As it happens, a client’s Cisco 6509 switch fell over yesterday. I was called out because the 6509 had decided it was tired of life and rebooted itself. I’ll go through some of the steps I took to find the root cause. Note that the steps listed here will not find the cause of every possible issue with a 6500 switch, but they can be used as a guideline.

Usually the first thing I do is check the reason for the reboot with “sh version”. Look at the uptime and the “System returned to ROM” line.

ndcbbnpendc0103#sh ver
Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(18)SXF6, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2006 by cisco Systems, Inc.
Compiled Mon 18-Sep-06 23:32 by tinhuang
Image text-base: 0x40101040, data-base: 0x42D90000

ROM: System Bootstrap, Version 12.2(17r)SX5, RELEASE SOFTWARE (fc1)
BOOTLDR: s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(18)SXF6, RELEASE SOFTWARE (fc1)

ndcbbnpendc0103 uptime is 3 hours, 23 minutes
Time since ndcbbnpendc0103 switched to active is 3 hours, 22 minutes
System returned to ROM by s/w reset at 00:14:27 PDT Wed Sep 20 2006 (SP by bus error at PC 0x402DC89C, address 0x0)
System restarted at 09:13:44 ZA Wed Mar 10 2010
System image file is "disk0:s72033-adventerprisek9_wan-mz.122-18.SXF6.bin"

It is clear that the switch did a software reset caused by ‘bus error at PC 0x402DC89C, address 0x0’.
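If you collect “sh ver” output from many boxes, pulling the reset reason out by script saves some squinting. Below is a rough Python sketch of how that could be done. The pattern and the function name parse_reset_reason are my own and only cover the “bus error” wording seen in this output; other reset reasons are worded differently and would need their own patterns.

```python
import re

# Line copied from the "sh ver" output above.
line = ("System returned to ROM by s/w reset at 00:14:27 PDT Wed Sep 20 2006 "
        "(SP by bus error at PC 0x402DC89C, address 0x0)")

# Hypothetical pattern: it only matches the "<cpu> by <error> at PC ..., address ..."
# form shown above, not every reset reason IOS can print.
PATTERN = re.compile(
    r"System returned to ROM by (?P<reason>.+?) at .*"
    r"\((?P<cpu>\w+) by (?P<error>.+?) at PC (?P<pc>0x[0-9A-Fa-f]+), "
    r"address (?P<address>0x[0-9A-Fa-f]+)\)"
)


def parse_reset_reason(text):
    """Return the reset details as a dict, or None if the line does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    info = match.groupdict()
    # Keep PC and the accessed address apart: they are two different values.
    info["pc"] = int(info["pc"], 16)
    info["address"] = int(info["address"], 16)
    return info


result = parse_reset_reason(line)
print(result)
```

The accessed address (0x0 here) is the value to compare against the memory regions, not the PC.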

A system encounters a bus error when the processor tries to access a memory location that either does not exist (software) or does not respond properly (hardware). The memory location that this router tried to access was ‘0x0’. Do not confuse this with the program counter (PC) value above. Once you have the address that was being accessed when the bus error occurred, the command “show region” can be used to determine which memory region that address corresponds to.

If the address falls within one of the ranges in the “show region” output, it means that the router was accessing a valid memory address, but the hardware corresponding to that address is not responding properly. This would indicate a hardware problem.

If the address reported by the bus error does not fall within the ranges displayed in the “show region” output, it means that the router was trying to access an address that is not valid. This indicates a Cisco IOS Software problem. From the output below it is clear that ‘0x0’ does not fall within any memory region.

ndcbbnpendc0103#sh region
Region Manager:
      Start         End     Size(b)  Class  Media  Name
 0x08000000  0x0BFFFFFF    67108864  Iomem  R/W    iomem
 0x08B69D40  0x08C77813     1104596  Criti  R/W    iomem:Critical I/O
 0x40000000  0x4BFFFFFF   201326592  Local  R/W    main
 0x40101040  0x42D8FFFF    46723008  IText  R/O    main:text
 0x42D90000  0x430A83BF     3244992  IData  R/W    main:data
 0x430A83C0  0x44AAE4DF    27287840  IBss   R/W    main:bss
 0x44AAE4E0  0x4BFFFFFF   123018016  Local  R/W    main:heap
 0x50000000  0x7FFF7FFF   805273600  Local  R/W    more heap
 0x52A11DC0  0x538A42EB    15279404  Criti  R/W    more heap:Critical Processor
 0x80000000  0x8BFFFFFF   201326592  Local  R/W    main:(main_k0)
 0xA0000000  0xABFFFFFF   201326592  Local  R/W    main:(main_k1)
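The range check described above is easy to get wrong by eye when the hex values look alike, so here is a small Python sketch of it. It assumes the “sh region” rows are pasted in as text; the names parse_regions, regions_containing and likely_cause are mine, and the hardware/software verdict is only the rule of thumb from this post, not a diagnosis.

```python
import re

# Rows copied from the "sh region" output above.
SHOW_REGION = """\
 0x08000000  0x0BFFFFFF    67108864  Iomem  R/W    iomem
 0x08B69D40  0x08C77813     1104596  Criti  R/W    iomem:Critical I/O
 0x40000000  0x4BFFFFFF   201326592  Local  R/W    main
 0x40101040  0x42D8FFFF    46723008  IText  R/O    main:text
 0x42D90000  0x430A83BF     3244992  IData  R/W    main:data
 0x430A83C0  0x44AAE4DF    27287840  IBss   R/W    main:bss
 0x44AAE4E0  0x4BFFFFFF   123018016  Local  R/W    main:heap
 0x50000000  0x7FFF7FFF   805273600  Local  R/W    more heap
 0x52A11DC0  0x538A42EB    15279404  Criti  R/W    more heap:Critical Processor
 0x80000000  0x8BFFFFFF   201326592  Local  R/W    main:(main_k0)
 0xA0000000  0xABFFFFFF   201326592  Local  R/W    main:(main_k1)
"""

# Start, End, Size(b), Class, Media, Name (the name may contain spaces).
ROW = re.compile(
    r"^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)\s+\d+\s+\S+\s+\S+\s+(.+?)\s*$"
)


def parse_regions(text):
    """Turn the table rows into (start, end, name) tuples."""
    regions = []
    for row in text.splitlines():
        match = ROW.match(row)
        if match:
            regions.append((int(match.group(1), 16),
                            int(match.group(2), 16),
                            match.group(3)))
    return regions


def regions_containing(address, regions):
    """Names of every region whose Start..End range includes the address."""
    return [name for start, end, name in regions if start <= address <= end]


def likely_cause(address, regions):
    """Rule of thumb from this post: inside a region -> hardware, else software."""
    return "hardware" if regions_containing(address, regions) else "software"


regions = parse_regions(SHOW_REGION)
print(likely_cause(0x0, regions))
```

For the address 0x0 from the bus error, no region matches, which is the same conclusion reached by reading the table.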

The output of the “show stacks” command can then be used to identify the Cisco IOS Software bug that caused the bus error. It might be a bit overwhelming with all the garbage it spits out, but you will get used to the output soon enough. Alternatively, you can use the Cisco Output Interpreter to decode the output. I have posted the relevant portion here:

ndcbbnpendc0103#sh stack
--omitted--
Mar  9 17:01:58: %DIAG-SP-6-BYPASS: Module 4: Diagnostics is bypassed
Mar  9 17:01:58: %OIR-SP-6-INSCARD: Card inserted in slot 4, interfaces are now online
Mar 10 09:10:56: %C6K_PLATFORM-SP-2-PEER_RESET: SP is being reset by the RP

%Software-forced reload

Breakpoint exception, CPU signal 23, PC = 0x402DC89C

-Traceback= 402DC89C 402DA828 40435C38 40436DF8 404243B8 40424510 402CF4DC
--omitted--

If you Google the %C6K_PLATFORM-SP-2-PEER_RESET error, you should find the following:
Condition: Relates to WS-SUP720-3B running Cisco IOS Release 12.2(18)SXF2. The trigger for the crash is unknown.
Workaround: There is no workaround.

What have we established so far?

  • A system bus error occurred when the processor tried to access something that doesn’t exist. This points to a bug in the IOS version the switch is currently running.
  • According to the bug description, the trigger for the crash is unknown and there is no published workaround.

Where to from here? It is reasonably safe to assume that an IOS upgrade should rectify the problem. But in production, life is not that simple or quick. Upgrading the IOS of an in-production device usually requires a painful change control process, which can take some time.

How do you prevent this from happening again until the IOS upgrade? Well you need to know what triggered the IOS bug causing the 6500 to go belly up. For this the crashinfo is vital. The crashinfo should tell us exactly what happened right before the software reload. Again the output from this can be overwhelming.

Using the command “more {location}:{crashfile}”, you will see a list of commands and logging events that happened. What you are looking for is the very last event, usually just before the traceback. Look at the last “CMD:” line:

ndcbbnpendc0103#more bootflash:crashinfo_20100310-071049
--omitted--
CMD: 'sh crypto isakmp sa vrf vpnafg' 09:10:11 ZA Wed Mar 10 2010
CMD: ' sh run int vlan1188' 09:10:34 ZA Wed Mar 10 2010
CMD: 'conf t' 09:10:37 ZA Wed Mar 10 2010
CMD: 'interface Vlan1188' 09:10:42 ZA Wed Mar 10 2010
CMD: 'no crypto map vpnafg-dtt-map redundancy VPNHA' 09:10:49 ZA Wed Mar 10 2010

Address Error (load or instruction fetch) exception, CPU signal 10, PC = 0x42172D1C

-Traceback= 42172D1C 42173324 4217341C 4217348C 42173DE4 4217B710 4217B470 4114FAD8 4113CEC4 4045C740 41047A68 4046AB48 4102F70C 4102F6F8
$0 : 00000000, AT : 430A0000, v0 : 53A8EF94, v1 : 00000000
a0 : 5283DE80, a1 : 291C4A3D, a2 : 0D0D0D0D, a3 : 410156BC
t0 : 00000010, t1 : 00000010, t2 : 00000000, t3 : FFFF00FF
t4 : 41D4CA58, t5 : 458AEDC8, t6 : 458AEDC4, t7 : 458AEDC0
s0 : 5310C4B4, s1 : 483DCC00, s2 : 483DCBF0, s3 : 00000010
s4 : 458C4CA4, s5 : 00000000, s6 : 00000000, s7 : 00000000
t8 : 44AC1848, t9 : 00000000, k0 : 475ACDB0, k1 : 41D52C68
gp : 430AE700, sp : 54327600, s8 : 43800000, ra : 42173324
EPC  : 42172D1C, ErrorEPC : BFC2A65C, SREG     : 3400F103
MDLO : 00000002, MDHI     : 1D59CAA0, BadVaddr : 0D0D0D19
Cause 80000C10 (Code 0x4): Address Error (load or instruction fetch) exception
--omitted--

Here you can clearly see that when the crypto map was removed from interface Vlan1188, the 6500’s IOS choked. To prevent this, either block that command via TACACS or instruct everyone not to use it until the IOS gets upgraded.
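Crashinfo files can run to thousands of lines, so it can help to let a script pick out the commands entered just before the exception. The following is a minimal Python sketch under a few assumptions: the crashinfo text uses the “CMD: '...' timestamp” format shown above, the exception line contains the words “exception, CPU signal”, and the function name commands_before_crash is my own.

```python
import re

# Excerpt copied from the crashinfo output above.
CRASHINFO = """\
CMD: 'sh crypto isakmp sa vrf vpnafg' 09:10:11 ZA Wed Mar 10 2010
CMD: ' sh run int vlan1188' 09:10:34 ZA Wed Mar 10 2010
CMD: 'conf t' 09:10:37 ZA Wed Mar 10 2010
CMD: 'interface Vlan1188' 09:10:42 ZA Wed Mar 10 2010
CMD: 'no crypto map vpnafg-dtt-map redundancy VPNHA' 09:10:49 ZA Wed Mar 10 2010

Address Error (load or instruction fetch) exception, CPU signal 10, PC = 0x42172D1C

-Traceback= 42172D1C 42173324 4217341C 4217348C 42173DE4 4217B710 4217B470 4114FAD8 4113CEC4 4045C740 41047A68 4046AB48 4102F70C 4102F6F8
"""

# Assumed format: CMD: '<command>' <hh:mm:ss> <rest of timestamp>
CMD_LINE = re.compile(r"^CMD: '(?P<command>.*)' (?P<time>\d{2}:\d{2}:\d{2} .+)$")


def commands_before_crash(text, count=3):
    """Return the last `count` (command, timestamp) pairs logged before the exception."""
    commands = []
    for row in text.splitlines():
        # Stop at the first exception line; anything after it is post-crash detail.
        if "exception, CPU signal" in row:
            break
        match = CMD_LINE.match(row.strip())
        if match:
            commands.append((match.group("command").strip(), match.group("time")))
    return commands[-count:]


for command, when in commands_before_crash(CRASHINFO):
    print(when, "->", command)
```

The last pair returned is the prime suspect, but it is still worth reading the few commands before it for context, as the configuration mode and interface matter here.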

I hope this provides some good insight into how to deal with a 6500 crash :)

14 thoughts on “Troubleshooting a Cisco 6500 crash”

  1. Great Explanation as always Ruhann!

    I see your point SXF6 is a couple of years old. I feel your pain trying to upgrade the old code to Safe Harbor.

  2. Nice explanation. Can you please advise how you learnt this information. I don’t believe there is a whitepaper that explains it so simply. There are quite a few other troubleshooting commands which we are always told only Cisco Tac can debug. It would be neat to know and interpret these commands and if there is a resource out there that gives this information it would be very helpful. thx

    1. Thanks :)
      Unfortunately there is no magic site I used other than CCO. Would be nice if there was. What I know is what I encountered in the last 4 years. This is the biggest reason I blog, to share knowledge like so many others out there :)

  3. Very good info, Thank you very much.
    I would like to give you my 6509 crash logs which happened recently, can you please help me to understand it.

    ———————————————-
    *Jun 27 18:40:53: %FABRIC-SP-5-CLEAR_BLOCK: Clear block option is off for the fabric in slot 5.
    *Jun 27 18:40:53: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 5 became active.
    *Jun 27 18:40:54: %CPU_MONITOR-3-PEER_EXCEPTION: CPU_MONITOR peer has failed due to exception , reset by [5/0]

    %Software-forced reload

    Early Notification of crash condition..

    18:40:55 GMT Mon Jun 27 2011: Breakpoint exception, CPU signal 23, PC = 0x428142FC
    ———————————————-

    Can you please help me identify the reason for the crash? I tried to understand it but failed :(

    Thanks,
    Manjunath

    1. pleasure.
      send me an email, with the following details in a txt file and I will have a look :)

      #sh ver
      #sh log
      #sh stack
      #sh region
      #more bootflash:{crash-file}

  4. What can I do? I have a core 6500 that loads IOS at power-on and then reloads into ROMMON mode.

    boot bootflash:s72033-ipbase-mz.151-1.SY.bin

    Router#
    Router#
    Router#
    Router#
    Router#
    Router#

    %Software-forced reload

    Early Notification of crash condition..

    00:01:36 UTC Sat Jan 1 2000: Breakpoint exception, CPU signal 23, PC = 0x4143178C

    ——————————————————————–
    Possible software fault. Upon reccurence, please collect
    crashinfo, “show tech” and contact Cisco Technical Support.
    ——————————————————————–

    -Traceback= 4143178C 4142FC18 40DF08BC 42057828 420594A8 40FBD95C 40FBDB90 41424524
    $0 : 00000000, AT : 43DC0000, v0 : 00000000, v1 : 00000000
    a0 : 510E2D2C, a1 : 43225ADD, a2 : 00000000, a3 : 00000000
    t0 : 405D6720, t1 : 3400F101, t2 : 3400C100, t3 : FFFF00FF
    t4 : 45880000, t5 : 2E000000, t6 : 00000000, t7 : FFFFFFFF
    s0 : 00000000, s1 : 43AD0000, s2 : 472F1530, s3 : 43A10000
    s4 : 471AC89C, s5 : 00000002, s6 : 00000000, s7 : 00000000
    t8 : 00000000, t9 : 00000000, k0 : 00000000, k1 : 00000000
    gp : 43DC6A70, sp : 5000DCB0, s8 : 00000000, ra : 4142FC18
    EPC : 4143178C, ErrorEPC : BFC00000, SREG : 3400F103
    MDLO : 000043CC, MDHI : 00000000, BadVaddr : 00000000
    TEXT_START : 0x4010F758
    DATA_START : 0x4389E740
    Cause 00000824 (Code 0x9): Breakpoint exception

    Writing crashinfo to bootflash:crashinfo_RP_20000101-000136-UTC

    === Flushing messages (00:01:36 UTC Sat Jan 1 2000) ===

    Buffered messages:

    *Jan 1 00:00:21.511: %IFMGR-7-NO_IFINDEX_FILE: Unable to open nvram:/ifIndex-ta
    ble No such file or dire
    *Jan 1 00:00:22.327: RP: Currently running ROMMON from S (Gold) region
    *Jan 1 00:00:48.791: %SYS-5-RESTART: System restarted —
    Cisco IOS Software, s72033_rp Software (s72033_rp-IPBASE-M), Version 15.1(1)SY,
    RELEASE SOFTWARE (fc2)
    Technical Support: http://www.cisco.com/techsupport
    Copyright (c) 1986-2012 by Cisco Systems, Inc.
    Compiled Tue 09-Oct-12 15:13 by prod_rel_team
    *Jan 1 00:00:08.303: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 t
    o ensure console debugging output.
    *Jan 1 00:00:19.915: SP: SP: Currently running ROMMON from S (Gold) regionin
    Loading image, pl
    *Jan 1 00:00:25.427: %C6K_PLATFORM-SP-4-CONFREG_BREAK_ENABLED: The default fact
    Stack pointer : 0x8FFFFF80
    ory setting for config register is 0x2102.It is advisable to retain 1 in 0x2102
    monra : 0xBFC26C54
    edata :
    as it prevents returning to ROMMON when break is issued.
    memsize : 0x10000000
    *Jan 1 00:00:47.999: %SYS-SP-5-RESTART: System restarted —
    comp_size : 0x05B38A00
    Cisco IOS Software, s72033_sp Software (s72033_sp-IPBASE-M), Version 15.1(1)SY,
    uncomp_checksum : 0xDA7140AB
    *Jan 1 00:00:49.339: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
    *Jan 1 00:00:49.543: %C6KPWR-SP-4-PSREDUNDANTBOTHSUPPLY: in power-redundancy mo
    de, system is operating on both power supplies.
    *Jan 1 00:00:57.383: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Modul
    e in slot 6 became active.
    *Jan 1 00:00:59.779: %DIAG-SP-6-RUN_MINIMUM: Module 6: Running Minimal Diagnost
    ics…
    *Jan 1 00:01:22.127: %SYS-5-CONFIG_I: Configured from console by console
    Queued messages:
    *Jan 1 00:01:36.891: %SYS-3-LOGGER_FLUSHING: System pausing to ensure console d
    ebugging output.

    *Jan 1 00:01:36.819: %CPU_MONITOR-3-PEER_EXCEPTION: CPU_MONITOR peer has failed
    due to exception , reset by [6/0]
    *** System received a Software forced crash ***
    signal= 0x17, code= 0x24, context= 0x458822bc
    PC = 0x41424d4c, SP = 0x43acbf18, RA = 0x41432374
    Cause Reg = 0x00003820, Status Reg = 0x34008002
    rommon 1 >
    rommon 1 >
    rommon 1 >
    rommon 1 >

    1. A couple of things to check:
      – Confirm the 15.1 code you are loading is compatible with your Sup module
      – What is the config-register set to?
      – Have a look at the crash information file (more file:)

  5. Hi, I have faced a similar issue in my organisation

    XXXXXXXX uptime is 1 hour, 3 minutes
    Uptime for this control processor is 1 hour, 4 minutes
    Time since XXXXXXXXXX switched to active is 1 hour, 2 minutes
    System returned to ROM by s/w reset at 14:18:34 CEST Tue Jul 8 2014 (SP by bus error at PC 0x418A409C, address 0xEF4321CD)
    System restarted at 14:22:07 CEST Tue Jul 8 2014
    System image file is “sup-bootdisk:s72033-ipbasek9-mz.122-33.SXI6.bin”

    last known command executed before software reload

    CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc addrgroup DHCP_SERVERS eq bootps ‘ 14:16:58 CEST Tue Jul 8 2014
    CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc addrgroup DHCP_SERVERS eq bootps ‘ 14:16:58 CEST Tue Jul 8 2014
    CMD: ‘no 5’ 14:17:37 CEST Tue Jul 8 2014
    CMD: ‘no 5’ 14:17:37 CEST Tue Jul 8 2014
    CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc any eq bootps ‘ 14:17:48 CEST Tue Jul 8 2014
    CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc any eq bootps ‘ 14:17:48 CEST Tue Jul 8 2014
    CMD: ‘no 5’ 14:18:24 CEST Tue Jul 8 2014
    Jul 8 14:18:24 CEST: %VSLP-SW1_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te1/5/4: Link down
    Jul 8 14:18:24 CEST: %VSL-SW1_SPSTBY-5-VSL_CNTRL_LINK: New VSL Control Link Te1/5/5
    Jul 8 14:18:24 CEST: %VSLP-SW1_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te1/5/5: Link down
    Jul 8 14:18:24 CEST: %VSLP-SW1_SPSTBY-2-VSL_DOWN: Last VSL interface Te1/5/5 went down
    Jul 8 14:18:24 CEST: %SATVS_IBC-SW1_SPSTBY-5-VSL_DOWN_SCP_DROP: VSL inactive – dropping cached SCP packet: (SA/DA:0x4/0x4, SSAP/DSAP:0x2/0x1, OP/SEQ:0x1E/0xFFF0, SIG/INFO:0x1/0x501, eSA:0000.0500.0000)
    Jul 8 14:18:25 CEST: %VSLP-SW1_SPSTBY-2-VSL_DOWN: All VSL links went down while switch is in Standby role
    Jul 8 14:18:25 CEST: %DUAL_ACTIVE-SW1_SPSTBY-1-VSL_DOWN: VSL is down – switchover, or possible dual-active situation has occurred
    Jul 8 14:18:25 CEST: %PFREDUN-SW1_SPSTBY-6-ACTIVE: Initializing as Virtual Switch ACTIVE processor
    Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 10.213.197.22 port 514 started – CLI initiated
    Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 10.213.197.23 port 514 started – CLI initiated
    Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 193.28.206.208 port 514 started – CLI initiated
    Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 193.28.206.209 port 514 started – CLI initiated
    Jul 8 14:18:26 CEST: %SYS-SW1_SPSTBY-3-LOGGER_FLUSHED: System was paused for 00:00:01 to ensure console debugging output.
    Jul 8 14:18:27 CEST: %C6KPWR-SP-4-PSOK: power supply 1 turned on.
    Jul 8 14:18:27 CEST: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
    Jul 8 14:18:28 CEST: %C6K_PLATFORM-2-PEER_RESET: RP is being reset by the SP

    %Software-forced reload

    14:18:28 CEST Tue Jul 8 2014: Breakpoint exception, CPU signal 23, PC = 0x411CAAF4

    ——————————————————————–
    Possible software fault. Upon reccurence, please collect
    crashinfo, “show tech” and contact Cisco Technical Support.
    ——————————————————————–

    -Traceback= 411CAAF4 411C8648 40A088C4 40A09F48 40D95984 40D95ADC 411BD96C
    $0 : 00000000, AT : 43C70000, v0 : 45670000, v1 : 00000000
    a0 : 50AB8E20, a1 : 00000005, a2 : 00000000, a3 : 00000000
    t0 : 411BDAB0, t1 : 34008101, t2 : 411BDAD8, t3 : FFFF00FF
    t4 : 411BDAB0, t5 : 500122E8, t6 : 00000001, t7 : B62DCFEF

    s0 : 00000000, s1 : 43940000, s2 : 00000000, s3 : 438B0000
    s4 : 438B0000, s5 : 438B0000, s6 : 42DC0000, s7 : 00000001
    t8 : 5001234C, t9 : 00000000, k0 : 00000000, k1 : 00000000
    gp : 43C7A1AC, sp : 50012418, s8 : 42DC0000, ra : 411C8648
    EPC : 411CAAF4, ErrorEPC : FB3BE8C9, SREG : 34008103
    MDLO : 00000000, MDHI : 00000000, BadVaddr : 00000000
    DATA_START : 0x43743070
    Cause 00000C24 (Code 0x9): Breakpoint exception
