I was asked recently to share some knowledge about the support of the Cisco 6500 switches as the information available on the DOC-CD could be fairly overwhelming.
As it happens, a client's Cisco 6509 switch fell over yesterday, and I was called out after it decided it was tired of life and rebooted itself. I’ll go through some of the steps I took to find the root cause. Note that the steps listed here will not find the cause of every possible issue with a 6500 switch, but they can be used as a guideline.
Usually the first thing I do is check the reason for the reboot with a “sh version”. Look at the “System returned to ROM” line.
ndcbbnpendc0103#sh ver
Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(18)SXF6, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2006 by cisco Systems, Inc.
Compiled Mon 18-Sep-06 23:32 by tinhuang
Image text-base: 0x40101040, data-base: 0x42D90000
ROM: System Bootstrap, Version 12.2(17r)SX5, RELEASE SOFTWARE (fc1)
BOOTLDR: s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(18)SXF6, RELEASE SOFTWARE (fc1)
ndcbbnpendc0103 uptime is 3 hours, 23 minutes
Time since ndcbbnpendc0103 switched to active is 3 hours, 22 minutes
System returned to ROM by s/w reset at 00:14:27 PDT Wed Sep 20 2006 (SP by bus error at PC 0x402DC89C, address 0x0)
System restarted at 09:13:44 ZA Wed Mar 10 2010
System image file is "disk0:s72033-adventerprisek9_wan-mz.122-18.SXF6.bin"
It is clear that the switch did a software reset caused by ‘bus error at PC 0x402DC89C, address 0x0’.
A system encounters a bus error when the processor tries to access a memory location that either does not exist (software) or does not respond properly (hardware). The memory location that this router tried to access was ‘0x0’. Do not confuse this with the program counter (PC) value above. Once you have the address that was accessed when the bus error occurred, the “show region” command can be used to determine which memory region that address corresponds to.
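If you collect these outputs often, pulling the PC and the accessed address out of the “System returned to ROM” line can be scripted. This is only a minimal Python sketch: the regular expression is my own assumption based on the single line shown above, and the function and field names are made up for illustration. Other IOS versions may word the line differently.

```python
import re

# Pattern for the "System returned to ROM by ..." line of "show version".
# It is modelled on the one line shown in this post; the group names
# (cause, processor, pc, address) are labels chosen for this sketch.
RELOAD_RE = re.compile(
    r"System returned to ROM by (?P<cause>.+?) at .*?"
    r"\((?P<processor>\w+) by bus error at PC (?P<pc>0x[0-9A-Fa-f]+), "
    r"address (?P<address>0x[0-9A-Fa-f]+)\)"
)


def parse_bus_error(show_version_text):
    """Return the reload details as a dict, or None if no bus error line is found."""
    match = RELOAD_RE.search(show_version_text)
    if match is None:
        return None
    return {
        "cause": match.group("cause"),
        "processor": match.group("processor"),
        # Keep the PC and the accessed address apart: they are different things.
        "pc": int(match.group("pc"), 16),
        "address": int(match.group("address"), 16),
    }


line = (
    "System returned to ROM by s/w reset at 00:14:27 PDT Wed Sep 20 2006 "
    "(SP by bus error at PC 0x402DC89C, address 0x0)"
)
info = parse_bus_error(line)
print(info["processor"], hex(info["pc"]), hex(info["address"]))
```

The value to carry forward to “show region” is `address`, not `pc`.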
If the address falls within one of the ranges in the “show region” output, it means that the router was accessing a valid memory address, but the hardware corresponding to that address is not responding properly. This would indicate a hardware problem.
If the address reported by the bus error does not fall within the ranges displayed in the “show region” output, it means that the router was trying to access an address that is not valid. This indicates that it is a Cisco IOS Software problem. From the output below it is clear that ‘0x0’ does not fall within any memory region.
ndcbbnpendc0103#sh region
Region Manager:
Start      End        Size(b)   Class Media Name
0x08000000 0x0BFFFFFF 67108864  Iomem R/W   iomem
0x08B69D40 0x08C77813 1104596   Criti R/W   iomem:Critical I/O
0x40000000 0x4BFFFFFF 201326592 Local R/W   main
0x40101040 0x42D8FFFF 46723008  IText R/O   main:text
0x42D90000 0x430A83BF 3244992   IData R/W   main:data
0x430A83C0 0x44AAE4DF 27287840  IBss  R/W   main:bss
0x44AAE4E0 0x4BFFFFFF 123018016 Local R/W   main:heap
0x50000000 0x7FFF7FFF 805273600 Local R/W   more heap
0x52A11DC0 0x538A42EB 15279404  Criti R/W   more heap:Critical Processor
0x80000000 0x8BFFFFFF 201326592 Local R/W   main:(main_k0)
0xA0000000 0xABFFFFFF 201326592 Local R/W   main:(main_k1)
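The range check described above is easy to get wrong by eye when the addresses are long hex values, so here is a rough Python sketch of it. The region rows are copied from the output above; the function names and the row pattern are my own assumptions for illustration, not a Cisco tool.

```python
import re

# Rows copied from the "show region" output in this post.
SHOW_REGION = """\
Start      End        Size(b)   Class Media Name
0x08000000 0x0BFFFFFF 67108864  Iomem R/W   iomem
0x08B69D40 0x08C77813 1104596   Criti R/W   iomem:Critical I/O
0x40000000 0x4BFFFFFF 201326592 Local R/W   main
0x40101040 0x42D8FFFF 46723008  IText R/O   main:text
0x42D90000 0x430A83BF 3244992   IData R/W   main:data
0x430A83C0 0x44AAE4DF 27287840  IBss  R/W   main:bss
0x44AAE4E0 0x4BFFFFFF 123018016 Local R/W   main:heap
0x50000000 0x7FFF7FFF 805273600 Local R/W   more heap
0x52A11DC0 0x538A42EB 15279404  Criti R/W   more heap:Critical Processor
0x80000000 0x8BFFFFFF 201326592 Local R/W   main:(main_k0)
0xA0000000 0xABFFFFFF 201326592 Local R/W   main:(main_k1)
"""

# Start, End, Size(b), Class, Media, Name. The name may contain spaces.
ROW_RE = re.compile(
    r"^(0x[0-9A-Fa-f]+)\s+(0x[0-9A-Fa-f]+)\s+\d+\s+\S+\s+\S+\s+(.+)$"
)


def parse_regions(text):
    """Turn "show region" rows into (start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:  # the header row does not match and is skipped
            start, end, name = match.groups()
            regions.append((int(start, 16), int(end, 16), name.strip()))
    return regions


def regions_containing(address, regions):
    """Names of every region whose range includes the address."""
    return [name for start, end, name in regions if start <= address <= end]


def classify_bus_error(address, regions):
    """Apply the rule of thumb from this post to a bus error address."""
    if regions_containing(address, regions):
        return "valid address: suspect hardware"
    return "invalid address: suspect IOS software"


REGIONS = parse_regions(SHOW_REGION)
print(classify_bus_error(0x0, REGIONS))
print(regions_containing(0x402DC89C, REGIONS))
```

The first call checks the address from the bus error (0x0), which falls in no region. The second call checks the PC only to show what a hit looks like: it lands in main and main:text, which is where you would expect code to be executing from.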
The output of the “show stacks” command could then be used to identify the Cisco IOS Software bug that caused the bus error. It might be a bit overwhelming with all the garbage it spits out, but you will get used to the output soon enough. Alternatively you can use Cisco Output Interpreter to decode the output. I have posted the relevant portion here:
<pre>ndcbbnpendc0103#sh stack --omitted--</pre>
Mar 9 17:01:58: %DIAG-SP-6-BYPASS: Module 4: Diagnostics is bypassed
Mar 9 17:01:58: %OIR-SP-6-INSCARD: Card inserted in slot 4, interfaces are now online
Mar 10 09:10:56: %C6K_PLATFORM-SP-2-PEER_RESET: SP is being reset by the RP
%Software-forced reload
Breakpoint exception, CPU signal 23, PC = 0x402DC89C
-Traceback= 402DC89C 402DA828 40435C38 40436DF8 404243B8 40424510 402CF4DC
--omitted--
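When comparing crashes across several switches, it helps to pull the PC and the traceback addresses out of the “show stacks” text so they can be matched against a bug description. A minimal Python sketch, assuming the traceback is a run of 8-digit hex words as in the output above (the function names are hypothetical):

```python
import re


def parse_exception_pc(text):
    """Return the PC from an "exception, CPU signal N, PC = 0x..." line, or None."""
    match = re.search(r"exception, CPU signal \d+, PC = (0x[0-9A-Fa-f]+)", text)
    return int(match.group(1), 16) if match else None


def parse_traceback(text):
    """Return the addresses listed after "-Traceback=" as a list of ints."""
    match = re.search(r"-Traceback=\s*((?:[0-9A-Fa-f]{8}\s*)+)", text)
    if match is None:
        return []
    return [int(word, 16) for word in match.group(1).split()]


# Relevant portion of the "show stacks" output from this post.
STACK_OUTPUT = """\
%Software-forced reload
Breakpoint exception, CPU signal 23, PC = 0x402DC89C
-Traceback= 402DC89C 402DA828 40435C38 40436DF8 404243B8 40424510 402CF4DC
--omitted--
"""

pc = parse_exception_pc(STACK_OUTPUT)
frames = parse_traceback(STACK_OUTPUT)
print(hex(pc), [hex(frame) for frame in frames])
```

In this crash the first traceback entry is the same value as the PC reported in the “sh version” output, which is a quick sanity check that you are looking at the right crash.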
If you Google the C6K_PLATFORM-SP-2-PEER_RESET error, you should find the following:
Condition: Relates to WS-SUP720-3B running Cisco IOS Release 12.2(18)SXF2. The trigger for the crash is unknown.
Workaround: There is no workaround.
What have we established so far?
- A system bus error occurred when the processor tried accessing something that doesn’t exist. This points to a bug in the IOS version the switch is currently running.
- According to the bug description the trigger that caused the crash is unknown. And there is no published workaround.
Where to from here? It is safe to assume that an IOS upgrade should rectify the problem, but in production life is not that simple or quick. Upgrading the IOS of an in-production device usually requires a painful change control process, which can take some time.
How do you prevent this from happening again until the IOS upgrade? Well you need to know what triggered the IOS bug causing the 6500 to go belly up. For this the crashinfo is vital. The crashinfo should tell us exactly what happened right before the software reload. Again the output from this can be overwhelming.
Using the command “more {location}:{crashfile}”, you will see a list of commands and logging events that happened. What you are looking for is usually the very last event before the traceback. Look at the last CMD lines before the exception:
ndcbbnpendc0103#more bootflash:crashinfo_20100310-071049
--omitted--
CMD: 'sh crypto isakmp sa vrf vpnafg' 09:10:11 ZA Wed Mar 10 2010
CMD: ' sh run int vlan1188' 09:10:34 ZA Wed Mar 10 2010
CMD: 'conf t' 09:10:37 ZA Wed Mar 10 2010
CMD: 'interface Vlan1188' 09:10:42 ZA Wed Mar 10 2010
CMD: 'no crypto map vpnafg-dtt-map redundancy VPNHA' 09:10:49 ZA Wed Mar 10 2010
Address Error (load or instruction fetch) exception, CPU signal 10, PC = 0x42172D1C
-Traceback= 42172D1C 42173324 4217341C 4217348C 42173DE4 4217B710 4217B470 4114FAD8 4113CEC4 4045C740 41047A68 4046AB48 4102F70C 4102F6F8
$0 : 00000000, AT : 430A0000, v0 : 53A8EF94, v1 : 00000000
a0 : 5283DE80, a1 : 291C4A3D, a2 : 0D0D0D0D, a3 : 410156BC
t0 : 00000010, t1 : 00000010, t2 : 00000000, t3 : FFFF00FF
t4 : 41D4CA58, t5 : 458AEDC8, t6 : 458AEDC4, t7 : 458AEDC0
s0 : 5310C4B4, s1 : 483DCC00, s2 : 483DCBF0, s3 : 00000010
s4 : 458C4CA4, s5 : 00000000, s6 : 00000000, s7 : 00000000
t8 : 44AC1848, t9 : 00000000, k0 : 475ACDB0, k1 : 41D52C68
gp : 430AE700, sp : 54327600, s8 : 43800000, ra : 42173324
EPC : 42172D1C, ErrorEPC : BFC2A65C, SREG : 3400F103
MDLO : 00000002, MDHI : 1D59CAA0, BadVaddr : 0D0D0D19
Cause 80000C10 (Code 0x4): Address Error (load or instruction fetch) exception
--omitted--
Here you can clearly see that when the crypto map was removed from interface VLAN-1188, the 6500’s IOS choked. To prevent this, either lock that command via TACACS or instruct everyone not to use that command again until the IOS gets upgraded.
I hope this provides some good insight into how to deal with a 6500 crash :)
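Crashinfo files get long, so finding the last commands before the exception can be scripted once the file has been copied off the switch. This is a rough Python sketch, assuming the CMD lines look like the ones in the excerpt above; the function name and the `count` parameter are made up for illustration.

```python
import re

# CMD: '<command>' <hh:mm:ss timezone weekday month day year>
CMD_RE = re.compile(r"^CMD: '(?P<command>.*)' (?P<time>\d{2}:\d{2}:\d{2} .+)$")


def commands_before_crash(crashinfo_text, count=3):
    """Return the last `count` (time, command) pairs logged before the exception."""
    commands = []
    for raw_line in crashinfo_text.splitlines():
        line = raw_line.strip()
        if "exception, CPU signal" in line:
            break  # everything after this point is the crash dump itself
        match = CMD_RE.match(line)
        if match:
            commands.append((match.group("time"), match.group("command").strip()))
    return commands[-count:]


# Excerpt of the crashinfo file from this post.
CRASHINFO = """\
CMD: 'sh crypto isakmp sa vrf vpnafg' 09:10:11 ZA Wed Mar 10 2010
CMD: ' sh run int vlan1188' 09:10:34 ZA Wed Mar 10 2010
CMD: 'conf t' 09:10:37 ZA Wed Mar 10 2010
CMD: 'interface Vlan1188' 09:10:42 ZA Wed Mar 10 2010
CMD: 'no crypto map vpnafg-dtt-map redundancy VPNHA' 09:10:49 ZA Wed Mar 10 2010
Address Error (load or instruction fetch) exception, CPU signal 10, PC = 0x42172D1C
"""

last_commands = commands_before_crash(CRASHINFO, count=2)
for when, command in last_commands:
    print(when, "|", command)
```

The last pair returned is the command to be suspicious of. It still needs a human to decide whether the command really triggered the crash or just happened to be the last thing typed.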
Great Explanation as always Ruhann!
I see your point SXF6 is a couple of years old. I feel your pain trying to upgrade the old code to Safe Harbor.
thanks buddy :)
To be exact it's from 2006, and riddled with IOS bugs
Nice explanation. Can you please advise how you learnt this information. I don’t believe there is a whitepaper that explains it so simply. There are quite a few other troubleshooting commands which we are always told only Cisco Tac can debug. It would be neat to know and interpret these commands and if there is a resource out there that gives this information it would be very helpful. thx
Thanks :)
Unfortunately there is no magic site I used other than CCO. Would be nice if there was. What I know is what I encountered in the last 4 years. This is the biggest reason I blog, to share knowledge like so many others out there :)
Great article, thanks!
Pls. give me update..
Please advise what update you are waiting for?
Very good info, Thank you very much.
I would like to give you my 6509 crash logs which happened recently, can you please help me to understand it.
———————————————-
*Jun 27 18:40:53: %FABRIC-SP-5-CLEAR_BLOCK: Clear block option is off for the fabric in slot 5.
*Jun 27 18:40:53: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 5 became active.
*Jun 27 18:40:54: %CPU_MONITOR-3-PEER_EXCEPTION: CPU_MONITOR peer has failed due to exception , reset by [5/0]
%Software-forced reload
Early Notification of crash condition..
18:40:55 GMT Mon Jun 27 2011: Breakpoint exception, CPU signal 23, PC = 0x428142FC
———————————————-
Can you please help me to identify the reason of crash please… I tried to understand but i fail :(
Thanks,
Manjunath
pleasure.
send me an email, with the following details in a txt file and I will have a look :)
#sh ver
#sh log
#sh stack
#sh region
#more bootflash:{crash-file}
What can I do? I have a core 6500 that loads IOS when powered on and then reloads into ROMMON mode.
boot bootflash:s72033-ipbase-mz.151-1.SY.bin
Router#
Router#
Router#
Router#
Router#
Router#
%Software-forced reload
Early Notification of crash condition..
00:01:36 UTC Sat Jan 1 2000: Breakpoint exception, CPU signal 23, PC = 0x4143178C
——————————————————————–
Possible software fault. Upon reccurence, please collect
crashinfo, “show tech” and contact Cisco Technical Support.
——————————————————————–
-Traceback= 4143178C 4142FC18 40DF08BC 42057828 420594A8 40FBD95C 40FBDB90 41424524
$0 : 00000000, AT : 43DC0000, v0 : 00000000, v1 : 00000000
a0 : 510E2D2C, a1 : 43225ADD, a2 : 00000000, a3 : 00000000
t0 : 405D6720, t1 : 3400F101, t2 : 3400C100, t3 : FFFF00FF
t4 : 45880000, t5 : 2E000000, t6 : 00000000, t7 : FFFFFFFF
s0 : 00000000, s1 : 43AD0000, s2 : 472F1530, s3 : 43A10000
s4 : 471AC89C, s5 : 00000002, s6 : 00000000, s7 : 00000000
t8 : 00000000, t9 : 00000000, k0 : 00000000, k1 : 00000000
gp : 43DC6A70, sp : 5000DCB0, s8 : 00000000, ra : 4142FC18
EPC : 4143178C, ErrorEPC : BFC00000, SREG : 3400F103
MDLO : 000043CC, MDHI : 00000000, BadVaddr : 00000000
TEXT_START : 0x4010F758
DATA_START : 0x4389E740
Cause 00000824 (Code 0x9): Breakpoint exception
Writing crashinfo to bootflash:crashinfo_RP_20000101-000136-UTC
=== Flushing messages (00:01:36 UTC Sat Jan 1 2000) ===
Buffered messages:
*Jan 1 00:00:21.511: %IFMGR-7-NO_IFINDEX_FILE: Unable to open nvram:/ifIndex-table No such file or dire
*Jan 1 00:00:22.327: RP: Currently running ROMMON from S (Gold) region
*Jan 1 00:00:48.791: %SYS-5-RESTART: System restarted —
Cisco IOS Software, s72033_rp Software (s72033_rp-IPBASE-M), Version 15.1(1)SY, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2012 by Cisco Systems, Inc.
Compiled Tue 09-Oct-12 15:13 by prod_rel_team
*Jan 1 00:00:08.303: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
*Jan 1 00:00:19.915: SP: SP: Currently running ROMMON from S (Gold) regionin
Loading image, pl
*Jan 1 00:00:25.427: %C6K_PLATFORM-SP-4-CONFREG_BREAK_ENABLED: The default factory setting for config register is 0x2102.It is advisable to retain 1 in 0x2102 as it prevents returning to ROMMON when break is issued.
*Jan 1 00:00:47.999: %SYS-SP-5-RESTART: System restarted —
Cisco IOS Software, s72033_sp Software (s72033_sp-IPBASE-M), Version 15.1(1)SY,
Stack pointer : 0x8FFFFF80
monra : 0xBFC26C54
edata :
memsize : 0x10000000
comp_size : 0x05B38A00
uncomp_checksum : 0xDA7140AB
*Jan 1 00:00:49.339: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
*Jan 1 00:00:49.543: %C6KPWR-SP-4-PSREDUNDANTBOTHSUPPLY: in power-redundancy mode, system is operating on both power supplies.
*Jan 1 00:00:57.383: %FABRIC-SP-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 6 became active.
*Jan 1 00:00:59.779: %DIAG-SP-6-RUN_MINIMUM: Module 6: Running Minimal Diagnostics…
*Jan 1 00:01:22.127: %SYS-5-CONFIG_I: Configured from console by console
Queued messages:
*Jan 1 00:01:36.891: %SYS-3-LOGGER_FLUSHING: System pausing to ensure console debugging output.
*Jan 1 00:01:36.819: %CPU_MONITOR-3-PEER_EXCEPTION: CPU_MONITOR peer has failed due to exception , reset by [6/0]
*** System received a Software forced crash ***
signal= 0x17, code= 0x24, context= 0x458822bc
PC = 0x41424d4c, SP = 0x43acbf18, RA = 0x41432374
Cause Reg = 0x00003820, Status Reg = 0x34008002
rommon 1 >
rommon 1 >
rommon 1 >
rommon 1 >
A couple of things to check:
– Confirm the 15.1 code you are loading is compatible with your Sup module
– What is the config-register set to?
– Have a look at the crash information file (more file:)
Thanks. It was very helpful :)
Hi, I have faced a similar issue in my organisation
XXXXXXXX uptime is 1 hour, 3 minutes
Uptime for this control processor is 1 hour, 4 minutes
Time since XXXXXXXXXX switched to active is 1 hour, 2 minutes
System returned to ROM by s/w reset at 14:18:34 CEST Tue Jul 8 2014 (SP by bus error at PC 0x418A409C, address 0xEF4321CD)
System restarted at 14:22:07 CEST Tue Jul 8 2014
System image file is “sup-bootdisk:s72033-ipbasek9-mz.122-33.SXI6.bin”
last known command executed before software reload
CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc addrgroup DHCP_SERVERS eq bootps ‘ 14:16:58 CEST Tue Jul 8 2014
CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc addrgroup DHCP_SERVERS eq bootps ‘ 14:16:58 CEST Tue Jul 8 2014
CMD: ‘no 5’ 14:17:37 CEST Tue Jul 8 2014
CMD: ‘no 5’ 14:17:37 CEST Tue Jul 8 2014
CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc any eq bootps ‘ 14:17:48 CEST Tue Jul 8 2014
CMD: ‘5 permit udp 192.168.254.0 0.0.0.255 eq bootpc any eq bootps ‘ 14:17:48 CEST Tue Jul 8 2014
CMD: ‘no 5’ 14:18:24 CEST Tue Jul 8 2014
Jul 8 14:18:24 CEST: %VSLP-SW1_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te1/5/4: Link down
Jul 8 14:18:24 CEST: %VSL-SW1_SPSTBY-5-VSL_CNTRL_LINK: New VSL Control Link Te1/5/5
Jul 8 14:18:24 CEST: %VSLP-SW1_SPSTBY-3-VSLP_LMP_FAIL_REASON: Te1/5/5: Link down
Jul 8 14:18:24 CEST: %VSLP-SW1_SPSTBY-2-VSL_DOWN: Last VSL interface Te1/5/5 went down
Jul 8 14:18:24 CEST: %SATVS_IBC-SW1_SPSTBY-5-VSL_DOWN_SCP_DROP: VSL inactive – dropping cached SCP packet: (SA/DA:0x4/0x4, SSAP/DSAP:0x2/0x1, OP/SEQ:0x1E/0xFFF0, SIG/INFO:0x1/0x501, eSA:0000.0500.0000)
Jul 8 14:18:25 CEST: %VSLP-SW1_SPSTBY-2-VSL_DOWN: All VSL links went down while switch is in Standby role
Jul 8 14:18:25 CEST: %DUAL_ACTIVE-SW1_SPSTBY-1-VSL_DOWN: VSL is down – switchover, or possible dual-active situation has occurred
Jul 8 14:18:25 CEST: %PFREDUN-SW1_SPSTBY-6-ACTIVE: Initializing as Virtual Switch ACTIVE processor
Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 10.213.197.22 port 514 started – CLI initiated
Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 10.213.197.23 port 514 started – CLI initiated
Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 193.28.206.208 port 514 started – CLI initiated
Jul 8 14:18:26 CEST: %SYS-6-LOGGINGHOST_STARTSTOP: Logging to host 193.28.206.209 port 514 started – CLI initiated
Jul 8 14:18:26 CEST: %SYS-SW1_SPSTBY-3-LOGGER_FLUSHED: System was paused for 00:00:01 to ensure console debugging output.
Jul 8 14:18:27 CEST: %C6KPWR-SP-4-PSOK: power supply 1 turned on.
Jul 8 14:18:27 CEST: %C6KPWR-SP-4-PSOK: power supply 2 turned on.
Jul 8 14:18:28 CEST: %C6K_PLATFORM-2-PEER_RESET: RP is being reset by the SP
%Software-forced reload
14:18:28 CEST Tue Jul 8 2014: Breakpoint exception, CPU signal 23, PC = 0x411CAAF4
——————————————————————–
Possible software fault. Upon reccurence, please collect
crashinfo, “show tech” and contact Cisco Technical Support.
——————————————————————–
-Traceback= 411CAAF4 411C8648 40A088C4 40A09F48 40D95984 40D95ADC 411BD96C
$0 : 00000000, AT : 43C70000, v0 : 45670000, v1 : 00000000
a0 : 50AB8E20, a1 : 00000005, a2 : 00000000, a3 : 00000000
t0 : 411BDAB0, t1 : 34008101, t2 : 411BDAD8, t3 : FFFF00FF
t4 : 411BDAB0, t5 : 500122E8, t6 : 00000001, t7 : B62DCFEF
s0 : 00000000, s1 : 43940000, s2 : 00000000, s3 : 438B0000
s4 : 438B0000, s5 : 438B0000, s6 : 42DC0000, s7 : 00000001
t8 : 5001234C, t9 : 00000000, k0 : 00000000, k1 : 00000000
gp : 43C7A1AC, sp : 50012418, s8 : 42DC0000, ra : 411C8648
EPC : 411CAAF4, ErrorEPC : FB3BE8C9, SREG : 34008103
MDLO : 00000000, MDHI : 00000000, BadVaddr : 00000000
DATA_START : 0x43743070
Cause 00000C24 (Code 0x9): Breakpoint exception
Great presentation.