Sorry for a second summary, but this one was good, so I thought I would forward it along as well.

Thank you for the great explanation, Buddy. I'm sure a lot of people will benefit from this knowledge.

- Dan

----- Forwarded by Dan Kelley/IC/SSMHC on 04/26/2002 02:20 PM -----

"Lumpkin, Buddy" <Buddy.Lumpkin@nordstrom.com>
04/25/2002 03:50 PM

To: "'Dan_Kelley@ssmhc.com'" <Dan_Kelley@ssmhc.com>
cc:
Subject: RE: ecache parity error?

Hi,

I used to repair Sun systems to component level, and I would like to make a distinction ...

There is the famed E-Cache error that you can read about on ZDnet and other news sources and that everyone knows about. It affects mid-range to high-end Sun servers (E3500 and later) running 400+ MHz CPUs. The problem is intermittent E-Cache errors on otherwise perfectly good CPUs under certain circumstances.

Sun has addressed this with two solutions. The first solution, after repeated problems, is to replace the modules with modules from a different manufacturer. Sun calls these CTO modules. The next step, if you still experience problems, is to replace those modules with special ones. The special modules are "hacked" in such a way that the E-Cache chips are actually mirrored. The effect is that if an error occurs on a read from one chip, the read is retried from the mirror. Sun code-named these "Sombra" modules. With customers they refer to them as the "Mirrored E-Cache" modules.

The second part of the distinction is that E-Cache errors are a very common way that CPU chip errors manifest. The 4-400 VME style Sun systems actually had separate chips for E-Cache, along with an individual MMU, Page Map, Seg Map, Region Map, Integer Unit (the heart of what we call a CPU these days), and a floating point unit. It's the last of its kind. Any modern system has one big monolithic CPU with all of these other parts built in.
Well, that's not entirely true: the E-Cache is still external, but it sits on the module that you plug into the board. These are still among the more common parts on the board to fail because they are the most expensive parts. E-Cache is usually the fastest memory that you can buy on the market (6 nanosecond access times or better these days), so it is designed to run on the bleeding edge.

The E-Cache errors you're experiencing on your Ultra 5 or 10 are in fact a symptom of a failing CPU, but are not the famed E-Cache blunder made by Sun that everyone talks about.

Sorry for the long-winded digression.

--Buddy

-----Original Message-----
From: Dan_Kelley@ssmhc.com [mailto:Dan_Kelley@ssmhc.com]
Sent: Wednesday, April 24, 2002 10:35 AM
To: sunmanagers@sunmanagers.org
Subject: ecache parity error?

Hello, all.

We have a machine that keeps crashing, and I think it is the ecache parity error. I have been waiting for it to happen again before I sent an e-mail to this list, though. Could anyone look at this and tell me if they think it is the ecache error? If not, any clues as to what it is?

Thanks in advance! I will summarize.
- Dan

uname -a:
SunOS netdev 5.8 Generic_108528-14 sun4u sparc SUNW,Ultra-5_10

I have tracked two crashes; here is the info for the first one (note that they are slightly different):

echo '$c' | adb -k unix.1 vmcore.1:
physmem 173a7
panicsys(104234b0,1040c198,10050068,78002000,57542400,c) + 44
vpanic(10050068,1040c198,16e76a3d8cac,10,30000689ea8,30000068438) + cc
panic(10050068,804,1,1041a798,fffd,20) + 1c
sync_handler(1041a980,10400000,0,0,0,2) + 150
prom_rtt(10000000,16,f0000000,16e7332a6da9,0,2)
client_handler(f0066d2c,2a10007d6e8,1,104283d8,1,1041a980) + 2c
prom_enter_mon(0,6,b,2a10004bd40,2a10007dd40,0) + 28
debug_enter(0,16e73315c8c5,16e73315c8c9,0,30000ddf1e8,0) + d0
kbdinput(1045a400,4d,30000689d68,300001b5000,0,1013dd4c) + 304
kbdrput(30000adabe8,30000f7e340,30000ad3a98,30000f7e340,30000689d68,30000ad3a20) + 13c
putnext(30000adae48,30000ad9a90,30000adb0a8,30000f7e340,0,0) + 1cc
async_softint(30000f7e340,1,ffff,20000,0,30000adae48) + 568
asysoftintr(3000017a008,30000b7e000,1,2a10007dd40,10180,1026fba8) + 70
intr_thread(2a10001fd40,1041b180,10423890,10423890,0,0) + a4
idle(1040f864,0,0,1041b180,3000005d6c8,0) + 54
thread_start(0,0,0,0,0,0) + 4

/var/adm/messages from this one:

Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 932869 kern.warning] WARNING: [AFT1] EDP event on CPU0 Data access at TL=0, errID 0x00015289.afcae2ba
Apr 12 17:59:18 netdev AFSR 0x00000000.80400080<PRIV,EDP> AFAR 0x00000000.3d41fa68
Apr 12 17:59:18 netdev AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00 Fault_PC 0x10031cc8
Apr 12 17:59:18 netdev UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 683009 kern.info] [AFT2] errID 0x00015289.afcae2ba PA=0x00000000.3d41fa68
Apr 12 17:59:18 netdev E$tag 0x00000000.0003cf50 E$State: Modified E$parity 0x03 Badlines found=6
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.10041eb0
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.10041eb4
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.0247e008
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x00000000.10423890
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.10041eb0
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 989652 kern.info] [AFT2] E$Data (0x28): 0x80000000.00000000 *Bad* PSYND=0x0080
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000000.00000000
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0x000002a1.000b7d20
Apr 12 17:59:18 netdev SUNW,UltraSPARC-IIi: [ID 601312 kern.info] [AFT2] errID 0x00015289.afcae2ba AFAR was derived from E$Tag
Apr 12 17:59:18 netdev unix: [ID 836849 kern.notice]
Apr 12 17:59:18 netdev ^Mpanic[cpu0]/thread=2a10007dd20:
Apr 12 17:59:18 netdev unix: [ID 455523 kern.notice] [AFT1] errID 0x00015289.afcae2ba EDP Error(s)
Apr 12 17:59:18 netdev See previous message(s) for details
Apr 12 17:59:18 netdev unix: [ID 100000 kern.notice]
Apr 12 17:59:18 netdev genunix: [ID 723222 kern.notice] 000002a10007d200 SUNW,UltraSPARC-IIi:cpu_aflt_log+4e0 (2a10007d2be, 1, 101483a0, 2a10007d448, 2a10007d30b, 101483c8)
Apr 12 17:59:19 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 000002a10007d510 0000000000000003 0000000000000010
Apr 12 17:59:19 netdev %l4-7: 0000000000200000 0000000000400000 0000000000000000 000002a10001f9c0
Apr 12 17:59:19 netdev genunix: [ID 723222 kern.notice] 000002a10007d450 SUNW,UltraSPARC-IIi:cpu_async_error+868 (1, 2a10007d510, 80400080, 0, 640000080400080, 2a10007d6d0)
Apr 12 17:59:19 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000000000000001 0000000000000032 0000000000000000 0000000000000000
Apr 12 17:59:19 netdev %l4-7: 0000000000000219 0000000000000000 000003000005d748 0000000000000000
Apr 12 17:59:19 netdev genunix: [ID 723222 kern.notice] 000002a10007d620 unix:prom_rtt+0 (300001b2000, 8000000000000000, a, a, 0, 0)
Apr 12 17:59:19 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000000000000001 0000000000001400 0000000000001600 000000001013fb54
Apr 12 17:59:19 netdev %l4-7: 0000030000697ea0 0000000000000001 000000000000000a 000002a10007d6d0
Apr 12 17:59:19 netdev genunix: [ID 723222 kern.notice] 000002a10007d770 genunix:callout_schedule_1+4 (300001b2000, 10443508, 300001b5000, 10072cf4, 0, 101424b0)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000000000000008 0000000000000002 0000000000000001 000000001041b718
Apr 12 17:59:20 netdev %l4-7: 000000001041b338 0000000000000016 000000001041baf8 000002a10007d7b0
Apr 12 17:59:20 netdev genunix: [ID 723222 kern.notice] 000002a10007d820 genunix:callout_schedule+54 (104391fc, 1, 10439178, 8, 1, 300000683c8)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice] %l0-3: 00000000100d312c 0000030000cec000 0000030000d79602 0000030000cec000
Apr 12 17:59:20 netdev %l4-7: 000003000188f040 0000000000000000 000003000148af00 000002a10051dba0
Apr 12 17:59:20 netdev genunix: [ID 723222 kern.notice] 000002a10007d8d0 genunix:clock+474 (1045a800, 1041b338, 1042dc00, 94f476874837, 0, 0)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000000000000001 000002a10007dd20 0000000000000000
Apr 12 17:59:20 netdev %l4-7: 000000001045a000 000000003b9aca00 000000001041baf8 00000000fed3a004
Apr 12 17:59:20 netdev genunix: [ID 723222 kern.notice] 000002a10007d9a0 genunix:cyclic_softint+a4 (1041b338, 30000057928, 1, 3, 30000068478, 10073f0c)
Apr 12 17:59:20 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000030000057930 800000000237f894 0000000000000000 0000030000068478
Apr 12 17:59:20 netdev %l4-7: 00000300000578c8 000003000068dea8 0000000000000000 000003000068ded0
Apr 12 17:59:21 netdev genunix: [ID 723222 kern.notice] 000002a10007da60 unix:cbe_level10+8 (0, 803, 1041b338, 2a10007dd20, 10060, 1000b34c)
Apr 12 17:59:21 netdev genunix: [ID 179002 kern.notice] %l0-3: 00000000102e4934 0000000000000001 0000000000000001 0000030000070ed8
Apr 12 17:59:21 netdev %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Apr 12 17:59:21 netdev unix: [ID 100000 kern.notice]
Apr 12 17:59:21 netdev genunix: [ID 672855 kern.notice] syncing file systems...
Apr 12 17:59:21 netdev genunix: [ID 904073 kern.notice] done
Apr 12 17:59:22 netdev genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t0d0s1, offset 322174976
Apr 12 17:59:22 netdev uata: [ID 606412 kern.warning] WARNING: timeout: reset bus chno = 0 targ = 0
Apr 12 17:59:38 netdev genunix: [ID 409368 kern.notice] ^M100% done: 8116 pages dumped, compression ratio 3.96,
Apr 12 17:59:38 netdev genunix: [ID 851671 kern.notice] dump succeeded

And now for the second crash:

echo '$c' | adb -k unix.0 vmcore.0:
physmem 173a7
panicsys(104234b0,1040c198,10050068,78002000,39ff00,c) + 44
vpanic(10050068,1040c198,faabfb648,10,30000689ea8,30000068438) + cc
panic(10050068,804,1,1041a798,fffd,20) + 1c
sync_handler(1041a980,10400000,0,0,0,2) + 150
prom_rtt(10000000,16,f0000000,f810ca9c6,0,2)
client_handler(f0066d2c,2a10007d6e8,1,104283d8,1,1041a980) + 2c
prom_enter_mon(0,6,b,2a10004bd40,2a10007dd40,0) + 28
debug_enter(0,f80db6987,f80db698a,0,30001092020,0) + d0
kbdinput(1045a400,4d,30000689d68,300001b5000,0,1013dd4c) + 304
kbdrput(30000adabe8,3000108f080,30000ad3a18,3000108f080,30000689d68,30000ad39a0) + 13c
putnext(30000adae48,30000ad9a90,30000adb0a8,3000108f080,0,0) + 1cc
async_softint(3000108f080,1,ffff,20000,0,30000adae48) + 568
asysoftintr(3000017a008,30000b7e000,1,2a10007dd40,10180,1026fba8) + 70
intr_thread(2a10001fd40,1041b180,10423890,10423890,0,0) + a4
idle(1040f864,0,0,1041b180,3000005d6c8,0) + 54
thread_start(0,0,0,0,0,0) + 4

/var/adm/messages leading up to the reboot:

Apr 24 12:20:07 netdev SUNW,UltraSPARC-IIi: [ID 370172 kern.warning] WARNING: [AFT1] EDP event on CPU0 Instruction access at TL=0, errID 0x0001d01e.baad443a
Apr 24 12:20:07 netdev AFSR 0x00000000.004000f0<EDP> AFAR 0xffffffff.ffffffff
Apr 24 12:20:07 netdev AFSR.PSYND 0x00f0(Score 45) AFSR.ETS 0x00 Fault_PC 0x97560
Apr 24 12:20:07 netdev UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Apr 24 12:20:07 netdev SUNW,UltraSPARC-IIi: [ID 798591 kern.info] [AFT2] errID 0x0001d01e.baad443a No error found in ecache (No fault PA available)
Apr 24 12:20:07 netdev unix: [ID 836849 kern.notice]
Apr 24 12:20:07 netdev ^Mpanic[cpu0]/thread=3000165a440:
Apr 24 12:20:07 netdev unix: [ID 424580 kern.notice] [AFT1] errID 0x0001d01e.baad443a EDP Error(s)
Apr 24 12:20:07 netdev See previous message(s) for details
Apr 24 12:20:08 netdev unix: [ID 100000 kern.notice]
Apr 24 12:20:08 netdev genunix: [ID 723222 kern.notice] 000002a1005dd6d0 SUNW,UltraSPARC-IIi:cpu_aflt_log+4e0 (2a1005dd78e, 1, 101483a0, 2a1005dd918, 2a1005dd7db, 101483c8)
Apr 24 12:20:08 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 000002a1005dd9e0 0000000000000003 0000000000000010
Apr 24 12:20:08 netdev %l4-7: 0000000000200000 0000000000400000 0000000000000001 0000000000000080
Apr 24 12:20:08 netdev genunix: [ID 723222 kern.notice] 000002a1005dd920 SUNW,UltraSPARC-IIi:cpu_async_error+868 (1, 2a1005dd9e0, 4000f0, 0, 1400000004000f0, 2a1005ddba0)
Apr 24 12:20:08 netdev genunix: [ID 179002 kern.notice] %l0-3: 0000000000000001 000000000000000a 0000000000000000 0000000000000000
Apr 24 12:20:08 netdev %l4-7: 0000000000004208 0000000000000000 00000000007fbdd0 0000000000000084
Apr 24 12:20:08 netdev unix: [ID 100000 kern.notice]
Apr 24 12:20:08 netdev genunix: [ID 672855 kern.notice] syncing file systems...
Apr 24 12:20:09 netdev genunix: [ID 733762 kern.notice] 1
Apr 24 12:20:10 netdev genunix: [ID 904073 kern.notice] done
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

Received on Fri Apr 26 18:20:33 2002
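
For anyone comparing the two panics, the AFSR lines in the /var/adm/messages output above can be pulled apart with a few lines of Python. This is only a minimal sketch: the function name and regular expression are made up for illustration, and the bit positions for PRIV and EDP and the 16-bit PSYND field are inferred from the two AFSR samples in this thread (0x80400080 is logged as <PRIV,EDP> with PSYND 0x0080, and 0x004000f0 as <EDP> with PSYND 0x00f0). They were not taken from the UltraSPARC-IIi manual, so treat them as assumptions.

```python
import re

# Assumed bit positions, inferred only from the two AFSR samples in this
# thread; check the processor manual before relying on them.
AFSR_FLAGS = {"PRIV": 1 << 31, "EDP": 1 << 22}

# Matches "AFSR 0xHHHHHHHH.HHHHHHHH" but not "AFSR.PSYND ..." lines.
AFSR_RE = re.compile(r"AFSR 0x([0-9a-f]{8})\.([0-9a-f]{8})")


def decode_afsr(line):
    """Return (flag names, parity syndrome) for a syslog AFSR line, or None."""
    match = AFSR_RE.search(line)
    if match is None:
        return None
    # The log prints the 64-bit register as two 32-bit halves joined by a dot.
    value = int(match.group(1) + match.group(2), 16)
    flags = [name for name, bit in AFSR_FLAGS.items() if value & bit]
    psynd = value & 0xFFFF  # low 16 bits, assumed to be AFSR.PSYND
    return flags, psynd


if __name__ == "__main__":
    first = "Apr 12 17:59:18 netdev AFSR 0x00000000.80400080<PRIV,EDP> AFAR 0x00000000.3d41fa68"
    second = "Apr 24 12:20:07 netdev AFSR 0x00000000.004000f0<EDP> AFAR 0xffffffff.ffffffff"
    for sample in (first, second):
        flags, psynd = decode_afsr(sample)
        print(flags, hex(psynd))
```

Both crashes decode to an EDP event. Only the first has PRIV set, which matches the Fault_PC values in the logs (0x10031cc8 versus 0x97560).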