Summary: Reboot with no reason

From: Walse Chen <walsec_at_hotmail.com> Date: Fri Feb 01 2002 - 13:55:27 EST

Many thanks to Moore, L. Bryan, Joe Fletcher, Eduardo Sanchez M., Shawn 
Russell, Tim Chipman, Hindley Nick, Smith Cathy-CSMITH4, Steve Beuttel, 
Edward Scown, Chris Keladis, Vlad, Buddy Lumpkin, Rick McKinney.

My original question was:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The machine rebooted by itself yesterday, with no clue in the log files.
I have a crash dump in /var/crash/<machine name>/vmcore.1, so I ran
adb -k unix vmcore.1
$<msgbuf

and got
_______________________________________________________
........
WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Data access at TL=0, errID 0x00013161.1f9f27e0
    AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.f9569c00
    AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10072e18
    UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
    UDBH Syndrome 0x3 Memory Module Board 0 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800
WARNING: [AFT1] errID 0x00013161.1f9f27e0 Syndrome 0x3 indicates that this may not be a memory module problem
[AFT2] errID 0x00013161.1f9f27e0 PA=0x00000000.f9569c00
    E$tag 0x00000000.08401f2a E$State: Shared E$parity 0x04
[AFT2] E$Data (0x00): 0x65b631b8.20000000 *Bad* PSYND=0xff00
[AFT2] E$Data (0x08): 0x122e7c40.11d2fbc0
[AFT2] E$Data (0x10): 0x1243fc00.1243fc00
[AFT2] E$Data (0x18): 0x00000000.00f72000
[AFT2] E$Data (0x20): 0x00000000.00000000
[AFT2] E$Data (0x28): 0x00000000.00000000
[AFT2] E$Data (0x30): 0x00000000.0006eabd
[AFT2] E$Data (0x38): 0x02020000.00000000
WARNING: [AFT1] CP event on CPU5 (caused Data access error on CPU1), errID 0x00013161.1f9f27e0
    AFSR 0x00000000.01000800<CP> AFAR 0x00000000.f9569c00
    AFSR.PSYND 0x0800(Score 95) AFSR.ETS 0x00
    UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
[AFT2] errID 0x00013161.1f9f27e0 PA=0x00000000.f9569c00
    E$tag 0x00000000.19401f2a E$State: Owner E$parity 0x0c
[AFT2] E$Data (0x00): 0x65b631b8.20000000 *Bad* PSYND=0x0800
[AFT2] E$Data (0x08): 0x122e7c40.11d2fbc0
[AFT2] E$Data (0x10): 0x1243fc00.1243fc00
[AFT2] E$Data (0x18): 0x00000000.00f72000
[AFT2] E$Data (0x20): 0x00000000.00000000
[AFT2] E$Data (0x28): 0x00000000.00000000
[AFT2] E$Data (0x30): 0x00000000.0006eabd
[AFT2] E$Data (0x38): 0x02020000.00000000
panic[cpu1]/thread=0x63f03ba0: [AFT1] errID 0x00013161.1f9f27e0 UE Error(s)
    See previous message(s) for details
syncing file systems... 2 2 2panic[cpu1]/thread=0x30053e80: panic sync timeout
---------------------------------------------------------------------
I can't tell from the above output whether CPU1 or memory made the system
panic. Can anyone tell me what I should do next to determine what caused
the panic?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Steve Beuttel pointed out the cause, so I am copying his answer below:

If this is a 400MHz or faster CPU, then I believe this means CPU5's ecache
lost address data that hosed an address location in the RAM on Board 0,
where CPU1 later tried to access it. What happens is that the access indexes
a location (no longer in the ecache) that is gone and so an out of bounds
read or write results, causing the panic. It may happen in 5 minutes or not
again for months. I would at least get CPU5 replaced (it's that old ecache
problem). They'll want to wait until it happens again, but this is classic.
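
To see which CPUs are tied to the same fault, it can help to group the
[AFT1] event lines by their errID. The following is only a rough sketch in
Python, not a Sun tool: the regular expression and the helper name
group_events are assumptions made for illustration, and the input is just
the three [AFT1] warning lines from the output above with the line wrapping
undone.

```python
import re
from collections import defaultdict

# The [AFT1] warning lines from the message buffer above, unwrapped.
MSGBUF = """\
WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Data access at TL=0, errID 0x00013161.1f9f27e0
WARNING: [AFT1] errID 0x00013161.1f9f27e0 Syndrome 0x3 indicates that this may not be a memory module problem
WARNING: [AFT1] CP event on CPU5 (caused Data access error on CPU1), errID 0x00013161.1f9f27e0
"""

# Hypothetical pattern: "<event> on CPU<n> ... errID <id>" inside an [AFT1] line.
EVENT_RE = re.compile(
    r"\[AFT1\] (?P<event>.+?) on CPU(?P<cpu>\d+)\b.*?errID (?P<errid>0x[0-9a-f.]+)"
)


def group_events(text):
    """Return {errID: [(cpu, event), ...]} for each [AFT1] line naming a CPU."""
    events = defaultdict(list)
    for line in text.splitlines():
        m = EVENT_RE.search(line)
        if m:
            events[m.group("errid")].append((int(m.group("cpu")), m.group("event")))
    return dict(events)


if __name__ == "__main__":
    for errid, hits in group_events(MSGBUF).items():
        print(errid)
        for cpu, event in hits:
            print(f"  CPU{cpu}: {event}")
```

Here both the Uncorrectable Memory Error on CPU1 and the CP event on CPU5
fall under the same errID, which matches the reading above that CPU5 is the
likely source and CPU1 only the victim.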

Thank you all.
