SUMMARY: Uncorrectable Memory Error on CPU0

From: Michael Hase <michael_at_six.de>
Date: Mon Jul 23 2001 - 10:11:00 EDT

Our problem was:

> tonight our E250 (2x400MHz) running Solaris 7 crashed with a kernel
> panic. The box was patched just this week with kernel patch 106541-16,
> could this be a reason? Or should we replace the cpus?
> 
> Another problem on the box: since a power loss some weeks ago the
> external disk(s) on one fast wide scsi channel (dual glm controller)
> reduce their transfer rate from 40 to 20mb/sec. This occurs after load
> on the scsi bus, sometimes only one disk, sometimes both. We already
> changed the cable. Any ideas?

Bob Rahe and Tim Chipman suggested that it is the well-known ecache
error on CPU0 (which we had only heard of through rumours until now),
and that Sun might replace the cpu on the second occurrence of the
problem. Christer Eriksson's mail said much the same; he also
graciously sent the explorer scripts (thank you).

Another suggestion (Nick Hindley) was to check the output of prtdiag
(we already did) and to try reseating the disks (we will when the
system can be brought down).

Scott Kulp suggested checking the memory DIMMs, the scsi cables
(already done) and the terminators (we will).

Mike DeMarco mentioned the scsi controller (which we suspect too) or,
as another possibility, the power supply (noise on the 5 volt power
rail).

It seems that the two problems are not directly related. For the cpu
problem, we'll wait for a second occurrence, because we can tolerate
the downtime of a crash and reboot. For the scsi problem, we will try
swapping every component in the chain; there are two similar channels
and only one has the problem, so the fault should be easy to isolate.
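
For anyone who wants to put numbers on the comparison between the two
channels after each swap, here is a minimal sketch. It assumes Python
is available on the box, which was not part of our setup; the function
names, the 0.75 threshold and the device paths in the usage note are
all hypothetical. It measures sequential read throughput, which is
only a rough stand-in for the negotiated scsi transfer rate.

```python
import time


def read_throughput_mb_s(path, block_size=1024 * 1024,
                         max_bytes=256 * 1024 * 1024):
    """Sequentially read up to max_bytes from path; return MB/s (10^6 bytes)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives raw, unbuffered reads from Python's side; the OS
    # may still cache, so read from a raw device or a large cold file.
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(min(block_size, max_bytes - total))
            if not chunk:
                break  # end of file or device reached
            total += len(chunk)
    elapsed = time.perf_counter() - start
    if total == 0:
        return 0.0
    # Guard against a zero interval on very small, fast reads.
    return total / max(elapsed, 1e-9) / 1e6


def flag_slow(rates, ratio=0.75):
    """Return the names whose rate is below ratio * best rate, sorted.

    ratio is an arbitrary illustrative threshold, not a measured value.
    """
    if not rates:
        return []
    best = max(rates.values())
    return sorted(name for name, r in rates.items() if r < ratio * best)
```

Usage would be something like calling read_throughput_mb_s() once per
disk (for example on hypothetical raw devices such as
/dev/rdsk/c1t0d0s2 and /dev/rdsk/c2t0d0s2), collecting the results in
a dict and passing it to flag_slow() after each component swap.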

Thanks all,
Michael

-- 
Michael Hase                   Six Offene Systeme GmbH
michael@six.de                 Sielminger Str. 63
http://www.six.de              70771 Leinfelden-Echterdingen
phone +49 711 99091 62         Germany