Summary: Memory problems on E450

From: Lennon, Padraig <Padraig.Lennon_at_Pioneerinvest.ie>
Date: Tue May 14 2002 - 08:27:06 EDT
Hi Gurus,
Sorry about the very late summary to this post, but I only got the problem
sorted out this weekend.

It turns out that the 3 DIMMs that I had got to replace the failed ones had
done the rounds. One DIMM, I have since discovered, had been around the loop 3
times, returned for repair and then sent out to another customer.
What I did was to get SUN to pre-test two existing DIMMs for one week, after
which time I got them both sent to me. The second DIMM is a backup in case of
transit damage.
I only got the proven DIMM installed last week and it is fine now.
Keep the second DIMM on-site for a week afterwards for safety.

P.S. Be sure to stress test the memory with SunVTS for about 24 hours after
changing it.
Also ensure that the machine is fully patched.

Thanks to everyone that replied,

Hope this helps
-Padraig 
-----------------Original Post---------------------------

Hi Gurus,
I have a strange problem with one of our E450s. About a month ago I
started getting the following errors in /var/adm/messages, stating that
memory module 1904 was experiencing memory problems:

foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:
foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0 Sticky Softerrors accumulated
foo unix: [ID 340762 kern.notice] from Memory Module 1904

That seemed a straightforward error and I requested that a Sun Engineer come
and change the module. When he arrived he moved a known good module into
slot 1904 and placed the new module in 1901. (This was done to ensure that
it wasn't the slot that was causing the problem.) This seemed fine and we
booted the machine up again and ran a SunVTS stress test.
The same errors occurred again, but this time the errors were coming from
1901. We naturally thought that the DIMM was bad and replaced it again,
this time placing 1804 into 1901 and the new DIMM in 1804. (This was done to
rule out a faulty bank holding the 190x DIMMs.)

We booted up again and all seemed fine. SunVTS passed with no errors, and we
left it at that. A day later, though, the problems started again, this
time from 1804. However, the error messages were somewhat different:

foo pcipsy: [ID 758641 kern.info]     AFSR=40830000.a4800000 AFAR=00000000.d0610fa8,
foo   double word offset=5, Memory Module 1804 id 4.
foo pcipsy: [ID 553544 kern.notice] syndrome bits 83
foo pcipsy: [ID 865758 kern.warning] WARNING: correctable error from pci0 (upa mid 4) during
foo DVMA read transaction

as well as:

foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:
foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0 Sticky Softerrors accumulated
foo unix: [ID 340762 kern.notice] from Memory Module 1804
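
For anyone wanting to keep a running count of these reports per module, here
is a minimal Python sketch that tallies the AFT0 soft-error counts from text
in the format quoted above. It is only an illustration: the function name
tally_softerrors and the SAMPLE text are made up for this example, and the
pattern assumes the "accumulated" line is followed shortly by the matching
"from Memory Module" line, as it is in the excerpts in this post.

```python
import re
from collections import Counter

# Matches one AFT0 soft-error report. \s+ tolerates line wrapping such as
# a mail client introduces; the bounded gap skips the syslog prefix of the
# following "from Memory Module" line.
REPORT = re.compile(
    r"(\d+)\s+Intermittent,\s+(\d+)\s+Persistent,\s+and\s+(\d+)\s+"
    r"Sticky\s+Softerrors\s+accumulated"
    r".{0,200}?from\s+Memory\s+Module\s+(\w+)",
    re.DOTALL,
)


def tally_softerrors(text):
    """Return {module: Counter} summing the counts of every report found."""
    totals = {}
    for inter, pers, sticky, module in REPORT.findall(text):
        counts = totals.setdefault(module, Counter())
        counts["intermittent"] += int(inter)
        counts["persistent"] += int(pers)
        counts["sticky"] += int(sticky)
    return totals


# Sample input copied from the messages quoted in this post.
SAMPLE = """\
foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:
foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0
Sticky Softerrors accumulated
foo unix: [ID 340762 kern.notice] from Memory Module 1804
"""

if __name__ == "__main__":
    for module, counts in sorted(tally_softerrors(SAMPLE).items()):
        print(module, dict(counts))
```

Feeding it the contents of /var/adm/messages instead of SAMPLE should give one
total per module, which makes it easier to see when the errors move from one
slot to another.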

I got onto SUN support, who told me it looked like a motherboard error. We
changed the motherboard, and again SunVTS passed all tests.
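
The swap history so far can be replayed in a few lines of Python to see which
physical DIMM sat in each slot at the time it reported errors. This is only a
sketch: the DIMM labels (orig-1904, new-A and so on) are hypothetical names
for this example, and it assumes the known good module moved into 1904 on the
first visit was the one previously in 1901, which the text above implies but
does not state.

```python
def replay_swaps(initial, visits):
    """Replay DIMM moves; return (faulting slot, DIMM in that slot) pairs."""
    slots = dict(initial)  # slot -> DIMM label, copied so the input is untouched
    blamed = []
    for moves, faulting_slot in visits:
        for slot, dimm in moves:  # apply the engineer's moves for this visit
            slots[slot] = dimm
        blamed.append((faulting_slot, slots[faulting_slot]))
    return blamed


# Slot numbers are the ones from the post; DIMM labels are hypothetical.
INITIAL = {"1904": "orig-1904", "1901": "orig-1901", "1804": "orig-1804"}
VISITS = [
    # (moves made at this visit, slot that reported errors afterwards)
    ([], "1904"),  # first errors, before any swap
    # Assumption: the known good module moved into 1904 came from 1901.
    ([("1904", "orig-1901"), ("1901", "new-A")], "1901"),
    ([("1901", "orig-1804"), ("1804", "new-B")], "1804"),
]

if __name__ == "__main__":
    for slot, dimm in replay_swaps(INITIAL, VISITS):
        print(f"errors from slot {slot}: DIMM {dimm}")
```

Under those assumptions, every slot that reported errors after a visit was
holding the newly supplied DIMM at the time, and none of the original known
good modules was ever blamed.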

To my disgust the errors are back again. I have run SUN Explorer on the host
a number of times, and SUN have analysed the output but found no problems.
Their suggestion now is to break the memory interleave and disable a bank at
a time to try to isolate the problem. I can't do this, however, as it is a
production host and all 4 GB of memory is needed.

I have searched extensively in SunSolve etc. for clues, but to no avail. I did
notice, however, that some people have had problems with E450s incorrectly
diagnosing a failed DIMM. prtdiag does not show any errors at all.

Has anyone come across a problem like this before, and if so what was the
cause?

E450 spec -- 4 x 480 MHz processors, 4 GB memory (interleaved)
Solaris 8, patch 108528-12
SunVTS version 4.6

I will of course summarize no matter what the outcome.
Thanks
-Padraig
Received on Tue May 14 08:39:33 2002