Hi Gurus,

Sorry about the very late summary to this post, but I only got the problem sorted out this weekend.

It turns out that the 3 DIMMs I had got to replace the failed ones had done the rounds. One DIMM, I have since discovered, had been around the loop 3 times, being returned for repair before being sent out to another customer.

What I did was get Sun to pre-test two existing DIMMs for one week, after which I had them both sent to me. The second DIMM is a backup in case of transit damage. I only got the proven DIMM installed last week and it is fine now. Keep the second DIMM on-site for a week afterwards for safety.

p.s. Be sure to stress test the memory with SunVTS for about 24 hours after changing it. Also ensure that the machine is fully patched.

Thanks to everyone that replied,
Hope this helps
-Padraig

-----------------Original Post---------------------------

Hi Gurus,

I have a strange problem with one of our E450s. About a month ago I started getting the following errors in /var/adm/messages, stating that Memory Module 1904 was experiencing memory problems:

foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:
foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0 Sticky Softerrors accumulated
foo unix: [ID 340762 kern.notice] from Memory Module 1904

That seemed a straightforward error, and I requested that a Sun engineer come and change the module. When he arrived he moved a known good module into slot 1904 and placed the new module in 1901 (this was done to ensure that it wasn't the slot that was causing the problem). This seemed fine, and we booted the machine up again and ran a SunVTS stress test. The same errors occurred again, but this time they were coming from 1901.

We naturally thought that the DIMM was bad and replaced it again, this time placing 1804 into 1901 and the new DIMM in 1804 (this was done to rule out a faulty bank holding the 190x DIMMs). We booted up again and all seemed fine. SunVTS passed with no errors, and we left it at that.

A day later, though, the problems started again - this time from 1804. However, the error messages were somewhat different:

foo pcipsy: [ID 758641 kern.info] AFSR=40830000.a4800000 AFAR=00000000.d0610fa8,
foo double word offset=5, Memory Module 1804 id 4.
foo pcipsy: [ID 553544 kern.notice] syndrome bits 83
foo pcipsy: [ID 865758 kern.warning] WARNING: correctable error from pci0 (upa mid 4) during
foo DVMA read transaction

as well as:

foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:
foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0 Sticky Softerrors accumulated
foo unix: [ID 340762 kern.notice] from Memory Module 1804

I got onto Sun support, who told me it looked like a motherboard error. We changed the motherboard, and again SunVTS passed all tests. To my disgust the errors are back again.

I have run Sun Explorer on the host a number of times and Sun have analysed the output, but they have found no problems. Their suggestion now is to break the memory interleave and disable a bank at a time to try to isolate the problem. I can't do this, however, as it is a production host and all 4GB of memory is needed.

I have searched extensively in SunSolve etc. for clues, but to no avail. I did notice, however, that some people have had problems with E450s incorrectly diagnosing a failed DIMM. prtdiag does not show any errors at all.

Has anyone come across a problem like this before, and if so, what was the cause?

E450 spec:
4 x 480MHz processors, 4GB memory (interleaved)
Solaris 8, patch 108528-12
SunVTS version 4.6

I will of course summarize no matter what the outcome.

Thanks
-Padraig
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

Received on Tue May 14 08:39:33 2002
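
For anyone chasing a similar fault, tallying the soft-error counts per module can make it easier to see whether the errors follow a DIMM or a slot as parts get swapped around. What follows is only a minimal Python sketch and was not part of the original thread. The function name tally_softerrors and the two regular expressions are assumptions, modelled solely on the sample /var/adm/messages lines quoted in the post; other Solaris releases or patch levels may word these messages differently.

```python
import re
from collections import defaultdict

# Both patterns are assumptions based only on the sample lines in the post.
COUNTS_RE = re.compile(
    r"(\d+) Intermittent, (\d+) Persistent, and (\d+) Sticky Softerrors accumulated"
)
MODULE_RE = re.compile(r"from Memory Module (\d+)")


def tally_softerrors(lines):
    """Sum soft-error counts per memory module from syslog-style lines.

    A counts line is held until the next "from Memory Module N" line,
    which names the module the counts belong to.
    """
    totals = defaultdict(lambda: {"intermittent": 0, "persistent": 0, "sticky": 0})
    pending = None  # counts seen but not yet attributed to a module
    for line in lines:
        counts = COUNTS_RE.search(line)
        if counts:
            pending = tuple(int(n) for n in counts.groups())
            continue
        module = MODULE_RE.search(line)
        if module and pending is not None:
            entry = totals[module.group(1)]
            entry["intermittent"] += pending[0]
            entry["persistent"] += pending[1]
            entry["sticky"] += pending[2]
            pending = None
    return dict(totals)


# Sample input copied from the messages quoted in the post.
SAMPLE = [
    "foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:",
    "foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0 Sticky Softerrors accumulated",
    "foo unix: [ID 340762 kern.notice] from Memory Module 1904",
    "foo double word offset=5, Memory Module 1804 id 4.",
    "foo unix: [ID 908439 kern.notice] [AFT0] Multiple Softerrors:",
    "foo unix: [ID 356634 kern.notice] 3 Intermittent, 253 Persistent, and 0 Sticky Softerrors accumulated",
    "foo unix: [ID 340762 kern.notice] from Memory Module 1804",
]

if __name__ == "__main__":
    for module_id, counts in sorted(tally_softerrors(SAMPLE).items()):
        print(module_id, counts)
```

To run it against a real log, pass an open file instead of SAMPLE, for example tally_softerrors(open("/var/adm/messages")). The pcipsy correctable-error line is deliberately not counted here, since it carries no soft-error totals; it only mentions the module.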