SUMMARY: Memory group (A0) failed on V880 CPU/mem board

From: Stoyan Genov <stoyan.genov_at_gbservices.biz>
Date: Tue Feb 28 2006 - 03:56:05 EST
Good day,

Although not definitive, it seems the CPU/memory board might be
causing the trouble. I have had no chance to replace the board yet.
I have replaced the failed DIMMs, leaving group B1 empty, and have
seen no errors so far.

Thanks to Joe Fletcher and Sandwich Maker:

Joe Fletcher wrote:

Sounds like what used to be a fairly common fault on a load of the
UltraIII stuff. The fault is most likely a flaky system board. A call
in to Sun and they will replace it. I thought they'd mostly sorted the
manufacturing problems, especially with the 1.2GHz versions and upwards,
but I guess there will always be a few duff ones around.

Sandwich Maker wrote:

i haven't experienced it personally, but iirc there was an earlier
generation of boards [500MHz?] that would show memory errors if the
cpu itself wasn't secured properly on the board.  notoriously on these
boards the cpu heatsink screws were often either too loose or worse,
too tight, leading to phantom dimm errors.  iirc a clue was that all
memory would suddenly show bad.
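
The clue described above (a whole group of DIMMs going bad at once,
rather than a single module) can be sketched as a quick check over the
slots reported in the logs. This is only an illustrative sketch in
Python, not a Sun diagnostic: the function name and the rule are
assumptions, and the group table contains only group A0, the one group
whose slots are named in the original question.

```python
# Hypothetical helper, not part of any Sun tool. The table lists only the
# group named in the original post; other groups would have to be filled
# in from the system's service documentation.
GROUPS = {
    "A0": ["J3000", "J3001", "J2900", "J2901"],
}


def whole_group_failures(groups, failed_slots):
    """Return the names of groups in which every DIMM is reported failed."""
    failed = set(failed_slots)
    return sorted(
        name
        for name, slots in groups.items()
        if slots and all(slot in failed for slot in slots)
    )


# Slots reported with hard errors in the original question.
reported = ["J3000", "J3001", "J2900", "J2901"]

for name in whole_group_failures(GROUPS, reported):
    # Assumed rule of thumb taken from the replies: a whole group failing
    # at once points at the board or CPU seating rather than the DIMMs.
    print(f"group {name}: all DIMMs reported bad at once; "
          "check the board and CPU seating before blaming the DIMMs")
```

A single failed slot, such as J3000 alone, would not trigger the check,
which matches the ordinary case of one bad DIMM.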

Best Regards,
Stoyan Genov


Stoyan Genov wrote:
> Good day,
> 
> A fully equipped V880 (8 x CPU @ 1.2GHz, 4 boards, 64GB RAM)
> spontaneously and irregularly restarted a couple of times.
> Logs from two days ago showed a soft memory error on Slot D, J8101.
> 
> After the restarts, it no longer showed errors in this bank, but reported
> all banks in the required group A0 (J3000, J3001, J2900, J2901) with
> hard errors. The machine is configured to restart on hardware errors
> (error-reset-recovery=boot in eeprom), so I believe the restarts are normal
> given the errors and the configuration.
> 
> I have asr-disable'd cpu5 and cpu7, thus cutting off access to this
> board and its memory.
> 
> I have the chance to swap the DIMMs reported as faulty in the next hour.
> 
> What bothers me:
> 
> Am I too paranoid to think that a simultaneous fault of all DIMMs in one
> group is not actually a problem with the DIMMs?
> 
> Is it possible that the board is faulty?
> 
> Is it possible that another failed DIMM (J8101) is actually causing the
> trouble?
> 
> Any comments and advice are welcome. I will summarize.