Lads,

Many thanks to those who responded:

Seth Rothenberg
Nick Hedley
Mike Kiernan
Shannon Ward

and, most important, Joseph Herpers (joeh@stsolutions.com).

The original post is below. It involved some very awkward crashing with inconsistent memory errors. It was obvious from the original posting that there was a hardware issue. Some got confused and assumed that the errors were automatically indicative of a memory fault. This is a very dangerous approach, as an error in a memory write can be caused by processing, bus transfer, I/O management, etc. Once a bad instruction gets to the RAM, the RAM will choke on it, so the error is reported against memory even though the fault arose elsewhere.

Joe pointed out a tool that can be used to look up what an error means. The software is called the ON-Line Detective for Sun; you can see info on it at www.sundetective.com. One of the results of searching for my particular error showed that a failure from a DMA write request was caused by a defect in the Enterprise Server Board. Sun engineers confirmed this (after coming out for the second time) and have replaced the board, and we are now off to the races.

Many thanks for this list, and to those who responded.

Cian O'Sullivan

-----Original Message-----
From: sunmanagers-admin@sunmanagers.org [mailto:sunmanagers-admin@sunmanagers.org] On Behalf Of Cian O'Sullivan
Sent: Monday, July 02, 2001 9:54 AM
To: sunmanagers@sunmanagers.org
Subject: e4500 Crashing.

Lads,

I have an e4500 that is crashing without explanation.

Brief outline

The symptoms are that you boot it into extended diagnostics and it gives wildly differing SIMM errors every time; sometimes it boots to the OS, sometimes (as now) it doesn't even boot to the OBP. If you boot it off a single CPU/mem board at a time, it comes up fine; as soon as you start adding boards it goes wonky again.

A quick poll of the board temps on the other adjacent e4500s shows that the CPU/mem boards are within the operating env limits (just ...
i.e. below 40 degrees)

Sun engineers came in and have given all CPU/mem and I/O boards a full health check, run extended diagnostics and made some OS / operating environment recommendations, which have now been implemented. The system was stress tested overnight and appeared stable. However, it crashed again.

Here are some segments from the syslog. Any comments would be most appreciated, as we are now at our wits' end.

Piece 1.

Jun 27 02:49:26 dublin232 unix: CE Error queue wrapped
Jun 27 02:49:26 dublin232 last message repeated 1 time
Jun 27 02:49:29 dublin232 unix: Multiple Softerrors:
Jun 27 02:49:29 dublin232 unix: Seen 4 Intermittent and 2 Corrected Softerrors
Jun 27 02:49:29 dublin232 unix: from SIMM Board 2 J3200
Jun 27 02:49:30 dublin232 unix: Enabling verbose CE messages.
Jun 27 02:49:30 dublin232 unix: Softerror: Intermittent ECC Memory Error SIMM Board 2 J3200
Jun 27 02:49:30 dublin232 unix: ECC Data Bit 45 was corrected
Jun 27 02:49:30 dublin232 unix: CPU8 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000000 638ed060, SIMM Board 2 J3200

Piece 2

Jun 27 02:49:30 dublin232 unix: Syndrome 0x2c, Size 3, Offset 0 UPA MID 8
Jun 27 02:50:01 dublin232 unix: CPU12 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000001 92ab30a0, SIMM Board 4 J3200
Jun 27 02:50:01 dublin232 unix: Syndrome 0x2c, Size 3, Offset 0 UPA MID 12
Jun 27 02:50:01 dublin232 unix: Softerror: Intermittent ECC Memory Error SIMM Board 4 J3200
Jun 27 02:50:02 dublin232 unix: ECC Data Bit 45 was corrected

Received on Mon Jul 2 17:13:34 2001
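For anyone wading through similar logs, the segments above can be tallied to see whether the same ECC syndrome turns up on more than one board, which is one hint (not proof) that the fault may lie outside a single SIMM. What follows is only a rough sketch in Python. The function names (`boards_by_syndrome`, `shared_syndromes`) are made up for illustration, and it assumes each "Syndrome" line directly follows its "CE Error" line, as it does in the excerpt.

```python
import re
from collections import defaultdict

# Excerpt copied from the syslog segments in the post.
SAMPLE_LOG = """\
Jun 27 02:49:30 dublin232 unix: Softerror: Intermittent ECC Memory Error SIMM Board 2 J3200
Jun 27 02:49:30 dublin232 unix: ECC Data Bit 45 was corrected
Jun 27 02:49:30 dublin232 unix: CPU8 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000000 638ed060, SIMM Board 2 J3200
Jun 27 02:49:30 dublin232 unix: Syndrome 0x2c, Size 3, Offset 0 UPA MID 8
Jun 27 02:50:01 dublin232 unix: CPU12 CE Error: AFSR 0x00000000 00100000, AFAR 0x00000001 92ab30a0, SIMM Board 4 J3200
Jun 27 02:50:01 dublin232 unix: Syndrome 0x2c, Size 3, Offset 0 UPA MID 12
Jun 27 02:50:01 dublin232 unix: Softerror: Intermittent ECC Memory Error SIMM Board 4 J3200
Jun 27 02:50:02 dublin232 unix: ECC Data Bit 45 was corrected
"""

# Patterns written against the message wording seen in the excerpt only.
CE_RE = re.compile(r"CPU(\d+) CE Error: .*SIMM Board (\d+) (J\d+)")
SYNDROME_RE = re.compile(r"Syndrome (0x[0-9a-fA-F]+)")


def boards_by_syndrome(log_text):
    """Map each ECC syndrome to the set of (board, simm) pairs reporting it."""
    result = defaultdict(set)
    pending = None  # (board, simm) taken from the most recent CE Error line
    for line in log_text.splitlines():
        match = CE_RE.search(line)
        if match:
            pending = (int(match.group(2)), match.group(3))
            continue
        match = SYNDROME_RE.search(line)
        if match and pending is not None:
            # Assumption: this Syndrome line belongs to the CE Error just seen.
            result[match.group(1)].add(pending)
            pending = None
    return dict(result)


def shared_syndromes(log_text):
    """Return the syndromes that were reported by more than one board."""
    found = boards_by_syndrome(log_text)
    return sorted(
        syndrome
        for syndrome, sources in found.items()
        if len({board for board, _simm in sources}) > 1
    )


if __name__ == "__main__":
    for syndrome in shared_syndromes(SAMPLE_LOG):
        print(syndrome, sorted(boards_by_syndrome(SAMPLE_LOG)[syndrome]))
```

On the excerpt, syndrome 0x2c is reported against J3200 on both Board 2 and Board 4. The patterns would need adjusting for other message formats or OS releases.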
This archive was generated by hypermail 2.1.8 : Wed Mar 23 2016 - 16:24:58 EDT