Summary: Which DIMM was replaced in V880.

From: jason kappy <jasonkappy_at_yahoo.com>
Date: Fri May 13 2005 - 09:50:48 EDT
Sorry for the late summary. Thanks to Michael, Joe, Grzegorz, John, and David.
 To find the DIMM you replaced,
a)Interesting. I
 1. Connect a Unix or Windows box to ttya on the server (on a Sun, run "tip hardwire"; this
connects from that machine's ttyb to your server, where you have cabled to ttya).
2. Halt the system.
3. Set the OBP values (this varies from system to system): diag-switch?
to true, diag-level to max, mfg-mode to true or on, and then power off the
server completely.
4. Power on.
5. Record the output; this takes about 2-5 minutes.
 
b) I realized this when I was shipping back my swapped-out DIMM.
If you happen to have the packing material they sent you, it has the serial number on
the shipping label and also on the anti-static cover. Compare this against the SN on the DIMM.
My observation is that the SN is the part number followed by further characters
(SN = PartNumber + "*"). Example: 501503078B3F212, where PN = 5015030.
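
The comparison in b) can be sketched in a few lines of Python. This is only an illustration of my observation that the SN starts with the part number; the function name is made up, and stripping a dash from the part number (as in 501-5030) is an assumption about how the PN may be printed.

```python
def serial_matches_part(serial: str, part_number: str) -> bool:
    """Return True if the serial number begins with the part number.

    Hypothetical helper: assumes SN = PartNumber + further characters,
    which is an observation from one DIMM, not a documented Sun rule.
    """
    sn = serial.strip().upper()
    # Assumption: the part number may be printed with a dash (501-5030).
    pn = part_number.strip().upper().replace("-", "")
    return bool(pn) and sn.startswith(pn)


# Example from the label on my replacement DIMM.
print(serial_matches_part("501503078B3F212", "5015030"))  # True
```

If the SN on the DIMM in the slot does not start with the PN of the part Sun shipped, the old DIMM is probably still in place.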
 
I used the above to at least be sure that the old DIMM was replaced. However, enlightened by forum members, I am now sure that it's the fault of the system. I will change the Uniboard and memory if I start getting panics or a large number of errors per day. For now I don't want to mess around with the system. I am adding these emails so that they may be helpful to others.
 

Michael:
If you replaced the memory module that was in the dmesg output, etc., you only
replaced the memory module that was reporting the error.
The actual error could come from any memory module in that specific
memory bank. The Sun System Handbook has a diagram of the system board with the
slot numbers. Of course, other parts (like the system board) could be failing, though
that is not likely. I AM SURPRISED THAT SUN DID NOT OFFER TO REPLACE ALL FOUR.
 
Joe:
Get SUN to replace the system board. The construction of many
of the uniboards was not all it should have been. I've lost
count of the ones I've replaced in V880s.
 
SUN's UltraSPARC III kit has a history of memory and CPU problems.
There was an advisory issued last year which actually
admitted to the manufacturing problems. I'll forward it to
you if I can find it. If you look in the archives you'll
find an old post of mine where I did a survey of people with
V880s. The failure rate was 80%, based on the responses I received.
 
SUN's standard response to anything like this is
"patch/update firmware, then send us an explorer". Send them
the explorer by all means, but insist that you want the board
replaced. If they've already replaced the supposedly bad
DIMM and the error persists, then the problem is either with
the entire bank or with the board.
 
Grzegorz:
If this happens with no side effects, it means
that you have a non-fatal memory error, i.e. a
single-bit error (intermittent or persistent). The memory
is ECC, so the system can correct the error. The only pain would
be a massive number of such errors, so that corrections
happen too often. SUN has its own "standards" for
that (some 30 ECC errors in a 6-hour period or so).
The periodic messages are probably the result of an internal
watchdog checking for memory errors. Following an
algorithm unknown to me, it periodically
(e.g. every 12 hours) looks into the offending memory
banks. I see such behaviour too :-)
Well, maybe this is the same problem I had half a year ago ...
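
The "some 30 ECC errors in a 6-hour period" rule of thumb above can be checked against a list of error timestamps with a sliding window. The sketch below is only an illustration: the function name is hypothetical, the figures 30 and 6 hours are the approximate numbers recalled above rather than a documented SUN threshold, and extracting the timestamps from the messages file is left out.

```python
from collections import deque
from datetime import datetime, timedelta


def exceeds_ecc_threshold(timestamps, max_errors=30, window=timedelta(hours=6)):
    """Return True if at least max_errors errors fall within any one window.

    Hypothetical helper; the defaults are the recalled rule of thumb,
    not an official figure.
    """
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop errors that are older than the window relative to this one.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= max_errors:
            return True
    return False


# Example: one corrected error every 10 minutes, 30 in total.
start = datetime(2005, 5, 13, 0, 0)
burst = [start + timedelta(minutes=10 * i) for i in range(30)]
print(exceeds_ecc_threshold(burst))  # True
```

An error once an hour would never reach 30 inside any 6-hour window, so the same function returns False for that pattern.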
 
I had panics and reboots because of memory errors.
The SUN engineer replaced a few offending memory modules according
to the logs in the messages file and also according to extensive
power-on tests (OBP). After a week or so I got the same
problems again, also pointing to the just-replaced memory modules.
Then the SUN engineer replaced all (!!) memory modules on
the CPU/memory board. He told me that a long series of
memory modules for V880 servers produced in 2002 (or maybe
a little earlier or later) had an inconsistent chip on the memory module,
the chip that is responsible for reporting errors
(he called it the "Philips chip", if I remember correctly).
Because of this inconsistency, an error on one module
may be reported by another module ...
That was the case for my V880 produced in June 2002 (we got
it in August 2002). After replacement of all memory modules
the panics and reboots went away :-)
SUN has an internal problem report (PR) on this. One can
check whether this is the problem by looking at the memory module,
and especially at the name and numbers on the chip ...
What is interesting is that the problem appears only in
the V480/880. The memory modules behave normally when put in
other servers ...
Hope this helps.
 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Hello Gurus,
> As I had been getting persistent errors on one of the DIMMs for a
> while, last week I replaced one of the DIMMs on our V880 server. The
> DIMM was shipped by Sun. After a week, I am getting errors again on
> the same DIMM. This makes me worry: "Did I put it in the wrong
> place?" The only thing that reduces my grief is that the error moved
> to a different bit. Earlier, I got the error consistently on bit 5;
> now it has moved to bit 35. I am sure I put it in the right CPU slot.
> The most probable mistake I could have made is putting the DIMM in
> the DIMM slot above or below. If I had a chance (which is not likely
> to happen) to open the box, is there any number on the new DIMM I can
> look for? I wish I had put some kind of mark on the new DIMM. I feel
> I am being too imaginative.
>
> I will summarize.
>
> Thank you,
> J.
_--------------------------------------------------------------------------------------------------------------------------------------
 
 
Thank you,
J.


		
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Fri May 13 09:51:21 2005

This archive was generated by hypermail 2.1.8 : Thu Mar 03 2016 - 06:43:46 EST