The question was:
>> On a Sun 4/490 with SunOS 4.1.1, we have messages like:
>> id000e:block 79204 (638854 abs):read:Conditional Success.
>> Data Retry Performed.
>>
>> Is this the beginning of hardware troubles, or is it a software
>> known trouble?
My second guess was right. This message is "normal"
under heavy traffic, as the printed manuals say.
They also say that repairing sectors can fix things. Maybe, but
in our case we were getting these messages at a rate
of 100 or more per hour. That was too much!
Sun first asked us to replace a disk. No change.
They then asked us to replace the IPI controller. The error rate
fell from 100+/hour to 3/hour. We now have a controller
dated 4/18/91. Since there were still
some errors, I reformatted the disk. And now, for 2 days,
no more errors.
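
For anyone who wants to track a rate like the 100+/hour and 3/hour
figures above, here is a minimal sketch in Python. It assumes the
messages land in a syslog-style file with a "Mon DD HH:MM:SS host ..."
prefix in front of the id000e text; that prefix, the host name in the
comment and the function name soft_errors_per_hour are my assumptions,
not something taken from the Sun documentation.

```python
import re
from collections import Counter

# Assumed line layout (syslog-style prefix is an assumption):
#   Dec  9 11:15:07 somehost vmunix: id000e: block 79204 (638854 abs): read: Conditional Success.
# Spaces after the colons are optional, since both spellings appear in this thread.
SOFT_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+.*?"
    r"\b(?P<disk>id[0-9a-f]+):\s*block\s+\d+\s+\(\d+\s+abs\):\s*"
    r"(?P<op>read|write):\s*Conditional Success"
)

def soft_errors_per_hour(lines):
    """Count 'Conditional Success' messages per (month, day, hour)."""
    counts = Counter()
    for line in lines:
        m = SOFT_RE.search(line)
        if m:
            counts[(m["month"], int(m["day"]), int(m["hour"]))] += 1
    return counts
```

Feeding it the lines of the log file (for example an open file
object) gives one counter entry per hour, so a drop in the rate
after a controller swap shows up directly.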
For reference, some timings:
- saving an IPI disk to a DAT = 1h30;
- formatting an IPI disk (model 9720) = 1h30;
- restoring the data = 4h.
Thanks to
ca@idefix.informatik.UNI-KIEL.DBP.DE
pln@egret1.Stanford.EDU
blc@sol.med.ge.com
eeimkey@eeiua.ericsson.se
bill@ihpds1.att.com
flash.bellcore.com!breeze.bellcore.com!dan
liz@heh.cgd.ucar.EDU
ada3.ca.boeing.com!moses.boeing.com!rr6204
jaf@jupiter.Sun.CSD.unb.ca
MAL@CORNELLC.BITNET
husc6.BITNET!gauss.med.harvard.EDU!satmb
and mp@allegra.att.com for his long answer (see below).
--Jacques Beigbeder
_________________________
>From mp@allegra.att.com Mon Dec 9 17:17:54 1991
From: mp@allegra.att.com
Message-Id: <9112091617.AA21772@dmi.ens.fr>
Date: Mon, 9 Dec 91 11:15:57 EST
To: beig@dmi.ens.fr
Subject: Re: read: conditional success
They're basically soft errors.
Sun's position in its documentation (the 4.1.1 release notes) is that
these errors occur during heavy load and aren't cause for alarm. We've
found that they occur under more circumstances than that, though.
When we got our three 4/390's 2 years ago, each with 4 or 8 1.2GB IPI
drives, all of them reported occasional read and write errors
(Conditional Success). The disks that got the most errors eventually
started getting hard errors, so they were replaced. 4 out of 16 drives
were replaced in the first few months. If you don't have a heavy
load on your systems, watch the error messages to see if they affect
just one disk and if they turn into hard errors, e.g.:
id000a: block 9104 (9104 abs): write: Uncorrectable Data Check.
since it may just be a dying disk.
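
One way to do that watching is sketched below in Python: tally the
messages per disk and flag any disk that has logged a hard error. It
treats "Conditional Success" as soft and "Uncorrectable Data Check" as
hard, the only two status strings quoted in this thread; any other
status is counted as "other" rather than guessed at. The names
tally_by_disk and suspect_disks are hypothetical.

```python
import re
from collections import defaultdict

# Matches the message body quoted in this thread, with or without
# spaces after the colons, e.g.
#   id000a: block 9104 (9104 abs): write: Uncorrectable Data Check.
MSG_RE = re.compile(
    r"\b(?P<disk>id[0-9a-f]+):\s*block\s+\d+\s+\(\d+\s+abs\):\s*"
    r"(?P<op>read|write):\s*(?P<status>[^.]+)\."
)

SOFT = "Conditional Success"       # soft error, per the thread
HARD = "Uncorrectable Data Check"  # hard error example from the thread

def tally_by_disk(lines):
    """Return {disk: {"soft": n, "hard": n, "other": n}}."""
    tally = defaultdict(lambda: {"soft": 0, "hard": 0, "other": 0})
    for line in lines:
        m = MSG_RE.search(line)
        if not m:
            continue  # e.g. the "Data Retry Performed." continuation line
        status = m["status"].strip()
        if status == SOFT:
            kind = "soft"
        elif status == HARD:
            kind = "hard"
        else:
            kind = "other"  # unknown status: counted, not classified
        tally[m["disk"]][kind] += 1
    return dict(tally)

def suspect_disks(tally):
    """Disks that have logged at least one hard error."""
    return sorted(d for d, c in tally.items() if c["hard"] > 0)
```

If the soft errors pile up on a single disk, or suspect_disks()
returns anything, that disk is the one to look at first; errors
spread evenly across all disks point elsewhere.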
With us, the soft errors still kept appearing, across all disks that
were being used, so the next thing Sun tried was to replace the
controllers with newer rev levels and change the cables and
terminators. This helped one of the servers. Then they found out that
the disks had been designed to use adhesive tape (!) to guide airflow,
and that over time the tape had become unstuck and was flapping in the
breeze rather than doing its job. They installed plastic baffles to
correct this, and the error messages seemed to come less frequently.
Then we heard there was a problem with cables having impedance that was
too high (and in fact the impedance can increase with age, which could
be one reason the error messages usually increased in frequency over
time). Replacing the cables with a set that was tested for several
weeks at Sun eliminated the problem, and we are gradually getting such
cables installed in the systems. Also, they now have cables with
filters on them, which may help - we'll be trying a set of them soon.