Many weeks ago I posted a note about problems with our Fujitsu M226XSA
disks on a Sparc 1000 system. The general problem was that we would see
errors of the form:
>polled command timeout
>esp: State=CLEARING Last State=CLEARING
>esp: Latched stat=0x97<IPND,XZERO,MSG,CD,IO> intr=0x8<FCMP> fifo 0x80
>esp: last msg out: <unknown msg>; last msg in: COMMAND COMPLETE
>esp: DMA csr=0x40040010<INTEN>
>esp: addr=fc00102a dmacnt=0 last=fc001028 last_cnt=1000
>esp: Cmd dump for Target 3 Lun 0:
>esp: cdblen=6, cdb=[ 0x8 0x10 0xcc 0xf0 0x10 0x0 ]; Status=0x0
>esp: pkt_state=0x1b<STS,XFER,SEL,ARB> pkt_flags=0x0 pkt_statistics=0x3
>esp: cmd_flags=0x10422 cmd_timeout=0
>WARNING: /io-unit@f,e0200000/sbi@0,0/dma@1,81000/esp@1,80000 (esp1):
>Connected command timeout for Target 0.0
>SCSI transport failed: reason 'reset': retrying command
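For anyone reading these dumps, the cdb=[ ... ] field is the raw SCSI command
block that was outstanding when the timeout hit. The sketch below decodes the
6-byte form using the standard SCSI-1/SCSI-2 layout (opcode, 3-bit LUN plus
21-bit block address, length, control). The function names are mine, not
anything from the esp driver, and only the two opcodes that show up in this
message are named.

```python
# Opcodes for the two 6-byte commands that appear in the dumps in this message.
OPCODES_6 = {0x08: "READ(6)", 0x0A: "WRITE(6)"}


def parse_cdb_line(line):
    """Pull the bytes out of a 'cdb=[ 0x8 0x10 ... ]' fragment of an esp dump."""
    inner = line.split("cdb=[", 1)[1].split("]", 1)[0]
    return [int(tok, 16) for tok in inner.split()]


def decode_cdb6(cdb):
    """Decode a 6-byte SCSI CDB given as a list of six integers."""
    if len(cdb) != 6:
        raise ValueError("expected a 6-byte CDB")
    opcode, b1, b2, b3, length, control = cdb
    return {
        "op": OPCODES_6.get(opcode, "opcode 0x%02x" % opcode),
        # Top 3 bits of byte 1 carry the LUN in the SCSI-1/SCSI-2 layout.
        "lun": b1 >> 5,
        # The remaining 5 bits plus bytes 2 and 3 form a 21-bit block address.
        "lba": ((b1 & 0x1F) << 16) | (b2 << 8) | b3,
        # A length byte of 0 means 256 blocks for 6-byte read/write commands.
        "blocks": length or 256,
        "control": control,
    }


if __name__ == "__main__":
    line = ">esp: cdblen=6, cdb=[ 0x8 0x10 0xcc 0xf0 0x10 0x0 ]; Status=0x0"
    print(decode_cdb6(parse_cdb_line(line)))
```

Run against the dump above, this reports a READ(6) of 16 blocks starting at
block 0x10ccf0 on LUN 0, i.e. an ordinary disk read rather than anything
exotic.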
I got many responses telling me that it would not work. I was too bullheaded
to listen to those responses. I got several copies of the Sun whitepaper on
SCSI. I got copies of the FAQ, and copies of several SCSI related problems.
I also got many responses telling me to set the /etc/system file to disable
{synchronous, tagged queuing, ...}. Either the "proposed" solutions did not
work, or they were deemed unacceptable for other reasons.
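For reference, the /etc/system suggestions all amount to clearing bits in the
scsi_options kernel variable. The sketch below only computes the value and
prints the line you would add. The bit values are the ones I recall from the
Solaris header <sys/scsi/conf/autoconf.h>, and the starting value is purely
illustrative, so treat both as assumptions and check them against your own
release before using the result.

```python
# Bit values as recalled from <sys/scsi/conf/autoconf.h>; ASSUMPTIONS, verify
# against the header shipped with your release.
SCSI_OPTIONS_SYNC = 0x20   # synchronous transfers
SCSI_OPTIONS_TAG = 0x80    # tagged command queuing


def without(options, *bits):
    """Return options with each of the given bits cleared."""
    for bit in bits:
        options &= ~bit
    return options


def system_line(options):
    """Format the value as an /etc/system 'set' line."""
    return "set scsi_options=0x%x" % options


if __name__ == "__main__":
    current = 0x1F8  # illustrative starting value only, not a known default
    print(system_line(without(current, SCSI_OPTIONS_SYNC, SCSI_OPTIONS_TAG)))
```

Note that this setting applies to every SCSI bus on the machine, which is one
reason it can be unacceptable on a system that also has fast disks.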
Finally one respondent gave us a clue. "Adam W. Feigin" <awf@iis.ee.ethz.ch>
informed me of a patch that may take care of the problem (it didn't, but the
pointer he provided proved to be key in solving the problem). After another
round of testing to determine that the problem happened on the FSBE board, we
went through the Sun patch database and found some interesting entries (all
generated by Sun personnel). The entries are included below.
The final result was that Sun asked us to install patch ID #101378-09. The
patch got rid of the problem for us. We can now run our "old slow" SCSI
disks on the SS1000 systems without problems.
Thanks to Adam W. Feigin for his assistance.
--curt
Curt Freeland
Manager, Systems Engineering
Purdue University Engineering Computer Network
(curt@ecn.purdue.edu) (317) 494-3715(voice) / (317) 494-6440(fax)
===========================================================================
Bug Id: 1132229
Category: kernel
Subcategory: driver
Release summary: s1093_alpha2.0, s1093_alpha1.0
Synopsis: esp: stacked cmds fail on LX with FAS101 chip and slow conner 200MB
drive
Integrated in releases: s1093_alpha2.3
Patch id:
Description:
The esp chip has a feature, cmd stacking, which allows two commands to be
issued, which reduces the number of interrupts per request. This works
reliably on FAS236 and older esp's but not on FAS101. The chip fails to
interrupt when the target has disconnected from the bus. This causes a
timeout condition:
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000
(esp0):
Connected command timeout for Target 3.0
WARNING: esp: State=CLEARING (0x8), Last State=CLEARING (0x8)
WARNING: esp: Cmd dump for Target 3 Lun 0:
WARNING: esp: cdb=[ 0xa 0x0 0x66 0xa0 0x10 0x0 ]
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000
(esp0):
Status=0x0
DEBUG: esp0: polled command timeout
DEBUG: esp: State=CLEARING Last State=CLEARING
DEBUG: esp: Latched stat=0x97<IPND,XZERO,MSG,CD,IO> intr=0x8<FCMP>
fifo 0x80
DEBUG: esp: last msg out: <unknown msg>; last msg in: COMMAND
COMPLETE
DEBUG: esp: DMA csr=0xa4200010<INTEN>
DEBUG: esp: addr=fc00000a dmacnt=8000 last=fc000008 last_cnt=fc00
DEBUG: esp: Cmd dump for Target 3 Lun 0:
DEBUG: esp: cdblen=6, cdb=[ 0xa 0x0 0x66 0xa0 0x10 0x0 ];
Status=0x0
DEBUG: esp: pkt_state=0x1f<STS,XFER,CMD,SEL,ARB> pkt_flags=0x1
pkt_statistics=0x43
DEBUG: esp: cmd_flags=0x10c62 cmd_timeout=0
WARNING:
/iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000/sd@3,0 (sd3):
SCSI transport failed: reason 'timeout': retrying command
WARNING:
/iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000/sd@3,0 (sd3):
SCSI transport failed: reason 'reset': retrying command
WARNING:
/iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000/sd@1,0 (sd1):
SCSI transport failed: reason 'reset': retrying command
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000
(esp0):
Connected command timeout for Target 3.0
WARNING: esp: State=CLEARING (0x8), Last State=CLEARING (0x8)
WARNING: esp: Cmd dump for Target 3 Lun 0:
WARNING: esp: cdb=[ 0xa 0x0 0x66 0xa0 0x10 0x0 ]
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000
(esp0):
Status=0x0
DEBUG: esp0: ILLEGAL bit set
DEBUG: esp: State=CLEARING Last State=CLEARING
DEBUG: esp: Latched stat=0x90<IPND,XZERO> intr=0x60<ILL,DISC> fifo
0x80
DEBUG: esp: last msg out: <unknown msg>; last msg in: COMMAND
COMPLETE
DEBUG: esp: DMA csr=0xa4200010<INTEN>
DEBUG: esp: addr=fc00000a dmacnt=8000 last=fc000008 last_cnt=2000
DEBUG: esp: Cmd dump for Target 3 Lun 0:
DEBUG: esp: cdblen=6, cdb=[ 0xa 0x0 0x66 0xa0 0x10 0x0 ];
Status=0x0
DEBUG: esp: pkt_state=0x1f<STS,XFER,CMD,SEL,ARB> pkt_flags=0x1
pkt_statistics=0x43
DEBUG: esp: cmd_flags=0x10c62 cmd_timeout=0
WARNING:
/iommu@0,10000000/sbus@0,10001000/espdma@4,8400000/esp@4,8800000/sd@3,0 (sd3):
SCSI transport failed: reason 'reset': retrying command
This occurred during PIT testing on an LX with a Conner 200MB drive, which is
known to have slow disconnect timing.
The timeout recovery fails because raising ATN to start an abort operation
is not legal when the target has already disconnected.
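The failed recovery is visible in the second dump above, where the interrupt
register reads 0x60<ILL,DISC>: an illegal-command interrupt together with a
disconnect. A small decoder makes it easier to compare such values across
dumps. This is only a sketch: the bit tables are inferred from the annotated
values in the dumps themselves (0x97, 0x90, 0x8, 0x60, 0x1b, 0x1f), assuming
the driver prints names from the high bit down, and any bit that never appears
there is shown as raw hex rather than guessed.

```python
# Bit names inferred from the annotated dumps; ASSUMPTION: the driver lists
# names from the most significant bit to the least significant bit.
STAT_BITS = {0x80: "IPND", 0x10: "XZERO", 0x04: "MSG", 0x02: "CD", 0x01: "IO"}
INTR_BITS = {0x40: "ILL", 0x20: "DISC", 0x08: "FCMP"}
PKT_STATE_BITS = {0x10: "STS", 0x08: "XFER", 0x04: "CMD", 0x02: "SEL", 0x01: "ARB"}


def decode(value, table):
    """Return the names of the bits set in value, highest bit first."""
    names = [name for bit, name in sorted(table.items(), reverse=True)
             if value & bit]
    # Any bits not covered by the table are reported as a single hex residue.
    unknown = value & ~sum(table)
    if unknown:
        names.append("0x%x" % unknown)
    return names


if __name__ == "__main__":
    print("stat 0x97:", decode(0x97, STAT_BITS))
    print("intr 0x60:", decode(0x60, INTR_BITS))
    print("pkt_state 0x1b:", decode(0x1B, PKT_STATE_BITS))
    print("pkt_state 0x1f:", decode(0x1F, PKT_STATE_BITS))
```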
The description field as copied from bug report 1134028 follows:
(c boire 6/15/93)
Various error messages for SCSI peripherals are reported when running sundiag.
These error messages include:
010.34.999.9085 06/14/93 17:11:18 c0t0d0 rawtest ERROR: Big read failed on
disk, in-between blocks 0 and 126: I/O error.
028.38.999.9084 06/14/93 18:21:41 1 tapetest ERROR: Big write failed on
/dev/rmt/1ln, block 2084167: I/O error, sense key(0x0) = no sense.
028.38.999.9081 06/15/93 02:52:51 0 tapetest ERROR: Couldn't open /dev/rmt/0l:
I/O error.
004.27.999.9013 06/08/93 16:49:47 rdsk/c0t1d0s0 cdtest ERROR: Fail to read
205312 bytes at block 22527 (No such device or address)
The common theme seems to be systems with small disks (<=207MB) and Exabyte
tape drives. Similar configurations without small disks run without problems.
This problem has been seen on 2 different configurations running S1093 alpha1.6
and alpha2.0.
The first configuration this problem was encountered on was:
4/30
48MB Memory
Sbus: SPORT-8, FSBE
Peripheral Options: floppy disk
SCSI host adaptors:
C0 - macio
t0 - D:207MB,Quantum, LBox
t1 - T:150MB,Archive,LBox
t2 - T:5.0GB,8mm,Exabyte,DBox
t3 - D:LP535MB,Seagate,Internal
C1 - FSBE
t0 - D:104MB,Quantum,LBox
t1 - D:1.05GB,Seagate,LBox
t2 - CD:Sony,LBox
t3 - T:2GB,4mm,Connor,LBox
t4 - T:2.3GB,8mm,Exabyte,DBox
C2 - SPORT-8
t0 - D:1.05GB,Seagate,LBox
t1 - D:1.05GB,Seagate,LBox
t2 - D:1.05GB,Seagate,LBox
t3 - D:LP535MB,Seagate,LBox
Note that the label of c0t0 (D:207MB,Quantum, LBox) consistently gets
trashed when this problem occurs. See elmer:/export/bugtraq/etc/attached/\
[bugid]/config1/probe+format.out.Z.
See the messages, prtconf.out, console error messages, and sundiag* files in:
elmer:/export/bugtraq/etc/attached/[bugid]/config1
The second configuration was simpler:
4/15
16MB Memory
Sbus: FDDI (no software installed), 2nd Ethernet
SCSI host adaptor:
C0 - macio
t0 - D:LP207MB,Connor,LBox
t1 - CD:Sony,LBox
t2 - T:2.3GB,8mm,Exabyte,DBox
t3 - D:1.05GB,Seagate,Internal
Note that this problem could not be reproduced when c0t0
(D:LP207MB,Connor,LBox) was removed from the SCSI chain.
See the messages, prtconf.out, console error messages, and sundiag* files in:
elmer:/export/bugtraq/etc/attached/[bugid]/config2
(c boire 6/16/93)
I've now reproduced this on the following configuration:
4/15
32MB Memory
Sbus: FDDI (no software installed), CG6
SCSI host adaptor:
C0-macio
t0 - D:1.05GB,Seagate,,LBox
t1 - D:1.3GB,Seagate,DBox
t2 - T:5.0GB,8mm,Exabyte,DBox
t3 - D:424MB,Seagate,LBox
t4 - CD+:Sony,LBox
Note that the exact same configuration without t4 (CD+:Sony,LBox) ran
fine over this past weekend: 68+ hours, 13 system passes, 0 errors.
See the messages, prtconf.out, and sundiag files in:
elmer:/export/bugtraq/etc/attached/1134028/config3
(c boire 6/28/93)
Now reproduced this problem on an SLC configured as:
Memory: 16MB (4x4)
SCSI(53C90A)(0.2m):
D:LP207MB-Con-LunchBox(1.1m)
D:104MB-Qua-LunchBox(1.1m)
D:424MB-Sea-LunchBox(1.1m)
CD-Son-LunchBox(1.1m)
T:150MB-Arc-LunchBox(1.1m)
but not on a similarly configured ELC. I'm swapping the SLC and ELC chassis to
see if the problem follows the platform or the config. Ping me for sundiag
error messages or messages file if needed, but it's basically the same--seems
to be a greater percentage of "couldn't open" and "couldn't close" the tape
device, though, and some "sense key = 0x03 media error" messages.
History:
Submitter: frits Date: 06/02/93
Dispatch Operator: bugtraq Date: 06/02/93
Evaluator: frits Date: 06/02/93
Commit Operator: frits Date: 06/02/93
Fix Operator: ksam Date: 06/23/93
Integrating Operator: ksam Date: 06/23/93
Verify Operator: chris.boire@east Date: 09/27/93
Closeout Operator: bugmail Date: 11/30/93
===========================================================================
Bug Id: 1173973
Category: kernel
Subcategory: driver
Release summary: s1093
Synopsis: esp: scsi resets occuring more often with newer fab FAS286 chips
Integrated in releases:
Patch id:
Description:
On Sun4d systems (both sundragons and scorpions) we are getting more
scsi timeout resets now with the newer reduce die FAS286 chips (2400150).
They are no longer making the old style chip (2400121) anymore.
It doesn't seem to be configuration dependent.
On sundragons it occurs most often on tape devices.
The TTF is from 18 to 30 hrs and occurs on 10% to 40% of the
sundragon systems.
It also shows up more often with the 1/2 height exabyte 8 mm tape drive.
Work around:
none
History:
Submitter: lester Date: 08/03/94
Dispatch Operator: bugtraq Date: 08/03/94
Evaluator: frits Date: 08/03/94
Commit Operator: frits Date: 08/05/94
===========================================================================
Bug Id: 1162452
Category: aurora
Subcategory: scsi
Release summary: aurora_p2
Synopsis: Aurora system getting SCSI errors running LST tape and disk test
Integrated in releases:
Patch id:
Description:
David Bohman Mar. 31, 1994
A second Aurora configuration (Conf-3 sst9967) is getting SCSI errors.
This system fails running the LST disk and tape tests under Solaris 2.3 EdIII
(fcs without MS1 patches). The target units being tested were 0,1,2,3,4 and 5
on the system SCSI, first FSBE target 0 and second FSBE target 1.
The console messages indicate a polled command timeout and then a
SCSI reset. I believe the SCSI reset causes the LST tape test to fail and
the tape test then stops running. After the tape test stops running I have
not seen any SCSI errors generated.
The LST disk and tape tests have run overnight successfully after
removing one tape from the system SCSI bus. I removed target 5 (5.0Gbyte
8mm tape DBox) from the system SCSI bus and the LST disk and tape tests ran
overnight.
The system has run sundiag successfully overnight with all SCSI
devices selected.
Conf-3 system SCSI bus is the maximum length of 6 meters. The devices
on the system SCSI bus are listed below.
1.05Gbyte Connar internal - target 1
535Mbyte Seagate internal - target 3
Internal Slim CD - target 6
104Mbyte LBox - target 0
LP 1.05Gbyte LBox - target 2
2.3Gbyte 8mm tape DBox - target 4
5.0Gbyte 8mm tape DBox - target 5
The SCSI bus length from the SCSI hand book is listed below.
Aurora (SS5) - 1.6
(4) 530-1793-02 - 3.2
104Mbyte LBox - .3
LP 1.05Gbyte LBox - .3
2.3Gbyte 8mm tape DBox - .3
5.0Gbyte 8mm tape DBox - .3
----
Total - 6.0
note: 150-1785-02 SCSI terminator is being used
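The length budget above is simple addition, but it is easy to get wrong when
devices are added or removed, so here is a sketch that redoes the sum and
compares it with the 6 meter maximum quoted in the report. The segment labels
are paraphrased from the table, the values are taken to be meters, and Decimal
is used only so that the tenths add up exactly.

```python
from decimal import Decimal

# (label, length in meters) as listed in the bus length table above.
SEGMENTS = [
    ("Aurora (SS5)", "1.6"),
    ("(4) 530-1793-02", "3.2"),
    ("104Mbyte LBox", "0.3"),
    ("LP 1.05Gbyte LBox", "0.3"),
    ("2.3Gbyte 8mm tape DBox", "0.3"),
    ("5.0Gbyte 8mm tape DBox", "0.3"),
]

# Maximum bus length quoted in the report.
LIMIT_M = Decimal("6.0")


def total_length(segments):
    """Sum the segment lengths exactly."""
    return sum((Decimal(meters) for _, meters in segments), Decimal("0"))


def within_limit(segments, limit=LIMIT_M):
    """True when the bus is no longer than the limit."""
    return total_length(segments) <= limit


if __name__ == "__main__":
    print("total:", total_length(SEGMENTS), "m; within limit:",
          within_limit(SEGMENTS))
```

With the table as given the total lands exactly on the limit, so there is no
margin left for any further external device on that bus.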
A sample of the SCSI errors seen in the system console window can
be found in:
/net/elmer/export/bugtraq/etc/attacched/<Bug ID>/conf-3.console.mesage.Z
A sample of the LST report after a failure can be found in:
/net/elmer/export/bugtraq/etc/attacched/<Bug ID>/conf-3.lst.report.Z
The full Aurora system configuration is listed below:
Aurora (2.3 swift 70MHz)
88 MByte Memory
(2) 32Mbyte simms (J300, J301)
(3) 8Mbyte simms (J302, J303, J400)
Sbus slot 1 - FSBE (5012015)
Sbus slot 2 - CG3 (5011718)
Sbus slot 3 - FSBE (5012015)
Internal floppy
System SCSI interface:
1.05Gbyte Connar internal - target 1
535Mbyte Seagate internal - target 3
Internal Slim CD - target 6
104Mbyte LBox - target 0
LP 1.05Gbyte LBox - target 2
2.3Gbyte 8mm tape DBox - target 4
5.0Gbyte 8mm tape DBox - target 5
FSBE slot1 interface to --> 207Mbyte LBox (target 0)
FSBE slot2 interface to --> 1.05Gbyte LBox (target 1)
History:
Submitter: david.bohman@east Date: 03/31/94
Dispatch Operator: bugtraq Date: 03/31/94
Evaluator: scott.oconnor@east Date: 03/31/94
Closeout Operator: tsoydan Date: 05/06/94
===========================================================================
Patch-ID# 102002-01
Keywords: esp scsi resets with newer fab FAS286 chips
Synopsis: SunOS 5.4: esp: scsi resets occuring often with newer fab FAS286
chips
Date: Sep/07/94
Solaris Release: 2.4
SunOS release: 5.4
Unbundled Product:
Unbundled Release:
Topic: SunOS 5.4: esp: scsi resets occuring more often with newer fab FAS286
chips
BugId's fixed with this patch: 1173973
Changes incorporated in this version:
Relevant Architectures: sparc
Patches accumulated and obsoleted by this patch:
Patches which conflict with this patch:
Patches required with this patch:
Obsoleted by:
Files included with this patch:
/kernel/drv/esp
Problem Description:
1173973 esp: scsi resets occuring more often with newer fab FAS286 chips
On Sun4d systems (both sundragons and scorpions) we are getting more
scsi timeout resets now with the newer reduce die FAS286 chips (2400150).
They are no longer making the old style chip (2400121) anymore.
It doesn't seem to be configuration dependent.
On sundragons it occurs most often on tape devices.
It also shows up more often with the 1/2 height exabyte 8 mm tape drive.