SUMMARY: E250 Hang Related to SCSI Errors?

From: Wiest, Damian <dmwiest_at_rc2corp.com>
Date: Thu Aug 03 2006 - 12:06:55 EDT

Well, I was a bit surprised not to have received any responses regarding
this issue.

Replacing the drive at target 9 resolved the SCSI errors, and I haven't seen
any bus resets since then.  I still don't know why the server became
unresponsive.

-Damian

-----Original Message-----
From: Wiest, Damian 
Sent: Thursday, July 27, 2006 3:01 PM
To: 'sunmanagers@sunmanagers.org'
Subject: E250 Hang Related to SCSI Errors?

Hello everyone,

I've been fortunate enough not to have had any major issues for some time;
however, one of our development machines had some problems over the past
weekend.

It's an old E250 with two 400MHz UltraSPARC-II processors and a gigabyte of
main memory; it's running the 11/99 release of Solaris 7 and I know it's not
up-to-date on patches.  All of the internal drive bays are populated and
we're using SVM for two-way mirroring of the filesystems on these drives.
Additionally, there's a Symbios card installed for an attached D1000.
Recently we began to receive SCSI timeouts and transport error notifications
in /var/adm/messages.  For example,

Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01        Cmd (0x2126f10) dump for Target 9 Lun 0:
Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01         cdb=[ 0x28 0x0 0x2 0x18 0xba 0x58 0x0 0x0 0x10 0x0 ]
Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01        pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01        pkt_scbp=0x0 cmd_flags=0x860
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01        Connected command timeout for Target 9.0
Jul 27 09:03:51 lcidev01 unix: WARNING: ID[SUNWpd.glm.cmd_timeout.6017]
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'timeout': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset': retrying command
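
For reference, the cdb bytes in the command dump above can be decoded by hand.
The following is only a minimal Python sketch, assuming the standard SCSI
READ(10) layout (opcode 0x28, a big-endian logical block address in bytes 2-5,
and a big-endian transfer length in bytes 7-8).  The function name
decode_read10 is made up for illustration and is not part of any Solaris tool.

```python
import re


def decode_read10(cdb_text):
    """Decode a READ(10) CDB written as hex bytes, e.g. from a glm dump line."""
    # Pull every 0x.. token out of the text and convert it to an integer.
    cdb = [int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]+", cdb_text)]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2-5: logical block address; bytes 7-8: number of blocks to read.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return {"opcode": cdb[0], "lba": lba, "blocks": blocks}


info = decode_read10("cdb=[ 0x28 0x0 0x2 0x18 0xba 0x58 0x0 0x0 0x10 0x0 ]")
print(info)
```

Under that assumption, the command that timed out on target 9 was a read of
16 blocks starting at block 35175000.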

This particular system had been up and running continuously for about two
years.  When I came in on Monday morning, it was not accessible via the
network and the graphical display was blank, although the system was still
responding to ping requests.  As we don't have a serial console attached, I
was forced to power the machine down improperly.  After it came back up, I
checked the messages file and saw the following entries immediately prior to
my power-down:

Jul 22 16:11:08 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@b,0 (sd10):
Jul 22 16:11:08 lcidev01 SCSI transport failed: reason 'reset': retrying command

Metastat was showing all of the sub-mirrors on the internal disks as being
in either the Maintenance or Last Errored state, and the time of the status
change for these metadevices was listed as "Sat Jul 22 20:41:07 2006".  A
"metareplace -e" was sufficient to re-sync the devices.  Iostat is currently
showing transport and hard errors on the sd targets listed above.  I didn't
see any problems with the D1000.

What I'm wondering is whether or not it's possible that either the SCSI
errors or the bus reset caused the system to hang and caused the SVM errors.
I'm also wondering what procedure I _should_ have followed when the system
appeared to be hung as well as where besides /var/adm/messages I could look
to find more information about why this occurred.  From what I can tell,
it's likely that I have a failing drive on the SCSI chain that's causing the
timeout issues.
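
One way to narrow down which drive is at fault is to tally the "SCSI
transport failed" reasons per sd device: a bus reset is reported by every
target on the chain, while a timeout points at a particular target.  The
following is only a minimal Python sketch; the function name
tally_transport_failures is hypothetical, and it assumes that each reason
line follows the WARNING line naming its device, as in the excerpt above.

```python
import re
from collections import defaultdict

DEVICE_RE = re.compile(r"\((sd\d+)\):")
REASON_RE = re.compile(r"SCSI transport failed: reason '(\w+)'")


def tally_transport_failures(lines):
    """Count transport-failure reasons per sd device in syslog lines."""
    counts = defaultdict(lambda: defaultdict(int))
    device = None
    for line in lines:
        dev = DEVICE_RE.search(line)
        if dev:
            device = dev.group(1)  # remember the most recently named device
        reason = REASON_RE.search(line)
        if reason and device is not None:
            counts[device][reason.group(1)] += 1
    return {dev: dict(reasons) for dev, reasons in counts.items()}


# Sample lines taken from the messages excerpt earlier in this post.
SAMPLE = """\
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset':
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset':
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset':
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'timeout':
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 27 09:03:51 lcidev01        SCSI transport failed: reason 'reset':
"""

result = tally_transport_failures(SAMPLE.splitlines())
for dev, reasons in sorted(result.items()):
    print(dev, reasons)
```

In this sample only sd8 (target 9) logs a timeout; the other devices log
resets alone.  To run it against a full log, pass an open copy of
/var/adm/messages in place of the sample lines.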

Is a bad disk the likely culprit for the SCSI errors?  Could this also be
responsible for the system hanging?

TIA!

-Damian

P.S. Yes, I want to get Solaris 10 on this server, but I'm waiting until I
can pull some other E250s out of production service.
