Well, I was a bit surprised to not have received any responses regarding this issue. Replacing the drive at target 9 resolved the SCSI errors, and I haven't seen any bus resets since then. I still don't know why the server became unresponsive.

-Damian

-----Original Message-----
From: Wiest, Damian
Sent: Thursday, July 27, 2006 3:01 PM
To: 'sunmanagers@sunmanagers.org'
Subject: E250 Hang Related to SCSI Errors?

Hello everyone,

I've been fortunate enough to not have had any major issues for some time; however, one of our development machines had some problems over the past weekend. It's an old E250 with two 400MHz UltraSPARC-II processors and a gigabyte of main memory; it's running the 11/99 release of Solaris 7 and I know it's not up-to-date on patches. All of the internal drive bays are populated and we're using SVM for two-way mirroring of the filesystems on these drives. Additionally, there's a Symbios card installed for an attached D1000.

Recently we began to receive SCSI timeouts and transport error notifications in /var/adm/messages.
For example,

Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01 Cmd (0x2126f10) dump for Target 9 Lun 0:
Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01 cdb=[ 0x28 0x0 0x2 0x18 0xba 0x58 0x0 0x0 0x10 0x0 ]
Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01 pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
Jul 27 09:03:51 lcidev01 unix: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01 pkt_scbp=0x0 cmd_flags=0x860
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Jul 27 09:03:51 lcidev01 Connected command timeout for Target 9.0
Jul 27 09:03:51 lcidev01 unix: WARNING: ID[SUNWpd.glm.cmd_timeout.6017]
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'timeout': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command

This particular system has been up and running continuously for about two years; when I came in on Monday morning, it was not accessible via the network and the graphical display was blank. The system was responding to ping requests. As we don't have a serial console attached, I was forced to improperly power-down the machine.
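
For anyone triaging a similar excerpt, here is a minimal sketch of tallying the "SCSI transport failed" reasons per disk, which can make the odd one out easier to spot. It assumes one syslog record per line, with the device path on one record and the failure reason on the next, as in the excerpt above. The function name `tally_transport_failures` and the `SAMPLE` constant are hypothetical, introduced only for illustration; the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Device header record, e.g. "... /pci@1f,4000/scsi@3/sd@9,0 (sd8):"
# Group 1 is the SCSI target in hex, group 2 is the sd instance name.
DEVICE_RE = re.compile(r"/sd@([0-9a-f]+),\d+ \((sd\d+)\):")
# Failure record, e.g. "... SCSI transport failed: reason 'reset': ..."
REASON_RE = re.compile(r"SCSI transport failed: reason '(\w+)'")


def tally_transport_failures(lines):
    """Count transport failures keyed by (sd instance, target, reason).

    Assumption: the reason record follows its device header record
    (or shares a line with it).
    """
    counts = Counter()
    pending = None  # device seen on the most recent header record
    for line in lines:
        dev = DEVICE_RE.search(line)
        if dev:
            pending = (dev.group(2), int(dev.group(1), 16))
        reason = REASON_RE.search(line)
        if reason and pending is not None:
            counts[pending + (reason.group(1),)] += 1
            pending = None  # each header is paired with one reason
    return counts


# Sample records copied from the excerpt above.
SAMPLE = """\
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@8,0 (sd7):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@9,0 (sd8):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'timeout': retrying command
Jul 27 09:03:51 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@a,0 (sd9):
Jul 27 09:03:51 lcidev01 SCSI transport failed: reason 'reset': retrying command
""".splitlines()

for (dev, target, reason), n in sorted(tally_transport_failures(SAMPLE).items()):
    print(f"{dev} (target {target}): {reason} x{n}")
```

In the sample, every disk on the bus reports 'reset', but only sd8 (target 9) also reports 'timeout', which matches the glm "Connected command timeout for Target 9.0" warning. To run it over a whole log, pass an open file (for example, the lines of /var/adm/messages) instead of `SAMPLE`.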

After it came back up, I checked the messages file and saw the following entries immediately prior to my power-down:

Jul 22 16:11:08 lcidev01 unix: WARNING: /pci@1f,4000/scsi@3/sd@b,0 (sd10):
Jul 22 16:11:08 lcidev01 SCSI transport failed: reason 'reset': retrying command

Metastat was showing all of the sub-mirrors on the internal disks as being in either the Maintenance or Last Errored state, and the time of the status change for these metadevices was listed as "Sat Jul 22 20:41:07 2006". A "metareplace -e" was sufficient to re-sync the devices. Iostat is currently showing transport and hard errors on the sd targets listed above. I didn't see any problems with the D1000.

What I'm wondering is whether or not it's possible that either the SCSI errors or the bus reset caused the system to hang and caused the SVM errors. I'm also wondering what procedure I _should_ have followed when the system appeared to be hung as well as where besides /var/adm/messages I could look to find more information about why this occurred.

From what I can tell, it's likely that I have a failing drive on the SCSI chain that's causing the timeout issues. Is a bad disk the likely culprit for the SCSI errors? Could this also be responsible for the system hanging?

TIA!

-Damian

ps. Yes, I want to get Solaris 10 on this server, but I'm waiting until I can pull some other E250's out of production service.

Received on Thu Aug 3 12:07:50 2006