Thanks to all who replied, including: Scott Lawson, Tim Bradshaw, Stefan Varga, Sandesh Kubde, 'hike', Bryan Bahnmiller and Robert M. Martel.

We were successful in getting the disk swapped without having to reboot or panic the box. Some suggested I would need a reboot to fix it, which the customer was having none of.

Mr. Lawson suggested we use cfgadm. The drive's WWN did show up in 'cfgadm -al', but nowhere in the server documentation did it say to use cfgadm on any of the FC-AL disks. One of my colleagues suggested it yesterday while we were preparing for the change, and we considered it for a bit last night, but since this is such a mission-critical, eggs-all-in-one-basket server, I opted to go strictly by the book, so that in case of catastrophe I could at least claim I had followed the book. A couple of you offered that since the disk is dead to luxadm, you can just pull it. It would be interesting to try these things just to see whether they work; unfortunately, my lab is customers' production boxes, so the opportunity to experiment is limited.

We determined that, as others suggested, the disk was too far gone for luxadm to communicate with it. When we executed 'luxadm remove_device <devicename>' (here <devicename> is /dev/rdsk/c1t0d0s2), luxadm couldn't check the status of the drive: the procedure got as far as printing the first line ('Make sure the filesystems are backed up...') and then failed out with a SCSI error. We studied the steps of remove_device and determined that it roughly removes the device from the device tree, offlines it, and possibly even powers it down.

So, after executing 'luxadm -e offline <devicename>', we verified the disk no longer showed up in 'luxadm inq /dev/rdsk/c?t?d?s2' or in format. We then executed 'devfsadm -C' to clear the device from the /dev device list. After that, I had the DC Engineer check whether the light on the drive was out. It wasn't, but it was burning solidly, whereas the light on the other disk was showing activity. Since the system otherwise no longer knew about the disk, I crossed my fingers and had the Engineer swap it. I monitored the system via the console and noted that picld saw the drive pulled and re-inserted. I then verified the disk showed up in format, and executed 'devfsadm -C' again to rebuild the /dev device list. From then on, it was the usual Disk Suite disk replacement process.

Mr. Martel offered these steps for a failed disk on an A5200 array:

"I had this problem with a Sun A5200 array - disk too far gone for luxadm to talk to it. The procedure Sun gave me had me bypassing the ports on the failed disk using the front panel controls - I don't know the 280R, but I'd guess you don't have such controls available."

"What happened after I followed Sun's special procedure to replace the failed disk was that the new disk was not accessible. I then ran luxadm remove_device, popped the disk out when prompted, and ran luxadm insert_device and re-installed the replacement disk. From then on all was normal again."

Unfortunately, I couldn't talk to Sun, as the status of the maintenance contract on this system is being investigated. Even then, most of the support we have is Gold, and this was well outside Gold support hours. The new disk may end up being T&M.

Thanks to all who responded; it is nice to know people are out there listening and offering help when you are stressed out, sitting at the keyboard all alone in the middle of the night, trying to keep the machine from falling over.
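For anyone who finds this in the archives, here is the working sequence condensed from the narrative above. The /dev/rdsk/c1t0d0s2 path is our failed disk0; adjust the controller/target for your own layout, and treat this as a rough recap rather than a canned procedure:

  (confirm the drive really is dead to luxadm; ours just returned 'Error: SCSI failure')
  # luxadm inq /dev/rdsk/c1t0d0s2

  (take the dead drive out of the driver's view, since remove_device won't run against it)
  # luxadm -e offline /dev/rdsk/c1t0d0s2

  (clear the stale entries out of /dev and confirm the disk is gone from format)
  # devfsadm -C
  # format

  -- physically swap the drive here; picld should log the removal and
     insertion on the console --

  (rebuild /dev and confirm the new disk is visible before handing it back to Disk Suite)
  # devfsadm -C
  # format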
Regards,
Gene Beaird
Pearland, Texas

-----Original Message-----
From: Gene Beaird [mailto:bgbeaird@sbcglobal.net]
Sent: Wednesday, July 09, 2008 10:34 PM
To: 'sunmanagers@sunmanagers.org'
Subject: luxadm remove_device SCSI failed error on Sunfire 280R

I have a failed disk0 on a SunFire 280R. It is part of a mirrored pair, mirrored with Disk Suite. I have broken the mirror and metacleared the devices. According to the SunFire 280R Service Manual and Owner's Manual, I am supposed to remove the bad disk from the system using the luxadm remove_device command before I physically swap the drive out. When I execute luxadm remove_device /dev/rdsk/c1t0d0s2, I get:

  Error: SCSI failure. - /dev/rdsk/c1t0d0s2

which is the same message I get for that disk when I execute luxadm inq /dev/rdsk/c?t?d?s2. I don't see a WWN in luxadm for that device. What's wrong, and how do I get this fixed? Thank you all.

Regards,
Gene Beaird, CISSP
Unix Support Engineer
Pearland, Texas
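One footnote on the Disk Suite side, which both messages gloss over: the 'break the mirror / metaclear' step and the rebuild afterwards would look roughly like the sketch below. The metadevice names (d0 as the mirror, d20 as the submirror on the failed disk), the s0 and s7 slices, and c1t1d0 as the surviving disk are all assumptions for illustration; only the failed c1t0d0 target comes from the messages above.

  (force-detach and clear the submirror that lived on the dead disk -- names assumed)
  # metadetach -f d0 d20
  # metaclear d20

  (after the physical swap: copy the label from the surviving disk onto the new one)
  # prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2

  (if state database replicas lived on the failed disk, delete the stale ones and recreate
   them -- s7 is just the conventional replica slice, not taken from the original post)
  # metadb -d c1t0d0s7
  # metadb -a -c 2 c1t0d0s7

  (recreate the submirror on the new disk, reattach it, and watch the resync)
  # metainit d20 1 1 c1t0d0s0
  # metattach d0 d20
  # metastat d0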