First question I've sent here and not received a single answer. In my experience this usually means the question is badly phrased or just too boring ;). On either count I apologize :) I still don't understand how/why disksuite got itself into this state. I guess I'll file this in the 'Acts of God' folder. all the best, Mike -- Michael Kiernan Onet.pl S.A. Krakow, Poland http://www.onet.pl/ > Last night we lost a 170GB Disksuite RAID-5 volume. > We were unable to rescue any data and are still > running a recovery. > > E420R running Solaris 8, Disksuite 4.2.1. > > There are 6 disks in the volume, all on controller 1 > (no comments). At the time, metastat reported one > disk as 'Maintenance' then another as 'Last Erred'. > > c1t10d0s0 5082 No Maintenance > c1t11d0s0 5082 No Okay > c1t12d0s0 5082 No Last Erred > > sd24 = c1t10d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@a,0:a > sd25 = c1t11d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@b,0:a > sd26 = c1t12d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@c,0:a > > The first error (in the messages file) does indeed > appear to have occurred on the disk that was marked > Maintenance, and the second error on the Last Erred > disk. The timestamp for the messages is the same. > An error appears a few secs later for the 3rd disk > shown above. > > The problem is that the disk that had actually failed > was the one that was marked 'Last Erred' (ie t12) and > not the one that was marked 'Maintenance' (t10). > On replacing the physically bad disk t12 it was thus > not possible to add t10 back [there is no way to remove > the newly added empty disk from the volume]. metareplacing > t10 would have caused further data corruption. > > I suspect that even if we had managed to recover the > volume there was data corruption [look at the vxfs > errors below]. > > What I want to know is: > > a) Why was the wrong disk marked bad. > b) Why is there data corruption in these 'last erred' scenarios? > c) Is there any way to recover from this situation. > d) bonus question: where does disksuite store raid5 vol > status information [the /etc/md files?] > > Will Summarize. > > Regards, > Mike Kiernan > > [ more data/messages below ] > > 533 $>uname -a > SunOS random 5.8 Generic_108528-06 sun4u sparc > > 532 $>pkginfo -l SUNWmdu > PKGINST: SUNWmdu > NAME: Solstice DiskSuite Commands > CATEGORY: system > ARCH: sparc > VERSION: 4.2.1,REV=1999.11.04.18.29 > BASEDIR: / > VENDOR: Sun Microsystems, Inc. > DESC: Solstice DiskSuite Commands > PSTAMP: 11/04/99-18:31:00 > INSTDATE: Apr 04 2001 10:10 > VSTOCK: 258-6252-11 > HOTLINE: Please contact your local service provider > STATUS: completely installed > FILES: 57 installed pathnames > 11 shared pathnames > 12 directories > 20 executables > 11230 blocks used (approx) > > Did not have the Disksuite 4.2.1 product patch 108693-07, > but this patch contains nothing relevant. > > 534 $>metastat -p | grep d20 > d20 -r c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t10d0s0 c1t11d0s0 c1t12d0s0 -k -i > 32b > > 535 $>metastat > d20: RAID > State: Needs Maintenance > Invoke: metareplace -f d20 c1t10d0s0 <new device> > Interlace: 128 blocks > Size: 358355376 blocks > Original device: > Size: 358356480 blocks > Device Start Block Dbase State Hot Spare > c1t2d0s0 5082 No Okay > c1t3d0s0 5082 No Okay > c1t4d0s0 5082 No Okay > c1t10d0s0 5082 No Maintenance > c1t11d0s0 5082 No Okay > c1t12d0s0 5082 No Last Erred > > the disks are: > sd24 = c1t10d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@a,0:a > sd25 = c1t11d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@b,0:a > sd26 = c1t12d0s0 -> ../../devices/pci@1f,4000/scsi@3,1/sd@c,0:a > > /var/adm/messages: > > Sep 11 14:57:51 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@a,0 (sd24): > Sep 11 14:57:51 random SCSI transport failed: reason > 'incomplete': retrying command > Sep 11 14:57:51 random > Sep 11 14:57:51 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@c,0 (sd26): > Sep 11 14:57:51 random SCSI transport failed: reason > 'incomplete': retrying command > Sep 11 14:57:51 random > Sep 11 14:57:52 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@b,0 (sd25): > Sep 11 14:57:52 random SCSI transport failed: reason > 'incomplete': retrying command > Sep 11 14:57:52 random > <-- snip > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random Cmd (0x3f094030) dump for Target 12 Lun > 0: > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random cdb=[ 0x28 0x0 0x3 0x9f 0xd9 0xe6 > 0x0 0x0 0x4 0x0 ] > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random pkt_flags=0x4000 pkt_statistics=0x61 > pkt_state=0x7 > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random pkt_scbp=0x0 cmd_flags=0x8e1 > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random Cmd (0x3f095b30) dump for Target 12 Lun > 0: > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random cdb=[ 0x28 0x0 0x0 0x7b 0x1 0x4a > 0x0 0x0 0x4 0x0 ] > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random pkt_flags=0x4000 pkt_statistics=0x61 > pkt_state=0x7 > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random pkt_scbp=0x0 cmd_flags=0x8e1 > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random Cmd (0x3f36f6c0) dump for Target 12 Lun > 0: > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random cdb=[ 0x28 0x0 0x0 0xdb 0xd9 0x4e > 0x0 0x0 0x4 0x0 ] > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random pkt_flags=0x4000 pkt_statistics=0x61 > pkt_state=0x7 > Sep 11 14:58:58 random scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3,1 > (glm1): > Sep 11 14:58:58 random pkt_scbp=0x0 cmd_flags=0x8e1 > Sep 11 14:58:58 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1 (glm1): > Sep 11 14:58:58 random Disconnected tagged cmd(s) (3) timeout > for Target 12.0 > Sep 11 14:58:58 random glm: [ID 401478 kern.warning] WARNING: > ID[SUNWpd.glm.cmd_timeout.6018] > Sep 11 14:58:58 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@c,0 (sd26): > Sep 11 14:58:58 random SCSI transport failed: reason 'reset': > retrying command > Sep 11 14:58:58 random > Sep 11 14:58:58 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@c,0 (sd26): > Sep 11 14:58:58 random SCSI transport failed: reason 'timeout': > retrying command > Sep 11 14:58:58 random > Sep 11 14:59:03 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@a,0 (sd24): > Sep 11 14:59:03 random Error for Command: > read(10) Error Level: Fatal > Sep 11 14:59:03 random scsi: [ID 107833 kern.notice] Requested Block: > 57638178 Error Block: 57638178 > Sep 11 14:59:03 random scsi: [ID 107833 kern.notice] Vendor: > SEAGATE Serial Number: 3DE047KQ > Sep 11 14:59:03 random scsi: [ID 107833 kern.notice] Sense Key: Not > Ready > Sep 11 14:59:03 random scsi: [ID 107833 kern.notice] ASC: 0x4 (<vendor > unique code 0x4>), ASCQ: 0x1, FRU: 0x2 > Sep 11 14:59:03 random md_raid: [ID 112651 kern.warning] WARNING: md > d20: read error on /dev/dsk/c1t10d0s0 > Sep 11 14:59:04 random proftpd[26721]: [ID 567783 daemon.notice] random > (dial521.walbrzych.dialog.net.pl[217.30.158.13]) - FSep 11 14:59:05 > random proftpd[26726]: [ID 567783 daemon.notice] random > (dial521.walbrzych.dialog.net.pl[217.30.158.13]) - FSep 11 14:59:08 > random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@b,0 (sd25): > Sep 11 14:59:08 random Error for Command: > read(10) Error Level: Fatal > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] Requested Block: > 32651478 Error Block: 32651478 > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] Vendor: > SEAGATE Serial Number: 3DE046S4 > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] Sense Key: Not > Ready > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] ASC: 0x4 (<vendor > unique code 0x4>), ASCQ: 0x1, FRU: 0x2 > Sep 11 14:59:08 random md_raid: [ID 371651 kern.warning] WARNING: md > d20: write error on /dev/dsk/c1t11d0s0 > Sep 11 14:59:08 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@a,0 (sd24): > Sep 11 14:59:08 random Error for Command: > read(10) Error Level: Fatal > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] Requested Block: > 47632810 Error Block: 47632810 > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] Vendor: > SEAGATE Serial Number: 3DE047KQ > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] Sense Key: Not > Ready > Sep 11 14:59:08 random scsi: [ID 107833 kern.notice] ASC: 0x4 (<vendor > unique code 0x4>), ASCQ: 0x1, FRU: 0x2 > Sep 11 14:59:08 random md_raid: [ID 112651 kern.warning] WARNING: md > d20: read error on /dev/dsk/c1t10d0s0 > Sep 11 14:59:11 random proftpd[395]: [ID 567783 daemon.warning] random - > MaxInstances (50) reached, new connection denied. > Sep 11 14:59:11 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@c,0 (sd26): > Sep 11 14:59:11 random Error for Command: > read(10) Error Level: Fatal > Sep 11 14:59:11 random scsi: [ID 107833 kern.notice] Requested Block: > 8441182 Error Block: 8441182 > Sep 11 14:59:11 random scsi: [ID 107833 kern.notice] Vendor: > SEAGATE Serial Number: 3DE046W4 > Sep 11 14:59:11 random scsi: [ID 107833 kern.notice] Sense Key: Not > Ready > Sep 11 14:59:11 random scsi: [ID 107833 kern.notice] ASC: 0x4 (<vendor > unique code 0x4>), ASCQ: 0x1, FRU: 0x2 > Sep 11 14:59:11 random md_raid: [ID 112651 kern.warning] WARNING: md > d20: read error on /dev/dsk/c1t12d0s0 > Sep 11 14:59:13 random proftpd[395]: [ID 567783 daemon.warning] random - > MaxInstances (50) reached, new connection denied. > Sep 11 14:59:13 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@b,0 (sd25): > Sep 11 14:59:13 random Error for Command: > read(10) Error Level: Fatal > Sep 11 14:59:13 random scsi: [ID 107833 kern.notice] Requested Block: > 41253546 Error Block: 41253546 > Sep 11 14:59:13 random scsi: [ID 107833 kern.notice] Vendor: > SEAGATE Serial Number: 3DE046S4 > Sep 11 14:59:13 random scsi: [ID 107833 kern.notice] Sense Key: Not > Ready > Sep 11 14:59:13 random scsi: [ID 107833 kern.notice] ASC: 0x4 (<vendor > unique code 0x4>), ASCQ: 0x1, FRU: 0x2 > Sep 11 14:59:13 random scsi: [ID 107833 kern.warning] WARNING: > /pci@1f,4000/scsi@3,1/sd@a,0 (sd24): > Sep 11 14:59:13 random Error for Command: > read(10) Error Level: Fatal > > <-- snip > Sep 11 15:01:38 friko02 vxfs: [ID 710715 kern.warning] WARNING: msgcnt 1 > vxfs: mesg 037: vx_metaioerr - /dev/md/dsk/d20 file system meta data > write error > Sep 11 15:01:38 friko02 vxfs: [ID 486312 kern.warning] WARNING: msgcnt 2 > vxfs: mesg 025: vx_wsuper - /dev/md/dsk/d20 file system super-block > update failed > Sep 11 15:01:38 friko02 vxfs: [ID 863296 kern.warning] WARNING: msgcnt 3 > vxfs: mesg 017: vx_delbuf_flush - /fw1 file system inode 3280982 marked > bad > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000000 41ed 5 258 258 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000010 0 2000 3b9e0a4d 8ccb6 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000020 3b9e0a4d 80966 3b9e0a4d 80966 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000030 10000 0 676a2f 0 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000040 1 1 0 92 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000050 0 0 0 0 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000060 141b240 1 0 0 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000070 0 0 0 0 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000080 0 0 0 0 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x00000090 0 0 0 0 > Sep 11 15:01:38 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 3 offset > 0x000000a0 0 0 0 0 > Sep 11 15:01:38 friko02 vxfs: [ID 214594 kern.info] vxfs msgcnt 3 offset > 0x000000b0 0 0 > Sep 11 15:01:38 friko02 vxfs: [ID 486440 kern.warning] WARNING: msgcnt 4 > vxfs: mesg 025: vx_wsuper - /dev/md/dsk/d20 file systSep 11 15:01:38 > friko02 vxfs: [ID 217438 kern.warning] WARNING: msgcnt 5 vxfs: mesg 015: > vx_ilisterr - /fw1 file system can't > Sep 11 15:01:38 friko02 vxfs: [ID 962359 kern.warning] WARNING: msgcnt 6 > vxfs: mesg 031: vx_disable - /dev/md/dsk/d20 file sys > > <---snip > Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344 > offset 0x00000000 80100 17481e04 191404f0 1c541a4c > Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344 > offset 0x00000010 9bc1248 1b280000 17341fb4 16c4 > Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344 > offset 0x00000020 e0c 165418f4 c8003b0 1cb01018 > Sep 11 15:01:48 friko02 vxfs: [ID 885974 kern.info] vxfs msgcnt 36128344 > offset 0x00000030 ff41ce4 1098 d900000 1a281e14Received on Thu Sep 13 12:21:57 2001
This archive was generated by hypermail 2.1.8 : Wed Mar 23 2016 - 16:32:30 EDT