Thanks to: Guy Purcell, Scott Howard, Neil Harrison, James Brown, Dan Lorenzini, Tom Payerle, Gregg Mackenzie, Richard Eisenman, and John Eisenschmidt.

It turns out the disk must have been bad. Following the advice below, I tried to metareplace the mirrors with themselves, but the resync failed and I started getting SCSI errors. So I metadetached the mirrors on the problem disk, shut down, and slapped in a new disk. I partitioned the new disk with the same slice info (FYI: I tried an fmthard with the VTOC info from the failed drive, but since the new disk was a different type/geometry, this failed, so I recreated the partitions by hand, making sure each was slightly larger than the old one). Then simply re-attaching the mirrors rebuilt the data.

Thanks very much to everyone for their help.

Mark
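
For reference, the slice-copy step described above is usually done by piping prtvtoc output into fmthard; a minimal sketch, assuming the surviving half of the mirror (c2t0d0) as the template (this only works when both disks share the same geometry, which is why it failed here with a different disk type):

    # Copy the VTOC from the good mirror half to the replacement disk.
    # Both disks must have identical geometry for fmthard to accept it.
    prtvtoc /dev/rdsk/c2t0d0s2 | fmthard -s - /dev/rdsk/c2t1d0s2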
=======================================================================
Sometimes [a reboot does clear it up], but it almost never tells you what the problem was, so you won't know how to fix it the next time it happens.

Personally, I'd try an approach somewhere in between 2 & 3 first. If the disk has physical problems, then #2 is a waste of time. But if the problems aren't severe enough to require replacement, then #3 is overkill--at least for now. (If the problems are physical, I'd definitely want the disk replaced; it's just better to do replacements when you _want_ to than when you _have_ to.)

I'd metadetach the submirrors on the bad disk (all of 'em), then reformat the disk to find/remove bad regions, and finally metattach the submirrors again. All of that can be done without taking any services down. If format reports tons of bad blocks, or you continue to see SCSI errors, replace the disk.

You don't say what system houses the disk in question; if it's hot-swappable, you should be able to do a complete disk replacement & mirror resync while the system is up & running.

-- Guy (guy@extragalactic.net)

=======================================================================
There are two real options you can take here...

1. Reattach the mirrors. The best way to do this is with metareplace:

   metareplace -e d2 c2t1d0s0
   metareplace -e d8 c2t1d0s3

2. Swap the disk.

Personally, I'd go for number 1 and see what happens. If the disk really is bad, it will fail again either during the resync or shortly afterwards, at which point you'll be no worse off than you are now and you'll have to take option 2.

Scott

=======================================================================
The first thing to try with your DiskSuite problem would be to do a virtual replace of the dodgy metadevices, i.e.:

   for the d1 submirror, do "metareplace -e d2 c2t1d0s0"
   for the d7 submirror, do "metareplace -e d8 c2t1d0s3"

A "metastat" command should show the mirrors syncing; there is no need to reboot.

Hope this helps,

Neil Harrison

=======================================================================
Yes, I have that happen all the time; I don't know why. Just one slice of a disk will go offline while the other slices are fine. This is how you correct it (in the DiskSuite GUI):

1) Bring the mirror into the main window.
2) Right-click on the offending slice that is offline and click on Info.
3) Click on ENABLE.
4) Commit the transaction.

If all is fine with the disk, it should begin mirroring again and all will be well. I hope this is actually your problem. In fact, I just did it 5 minutes ago myself.

<James Brown>

=======================================================================
The first thing I would try is to use format(1M) to "repair" the disk. The safest way to do this is to use the "read" command of the "analyze" menu. If it finds a bad block, it will attempt to "repair" it (actually, it maps it to a spare sector). This is the default behavior unless you change it using the "setup" command.

If the read pass goes through without errors, you might consider doing one of the write options. In this case you can use setup to limit the range of the test to the affected partitions. Since they are in "maintenance" mode, DiskSuite will not be updating them while you run your test.

I have used this many times with good success. However, sometimes it does not work, and you must replace the disk. If that is the case, you need to use metadetach rather than metaoffline for all metadevices on the affected drive, and then metareplace -e after the new drive is installed and properly partitioned.

Regards,
Dan Lorenzini
Greenwich Capital Markets
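
Combining Guy's detach/reattach framing with Dan's format(1M) suggestion, the session might look roughly like the sketch below (metadevice names taken from the original question; the format menu steps are an outline, not a transcript, so verify them against your Solaris release):

    # detach the errored submirrors first (-f may be needed because
    # they are in maintenance state)
    metadetach -f d2 d1
    metadetach -f d8 d7

    format
    # at the format> prompt: select c2t1d0, then
    #   analyze      enter the surface-analysis menu
    #   read         non-destructive scan; bad blocks get remapped
    #   (setup lets you restrict the block range or enable write passes)

    # if the scan comes back clean, reattach and let the mirrors resync
    metattach d2 d1
    metattach d8 d7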
=======================================================================
Don't believe this will work [re option 1, rebooting], but who knows. Reminds me of the old joke about what an IT person does when they get a flat tire: turn the car off and restart it to see if the problem goes away.

Assuming the disk is OK, I believe this [option 2] will solve the problem. You could also go a bit further: detach the mirror, then re-init the mirror and reattach. I would probably do the re-init, since it isn't much more work and should really clean up any data corruption issues (assuming a good disk). There should not be any problem doing this even on root. After all, the mirrors are bad, so they should not be in use by anything anyway; and even if they were in use, this is the point of mirroring.

The question is whether the old disk is bad or not, and whether the cost of a new disk exceeds the cost of a possible disk failure. Since you are mirroring to begin with, it sounds like an important system, and I would tend to replace the disk (I might put the old disk to duty in a less critical situation).

BTW, you should be able to offline the working mirrors on the problem disk and replace the disk (if it's not hot-swappable, this will require rebooting; you should ensure that you have more than 50% of the state database replicas on other disks before rebooting, and delete the replicas on the problem disk). Then run metareplace for each of the mirrors and they should start resyncing.

<Tom Payerle>

=======================================================================
I would be inclined to first try a fourth option:

- metareplace the "bad" submirrors in place:

  metareplace -e d2 c2t1d0s0
  metareplace -e d8 c2t1d0s3

If DiskSuite kicks it/them back out again, you probably do have something wrong with the disk, but you could also try option #5:

- detach/unmirror the bad submirrors (it's been awhile since I've had to try this, so I can't remember if it will let you detach a bad submirror... maybe with the -f option... I dunno);
- metaclear the bad submirrors;
- either fsck or newfs (your choice) the bad partitions, the idea being to "clean up" any residual filesystem bugginess;
- metainit the bad submirrors;
- metattach the new submirrors.

If that doesn't work, option #2 would be my next choice, then option #3. Option #1 doesn't work because the mddb retains its state between reboots; it would still think that the components are bad.

Good luck.

Gregg Mackenzie

=======================================================================
I would probably do:

- Detach on the failing disk (metadetach -f ...)
- Clear the failing disk (metaclear ...)
- Get rid of the replica DBs on the failing disk (metadb -d ...)
- Edit /etc/vfstab and change back to standard device names
- Shutdown, remove the failing disk, reboot (and check that everything comes up OK)
- Shutdown, put in a new disk (be sure it's clean; if it happens to have replica DBs on it from some other previous configuration, you may have some trouble). Reboot.
- Set up the mirror configuration again...

Richard Eisenman
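
Spelled out, the detach-and-rebuild cycle that Gregg's option #5 and Richard's procedure both describe might look like this (a sketch only, using the metadevice and slice names from the question below; newfs destroys the detached half's copy of the data, which is safe here only because the surviving submirrors stay live, and per Tom's note a majority of the replicas must sit on other disks before any reboot):

    metadetach -f d2 d1        # force-detach the errored submirrors
    metadetach -f d8 d7
    metaclear d1 d7            # remove the submirror metadevices
    metadb -d c2t1d0s7         # drop the replica on the suspect disk
    newfs /dev/rdsk/c2t1d0s0   # or fsck, to scrub residual filesystem damage
    newfs /dev/rdsk/c2t1d0s3
    metainit d1 1 1 c2t1d0s0   # re-create the submirrors (1-stripe concats)
    metainit d7 1 1 c2t1d0s3
    metattach d2 d1            # reattach; DiskSuite resyncs from the good half
    metattach d8 d7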
=======================================================================
This might be a little late, but I thought it might help. We have some V880s running DiskSuite 4.2, and we've seen one quirk. When we were building the systems and rebooting a lot (screwing with kernel parameters), we found that if the system came up with the mirrors out of sync despite a normal reboot, they would be out of sync every time we rebooted. So we'd reboot, DS would tell us they needed maintenance, we'd metareplace the disk with itself, let it do a full rebuild until DS said they were consistent, then reboot again, and the same disk would be out of sync. If we detached the mirror and reattached it (letting it sync, obviously), that would fix the problem, and every reboot after that would come up clean. Strange, but I've seen it on a couple of different Solaris installs on a couple of different boxes. Aside from that, DS is great.

If you're still having problems, it might be worth detaching and reattaching the mirror to see if that fixes the problem before you do something crazy like reboot.

Best,
John

=======================================================================
Original Question:

Hello:

I have a problem with a DiskSuite 4.2 mirror, and I'd like some advice on how to tackle it. I have a few two-way mirrors, and I recently discovered that some of the submirrors went into a "Maintenance/Critical" state. One mirror is mounted as / and the other as /var. In each case, the failed submirror is on the same disk. However, that disk also has another submirror which is working just fine, so I'm guessing the disk may not actually be bad (then again, it could become a problem). I have included my metastat, metadb, and syslog output detailing the errors at the bottom of this email. In each instance, the bad submirror is on c2t1d0. The metadb I also had on this disk is bad, but I've got six other ones spread across two other controllers.

My question is this: what is my best approach? I can see three options:

1) Reboot and hope the problem clears itself up :) Does this actually work sometimes?

2) Offline the submirrors and then "online" them. Since one of the submirrors is for /, I'm not exactly sure this is a good idea. If it matters, the problem disk is not the primary boot disk. Is this a good option to try before breaking the root mirror and going through the hassle?

3) Detach/unmirror the root, reboot, edit the correct files, come up unmirrored, slap in a new disk, etc. Again, I'm not sure the disk is actually bad since another submirror is OK. But there could be some bad sectors.

This is my first problem under DiskSuite in about two years, so I guess I've been pretty lucky. It obviously saved my butt, and I don't want to make matters worse by doing something stupid. Any help is greatly appreciated. I have an hour of scheduled downtime starting in about 8 hours :) Will summarize.

Thanks very much,
Mark

# metadb -i
        flags           first blk       block count
     a m  p  luo        16              1034            /dev/dsk/c0t0d0s7
     a    p  luo        16              1034            /dev/dsk/c0t1d0s7
     a    p  luo        16              1034            /dev/dsk/c0t2d0s7
     a    p  luo        16              1034            /dev/dsk/c1t0d0s7
     a    p  luo        16              1034            /dev/dsk/c1t1d0s7
     a    p  luo        16              1034            /dev/dsk/c1t2d0s7
      W   p  l          16              1034            /dev/dsk/c2t1d0s7

# metastat | more
d2: Mirror
    Submirror 0: d0
      State: Okay
    Submirror 1: d1
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 24578400 blocks

d0: Submirror of d2
    State: Okay
    Size: 24578400 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c2t0d0s0    0            No     Okay

d1: Submirror of d2
    State: Needs maintenance
    Invoke: metareplace d2 c2t1d0s0 <new device>
    Size: 35549760 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c2t1d0s0    0            No     Maintenance

d8: Mirror
    Submirror 0: d6
      State: Okay
    Submirror 1: d7
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 4097920 blocks

d6: Submirror of d8
    State: Okay
    Size: 4097920 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c2t0d0s3    0            No     Okay

d7: Submirror of d8
    State: Needs maintenance
    Invoke: metareplace d8 c2t1d0s3 <new device>
    Size: 4097920 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c2t1d0s3    0            No     Maintenance

May 9 08:57:22 emsdb3 scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/scsi@1/sd@1,0 (sd46):
May 9 08:57:22 emsdb3   SCSI transport failed: reason 'incomplete': retrying command
May 9 08:58:27 emsdb3 scsi: [ID 365881 kern.info] /pci@4,2000/scsi@1 (glm3):
May 9 08:58:27 emsdb3   Cmd (0x708fc320) dump for Target 1 Lun 0:
May 9 08:58:51 emsdb3 scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/scsi@1/sd@1,0 (sd46):
May 9 08:58:51 emsdb3   Error for Command: write(10)   Error Level: Fatal
May 9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]   Requested Block: 12028560   Error Block: 12028560
May 9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]   Vendor: SEAGATE   Serial Number: 3AK0E8CY
May 9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]   Sense Key: Not Ready
May 9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]   ASC: 0x4 (<vendor unique code 0x4>), ASCQ: 0x1, FRU: 0x2
May 9 08:58:51 emsdb3 md_stripe: [ID 641072 kern.warning] WARNING: md: d1: write error on /dev/dsk/c2t1d0s0
May 9 08:58:56 emsdb3 md_mirror: [ID 104909 kern.warning] WARNING: md: d7: /dev/dsk/c2t1d0s3 needs maintenance
May 9 08:58:56 emsdb3 md_mirror: [ID 104909 kern.warning] WARNING: md: d1: /dev/dsk/c2t1d0s0 needs maintenance