Many thanks to the following for ideas and suggestions, and for the quick responses.

From: Tim Hespe <t.hespe@unsw.edu.au>
From: Dan Astoorian <djast@cs.toronto.edu>
From: Scott Croft <secroft@micron.com>
From: Kristian Styrvoll <kristian.styrvoll@eterra.no>
From: Tony Walsh - Field Service Engineer <Tony.Walsh@Sun.COM>
From: Matthew Stier <Matthew.Stier@fnc.fujitsu.com>
From: "Thomas M. Payerle" <payerle@physics.umd.edu>
From: "Kevin Buterbaugh" <Kevin.Buterbaugh@lifeway.com>
From: "Mortensen, Henrik" <henrik.mortensen@csfb.com>
From: "Keplinger, Michael A" <michael.keplinger@nmci-isf.com>
From: Eric Shafto <eshafto@mac.com>

I ended up using a combination of these ideas, summarized as follows:

metadb -i
    (to check that the metadb's are on slice s7 of both disks)
prtvtoc /dev/rdsk/c0t1d0s2 > /var/adm/doc/20021007.c0t1d0s2.vtoc
metadetach d0 d2
    (this failed:
     metadetach: : d0: attempt an operation on a submirror that has erred components)
metadetach d10 d12
metadetach d20 d22
metadetach d30 d32
metadb -d c0t1d0s7

Pulled the old disk
Put in the new disk

fmthard -s /var/adm/doc/20021007.c0t1d0s2.vtoc /dev/rdsk/c0t1d0s2
metattach d0 d2
    (this failed:
     metattach: : d2: invalid unit)
metattach d30 d32
metattach d20 d22
metattach d10 d12
metadb -a -c 3 c0t1d0s7

To get over the d0 d2 problem:

metareplace -e d0 c0t1d0s0

After all the syncing we finally have a system back again!!!

Once again many thanks to all,

Clive Elsum

**************************************************************************
From: Tim Hespe <t.hespe@unsw.edu.au>

I have a book, one of the Sun Blueprint series, called "Boot Disk Management", which covers exactly this scenario. It is well worth getting hold of. It covers both Disk Suite and Veritas setups.

You seem to be on the right track. The only thing you don't seem to have taken into account is the removal and restoration of metaDB replicas.
The sequence of events is basically:

metadetach
metaclear
metadb -d    # to remove metaDB replicas on the affected drive
use format or fmthard (using a previously saved copy of the vtoc) to slice the disk
metainit
metattach
metadb -a    # to create metaDB replicas on the new disk

As you have shown in your procedure, metareplace can be used instead of the metaclear->metainit->metattach sequence of commands, but for some reason they don't use it. Probably for the sake of clarity.

If you give me your fax number I can send the pages from the book (4) rather than me trying to paraphrase them.

**************************************************************************
From: Dan Astoorian <djast@cs.toronto.edu>

You have two copies of the data for /, and both of them have reported errors. There are no guarantees: mirroring does not protect against failures of both copies of your data.

You may wish to run "metastat -t" to see how long each submirror has been offline. (It's possible that d1 failed a long time ago, and nobody noticed.)

You almost certainly don't want to use metaonline and metaoffline. When you use metaoffline, the system keeps track of updates to the other mirror, so it knows which blocks to update when it's brought online. If you replace the disk, the system will assume that any blocks that haven't changed since the disk was taken offline are still in sync. Since they're not, you'll get data corruption.

> metaoffline d0 d2

I would venture to guess that this command will fail; so would "metadetach." d2 is the "last erred" copy of the data, which means it's less outdated than the data on d1.

Consult the DiskSuite 4.2.1 User's Manual, available from docs.sun.com. In particular, see page 133 ("Submirror States").

What I would try is: attach a working disk at SCSI target 2, format it the same as sd0, and try:

metareplace d0 c0t0d0s0 c0t2d0s0

as per the "Invoke" line in the metastat output.
If I couldn't attach three disks at the same time, I would remove the disk c0t0d0 (after first metadetaching d12, d21, and d32), and instead use the command:

metareplace -e d0 c0t0d0s0

Be warned, however, that the system may not allow you to do even this metareplace command, because there is no error-free copy of your data anywhere on the system for DiskSuite to use to write a new copy of the mirror. In such a case, you may have to metaclear the metadevices and re-metainit them. Unfortunately, you can't do that while the metadevices are in use. You may ultimately need to go to your backup tapes, and/or reinstall your operating system.

**************************************************************************
From: Scott Croft <secroft@micron.com>

We have used metadetach to detach the mirrors. Remove any hot spare devices, remove the copy of the database on the bad disk if you put one there, take the system down, and replace the disk. Reboot the system with -r (you shouldn't have to run devfsadm). Copy the partition format from the primary disk to the secondary, re-create the database on the new disk, metattach, and you should be done. Run metastat to see progress.

**************************************************************************
From: Kristian Styrvoll <kristian.styrvoll@eterra.no>

Go to http://docs.sun.com, open the Solstice DiskSuite 4.2.1 User's Guide, and search for "How to Recover From a Boot Device Failure (Command Line)".

It works for me.

**************************************************************************
From: Tony Walsh - Field Service Engineer <Tony.Walsh@Sun.COM>

I have not seen a summary regarding this, so I will proffer the following advice:-

1 - Use 'metadetach <mirror> <submirror>' for ALL slices on the faulty drive not already in 'Needs Maintenance' state.
2 - If there are metadb datasets on the failing drive (which there should be), they need to be removed with the 'metadb -d <component>' command (eg.
metadb -d /dev/dsk/c0t1d0s7 to remove all metadb's in that slice).
3 - Remove faulty drive and replace with new drive. The E420R uses hot-swappable drives so no power outage is required.
4 - Reformat the drive by copying the VTOC of the good drive onto the new drive. Use the format command to copy the VTOC, or use prtvtoc output as input to fmthard.
5 - Use 'metareplace -e <mirror> <component>' for each slice to be mirrored again. (eg. metareplace -e d0 /dev/dsk/c0t1d0s0)
6 - Re-establish metadb's on new drive with the 'metadb <options> <component>' command (eg. metadb -a -c 3 /dev/dsk/c0t1d0s7)

Step 5 will take some time to complete the synchronisation process, but step 6 does not have to wait. You should also wait for the sync process to finish and then schedule a reboot at your convenience if possible.

If you only have the 2 drives in this system, then it is recommended that you have 3 metadb's on each drive so that you will always have a quorum should one drive completely fail. These metadb's are usually put on slice 7, but any spare slice with at least 30 MB available (for SDS 4.2.1) is recommended (30 MB is the maximum required, but this configuration could get away with 10 MB as a minimum). If you don't have 3 metadb's on the good drive, fix that first before carrying on with this process.
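The replica arithmetic behind that advice can be sketched in a few lines of shell. This is only an illustration: replica_status is a hypothetical helper, not a DiskSuite command, and the half versus half+1 distinction it reports is the one Thomas Payerle mentions further down in this thread. It takes the total number of replicas and the number that survive losing one disk.

```shell
#!/bin/sh
# replica_status TOTAL SURVIVING
# Hypothetical helper (not part of DiskSuite). Reports whether the surviving
# replicas are a strict majority of the total, exactly half, or fewer than half.
replica_status() {
    total=$1
    surviving=$2
    doubled=$((surviving * 2))
    if [ "$doubled" -gt "$total" ]; then
        echo majority
    elif [ "$doubled" -eq "$total" ]; then
        echo half
    else
        echo below-half
    fi
}

# Two disks with three replicas each, one disk lost.
replica_status 6 3
# Good disk holds three replicas, failed disk held only one.
replica_status 4 3
# Good disk holds one replica, failed disk held three.
replica_status 4 1
```

With the symmetric 3+3 layout, losing either disk leaves exactly half of the replicas, which is why a two-disk setup can never keep a strict majority through the loss of an arbitrary disk.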
Your steps would therefore be as follows:-

metadetach d0 d2 (if it has not already done so)
metadetach d10 d12
metadetach d20 d22
metadetach d30 d32
metadb -d c0t1d0s7 (assuming you have metadb's on this slice)

Replace disk "hot swap" NO POWER OFF
Format the disk as per prtvtoc of old disk

metareplace -e d0 c0t1d0s0
metareplace -e d10 c0t1d0s1
metareplace -e d20 c0t1d0s3
metareplace -e d30 c0t1d0s4
metadb -a -c 3 c0t1d0s7 (if they came from this slice previously)

**************************************************************************
From: Matthew Stier <Matthew.Stier@fnc.fujitsu.com>

metaoffline/metaonline expects that the metapartition being offline'd/online'd basically hasn't changed, and that only changes recorded since the offlining need to be run against the offline'd partition.

This is not what you want.

You need to metadetach and metaclear all metapartitions that are present on that drive. Once all the partitions are clear, you can:

1) Replace the drive
2) Partition it
3) Metainit the metapartitions on the drive.
4) Metattach the metapartitions to recreate your mirrors.

Once the metattach has completed syncing, the task will be finished.

**************************************************************************
From: "Thomas M. Payerle" <payerle@physics.umd.edu>

I find the advice on http://www.slacksite.com/solaris/disksuite/SDSrecovery.html to be pretty good. I believe I even gave it a test run once. It may also be excessive, as it refers to boot/root devices and assumes a 2-disk mirror setup (wherein there are complications, as you will only have half, not half+1, metadb replicas up).

I believe the metadetach, format, (metattach), metareplace sequence works (I won't say it is the correct procedure, because the other may work as well). The question about power on or off depends more on the hardware than on DiskSuite: is your hardware hot swappable? If it is not, or you are unsure, you should power down after detaching, replace the drive, and power back up.
If it is hot swappable, you can just replace the drive with the system powered up. If the replacement drive is ID'ed like the original (e.g. same SCSI chain, ID, etc., so that the /dev/cntndnsn names would be unchanged), I don't believe metattach is needed, and you can just do a metareplace -e (see the metareplace man page). The drive must be labelled first to match the old prtvtoc info (actually, the affected partition probably just needs to be the same size, but as one is usually mirroring entire drives, the effect is the same).

**************************************************************************
From: "Kevin Buterbaugh" <Kevin.Buterbaugh@lifeway.com>

Sun's short procedure is wrong. Your procedure is correct, replacing the metaoffline and onlines with metadetach and metattach, respectively.

As an aside, Sun does have an excellent "Blueprints" book on this (covers mirroring the root disk with both DiskSuite and Veritas). It's called "Sun Blueprints Guide to High Availability" by Kobert. Well worth it, IMHO...

**************************************************************************
From: "Kevin Buterbaugh" <Kevin.Buterbaugh@lifeway.com>

metaonline only makes sense when the disk you're online'ing already contains most of the data for the mirror (i.e. if that was the disk previously metaoffline'd); when swapping you need a full sync-up.

I'd yank c0t0d0 (you can metadetach all mirrors on that disk if you don't trust ODS, but I've never needed to); swap c0t0d0;

prtvtoc /dev/dsk/c0t1d0s2 | fmthard -s - /dev/rdsk/c0t0d0s2

(if you have the same geometry, otherwise do it manually with format or re-label); metareplace -e all the now broken mirrors (or the one broken and metattach the rest if you metadetached them).

To fix the last-err'ed slice, you can initially try to metareplace -e it as it is (since both disks are on the same SCSI bus, the error might be bus related). If it still fails, you'll have to do the above for c0t1d0 as well.
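For the detach-then-reattach variant, it can help to print the whole command sequence and read it over before touching the system. The following is a dry-run sketch only: print_plan is a hypothetical helper that echoes commands and never executes them. It assumes the replicas live on slice 7, that the partition table is copied from the surviving disk, and that every listed submirror can actually be detached (in the summary above d0 could not, and was recovered with metareplace -e instead).

```shell
#!/bin/sh
# print_plan GOOD_DISK NEW_DISK MIRROR:SUBMIRROR...
# Hypothetical dry-run helper: only echoes the commands, never runs them.
# Assumes metadb replicas on slice 7 and whole-disk label on slice 2.
print_plan() {
    good=$1
    new=$2
    shift 2
    # Detach every submirror that lives on the disk being replaced.
    for pair in "$@"; do
        echo "metadetach ${pair%%:*} ${pair#*:}"
    done
    # Drop the replicas on the old disk before pulling it.
    echo "metadb -d ${new}s7"
    echo "# physically swap the disk here"
    # Copy the label from the surviving disk onto the new one.
    echo "prtvtoc /dev/rdsk/${good}s2 | fmthard -s - /dev/rdsk/${new}s2"
    # Reattach the submirrors, which starts a full resync.
    for pair in "$@"; do
        echo "metattach ${pair%%:*} ${pair#*:}"
    done
    # Recreate the replicas on the new disk.
    echo "metadb -a -c 3 ${new}s7"
}

print_plan c0t0d0 c0t1d0 d10:d12 d20:d22 d30:d32
```

Keeping the mirror and submirror names as explicit pairs makes it easy to compare the printed plan against metastat output before running anything by hand.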
**************************************************************************
From: "Keplinger, Michael A" <michael.keplinger@nmci-isf.com>

I recently had some similar problems. Are you looking to replace the whole drive? If so, you should be able to do so with just the metareplace -e command.

However, from looking at your md.tab file it doesn't look like your second mirror is attached. Verify this with metastat -p; the output should look like this for d0 (note the main difference, the second submirror d2 on the first line):

d0 -m d1 d2 1
d1 1 1 c0t0d0s0
d2 1 1 c0t1d0s0

If this is okay and you are planning on replacing the whole drive, then I am pretty sure the best way to do it is to just swap it out and then copy over the partition table. You don't even need to create filesystems. Then for each of the mirrors run the following commands:

metareplace -e d0 c0t1d0s0
metareplace -e d10 c0t1d0s1
metareplace -e d20 c0t1d0s3
metareplace -e d30 c0t1d0s4

You will notice that the first parameter after the -e flag is the mirror name, not the submirror name.

This will cause the mirrors to resync, unless they weren't attached to begin with, in which case you will want to run:

metattach d0 d2
metattach d10 d12
metattach d20 d22
metattach d30 d32

Once you are done with this you will want to recreate the metadevice database on that disk. See what databases you have with the metadb command; then for the databases on that disk you will want to destroy them and recreate them:

metadb -d c0t1d0s?
metadb -a -c3 c0t1d0s?

(I usually put 3 copies of the DB on each disk when there are only 2 disks.)

**************************************************************************
From: Eric Shafto <eshafto@mac.com>

If you're replacing it with the same disk model in the same slot at the same SCSI ID, you don't need to do the devfsadm or drvconfig or disks.

I've done this before several times. There may be a quicker way to do it but here's what worked for me:

1. metadb to remove the metadbs from the failing disk.
2.
metadetach each of the submirrors on the failing disk.
3. shut down and replace the disk.
4. format, replicate the partition table from the good disk to the new disk.
5. metadb to add the metadbs to the new disk.
6. metattach each of the submirrors on the new disk (you don't have to create them, since they don't really exist on the disk. Simply having them in the metadb is sufficient).
7. installboot on disk 2 (saves you a boot cdrom when disk 1 fails).

Before doing step 1, make sure you don't leave yourself with too few metadbs. If you have only one or two on each disk, and you don't have any on any other disks, then you don't have enough and your recovery will be more complicated. Never leave yourself with fewer than three metadbs.

**************************************************************************

---------------------------------------------------------------------
Clive Elsum BAppSc, RHCE
Systems Engineer - Information Technology Group
CSIRO Atmospheric Research
PMB 1, Aspendale, Victoria, Australia 3195
Phone : (+61 3) 9239 4509 Fax: (+61 3) 9239 4444
E-mail Clive.Elsum@csiro.au
---------------------------------------------------------------------

Original question:

Hi,

I am having problems getting a definitive approach to replacing a mirrored system disk on our Sun 420R.

We are running Solaris 8 on a Sun 420R with 2 18Gb disks mirrored via DiskSuite 4.2.1. The second disk is showing errors and needs to be replaced.

The problem is I keep getting conflicting information on the correct procedure. Sun basically gave it "short shrift", saying use metaoffline, metaonline, metareplace:

1 - use the command metaoffline <mirror name> ...to offline the mirror (the secondary one.)
2 - Shutdown and replace the faulty disk and run devfsadm or drvconfig ; disks
3 - Up the system and run the command metaonline <mirror name>
4 - when disks are synced run the command metareplace -e

The mirror will then eventually recover.
This does not seem correct, as metaonline would enable at bootup and a boot -r would reconfigure the disks etc. Also no mention of formatting the disk.

Other stuff I have looked at indicates metadetach, then replace the faulty disk (some say power down, others say on-line), format the disk as per the failed disk's prtvtoc, then metattach, then metareplace.

I really need a definitive method of attack that will work.

Given the md.tab file is:

# Mirror for /
#
d0 -m d1
d1 1 1 /dev/dsk/c0t0d0s0
d2 1 1 /dev/dsk/c0t1d0s0
#
#
# Mirror for swap
#
d10 -m d11
d11 1 1 /dev/dsk/c0t0d0s1
d12 1 1 /dev/dsk/c0t1d0s1
#
#
# Mirror for /usr/local
#
d20 -m d21
d21 1 1 /dev/dsk/c0t0d0s3
d22 1 1 /dev/dsk/c0t1d0s3
#
#
# Mirror for /it
#
d30 -m d31
d31 1 1 /dev/dsk/c0t0d0s4
d32 1 1 /dev/dsk/c0t1d0s4

Would the correct procedure be:

metaoffline d0 d2
metaoffline d10 d12
metaoffline d20 d22
metaoffline d30 d32

Replace disk "hot swap" NO POWER OFF
Format the disk as per prtvtoc of old disk

metaonline d0 d2
metaonline d10 d12
metaonline d20 d22
metaonline d30 d32
metareplace -e d2 c0t1d0s0
metareplace -e d12 c0t1d0s1
metareplace -e d22 c0t1d0s3
metareplace -e d32 c0t1d0s4

OR do I replace metaoffline with metadetach and metaonline with metattach, and if so are there any other steps I am missing?
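Whichever command pair is used, the list of submirrors on the failing disk can be read straight out of md.tab rather than typed from memory. A minimal sketch, assuming every submirror is a single-slice concat written as "NAME 1 1 /dev/dsk/..." as in the file above; submirrors_on is a hypothetical helper, and on Solaris 8 the awk call would probably need to be nawk for -v to work.

```shell
#!/bin/sh
# submirrors_on DISK  (reads md.tab-style lines on stdin)
# Hypothetical helper: prints "SUBMIRROR SLICE" for each one-slice concat
# whose component is on the given disk, e.g. c0t1d0.
submirrors_on() {
    awk -v disk="$1" '
        /^#/ { next }                                  # skip comment lines
        $2 == "1" && $3 == "1" && index($4, "/dev/dsk/" disk "s") == 1 {
            n = split($4, parts, "/")                  # last path element is the slice
            print $1, parts[n]
        }
    '
}

# Excerpt of the md.tab shown above.
submirrors_on c0t1d0 <<'EOF'
# Mirror for /
d0 -m d1
d1 1 1 /dev/dsk/c0t0d0s0
d2 1 1 /dev/dsk/c0t1d0s0
# Mirror for /usr/local
d20 -m d21
d21 1 1 /dev/dsk/c0t0d0s3
d22 1 1 /dev/dsk/c0t1d0s3
EOF
```

For this excerpt the function lists d2 on c0t1d0s0 and d22 on c0t1d0s3, and skips d1 and d21, which sit on the healthy disk.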
Much thanks in advance

Clive

Output info shows:

# iostat -E
sd0 Soft Errors: 48 Hard Errors: 0 Transport Errors: 0
Vendor: IBM Product: DDYST1835SUN18G Revision: S96H Serial No: 157444
Size: 18.11GB <18110967808 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 48 Predictive Failure Analysis: 0
sd1 Soft Errors: 48 Hard Errors: 35 Transport Errors: 16
Vendor: IBM Product: DDYST1835SUN18G Revision: S96H Serial No: 10K705
Size: 18.11GB <18110967808 bytes>
Media Error: 30 Device Not Ready: 0 No Device: 5 Recoverable: 0
Illegal Request: 48 Predictive Failure Analysis: 0
sd6 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-M1401 Revision: 1007 Serial No: 06/22/00
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd30 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: OPENstorage 9176 Revision: 0401 Serial No: 1T03310196
Size: 365.06GB <365061079040 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd46 Soft Errors: 0 Hard Errors: 1 Transport Errors: 0
Vendor: STK Product: OPENstorage 9176 Revision: 0401 Serial No: 1T02811801
Size: 365.06GB <365061079040 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd68 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: OPENstorage 9176 Revision: 0401 Serial No: 1T03310196
Size: 220.09GB <220091908096 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd74 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: OPENstorage 9176 Revision: 0401 Serial No: 1T02811801
Size: 220.09GB <220091908096 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure
Analysis: 0
sd330 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: Universal Xport Revision: 0401 Serial No: 1T03310196
Size: 0.02GB <18874368 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
sd474 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: Universal Xport Revision: 0401 Serial No: 1T02811801
Size: 0.02GB <18874368 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
st15 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: 9840 Revision: 1.30 Serial No: .109
st16 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: 9840 Revision: 1.30 Serial No: .109
st17 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: 9840 Revision: 1.30 Serial No: .109
st18 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: T9940A Revision: 1.30 Serial No: .210
st19 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: T9940A Revision: 1.30 Serial No: .210
st20 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: STK Product: T9940A Revision: 1.30 Serial No: .210

# metastat
d0: Mirror
    Submirror 0: d1
      State: Needs maintenance
    Submirror 1: d2
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 16779432 blocks

d1: Submirror of d0
    State: Needs maintenance
    Invoke: metareplace d0 c0t0d0s0 <new device>
    Size: 16779432 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t0d0s0    0            No     Maintenance

d2: Submirror of d0
    State: Needs maintenance
    Invoke: after replacing "Maintenance" components:
            metareplace d0 c0t1d0s0 <new device>
    Size: 16779432 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t1d0s0    0            No     Last Erred

d10: Mirror
    Submirror 0: d11
      State: Okay
    Submirror 1: d12
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel
(default)
    Size: 4198392 blocks

d11: Submirror of d10
    State: Okay
    Size: 4198392 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t0d0s1    0            No     Okay

d12: Submirror of d10
    State: Okay
    Size: 4198392 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t1d0s1    0            No     Okay

d20: Mirror
    Submirror 0: d21
      State: Okay
    Submirror 1: d22
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 8392072 blocks

d21: Submirror of d20
    State: Okay
    Size: 8392072 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t0d0s3    0            No     Okay

d22: Submirror of d20
    State: Okay
    Size: 8392072 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t1d0s3    0            No     Okay

d30: Mirror
    Submirror 0: d31
      State: Okay
    Submirror 1: d32
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 5955968 blocks

d31: Submirror of d30
    State: Okay
    Size: 5955968 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t0d0s4    0            No     Okay

d32: Submirror of d30
    State: Okay
    Size: 5955968 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t1d0s4    0            No     Okay

# prtvtoc /dev/rdsk/c0t1d0s0
* /dev/rdsk/c0t1d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*     248 sectors/track
*      19 tracks/cylinder
*    4712 sectors/cylinder
*    7508 cylinders
*    7506 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
*                          First     Sector      Last
* Partition  Tag  Flags    Sector     Count      Sector  Mount Directory
       0      2    00          0  16779432  16779431
       1      3    01   16779432   4198392  20977823
       2      5    00          0  35368272  35368271
       3      4    00   20977824   8392072  29369895
       4      0    00   29369896   5955968  35325863
       7      0    00   35325864     42408  35368271
#

Thanks in advance

Clive

---------------------------------------------------------------------
Clive Elsum BAppSc, RHCE
Systems Engineer - Information Technology Group
CSIRO Atmospheric Research
PMB 1, Aspendale, Victoria, Australia 3195
Phone : (+61 3) 9239 4509 Fax: (+61 3) 9239 4444
E-mail Clive.Elsum@csiro.au
---------------------------------------------------------------------
Received on Wed Jul 10 23:47:03 2002