I got a couple of replies suggesting that the drives might be incompatible
with this version of the OS, but Sun says otherwise. When I spoke to Sun, we
did come up with the bugid/patch below, which was released in their current
recommended patch cluster this month. It hasn't been proven in my office yet
but looks promising (yep, my customers do -i 5 :-)

... mha

Original post:

On Tue, 20 Nov 2001, Michael Auria wrote:

> We are seeing an abnormally high failure rate of Ultra 10 boot drives. We've
> seen this on a couple of different IDE drives and it seems to be related to
> the Ultra 10s. We're running recommended patched Solaris 2.5.1 w/2 IDE drives
> (9 & 20 gig from Sun). The failure seems to be triggered by doing a
> shutdown on the machine. The typical symptom is dad0 not selected or some
> other nonsense related to the drive not being available. A lot of times,
> we'll see BAD BLK messages. If I put in a new drive and install the
> "crashed" drive as the slave, I can mount it (normally -r), fsck it
> (normally encounter bad block) and normally recover the data (that I need
> anyway; haven't tried all, no need). Tried doing an installboot on it which
> worked but doesn't get past the not selected noise.
>
> Anyone know of anything like this ?

bugid report:

Bug Id: 4380416
Product: sunos
Category: kernel
Subcategory: ddi
Bug/Rfe/Eou: bug
State: fixed
Development Status: FIX
Synopsis: init 5 corrupts filesystems on ultra-10 440MHz on 2.5.1 systems
Keywords: 10, 2.5.1, 440, 5, 5.5.1, MHz, corrupt, filesystem, fsck, init, no-s8+, u10, ultra, ultra-10
Severity: 2
Severity Impact: 1
Severity Functionality: 0
Priority: 2
Responsible Manager: martie
Responsible Engineer: scua

Description:

Customer has made a system for his customers built upon an ultra10 system.
The system uses regular ide disks. Customer has got reports from around the
world that init 5 blows the file system. In the most extreme cases their
customers have to run fsck manually.
We made a test on the customer system and could reproduce the problem with
ultra-10 immediately. init 5 seems to always generate a new fsck when the
system boots up. All our tests ended with the conclusion that it was
impossible to get this to work. When the disk was given its own power supply,
everything worked.

I had some conversation about this on the net:

James Litchfield :
>
> That's because one of the Solaris engineers spent a lot of time
> ensuring that it would work in 2.6. It's also one of the reasons
> that moving to later releases is a good idea. Why is this customer
> still on 2.5.1?
> I need to correct my statement. The fixes to make all of this work
> reliably went into Solaris 7. The fact that it worked on 2.6 may well
> be serendipity.
> Jim
> ---

Sounds like your customer is running into the write-back (instead of
write-through) cache on the EIDE disks we ship... the power can be removed
before the dirty buffers are written to disk, resulting in the fsck when the
machine is rebooted.

Shiv: is there a patch for SunOS 5.5.1 for that (is it the correct
diagnosis?)?

Cheers!
greg

Customer wants this fixed in a patch. Going up to solaris 8 is not an option
because customer uses an application using XGL XIL which is EOL in solaris 2.6.

::::::::::::::
prtdiag-v.out
::::::::::::::
System Configuration:  Sun Microsystems  sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 440MHz)
System clock frequency: 110 MHz
Memory size: 128 Megabytes
       CPU Units: Frequency Cache-Size Version
            A: MHz  MB  Impl. Mask  B: MHz  MB  Impl. Mask
            ----------  ----- ----  ----------  ----- ----
               440 2.0   12    9.1
======================IO Cards=========================================
dev_find_node() Could not find any IO bus
System Configuration:  Sun Microsystems  sun4u
Memory size: 128 Megabytes
System Peripherals (Software Nodes):
SUNW,Ultra-5_10

Justification:
Extremely urgent to fix for customer, since customer has delivered about 90
systems around the world since the new year.
Work around:

Suggested fix:

*** /home/scua/ws/bug4380416/26/webrev/usr/src/uts/common/cpr/cpr_mod.c-  Mon Jun 18 12:19:13 2001
--- cpr_mod.c   Mon Jun 18 11:38:11 2001
------------------------------------------------------------------------
*** 18,27 ****
--- 18,28 ----
  #include <sys/systm.h>
  #include <sys/cpr.h>
  #include <sys/cpr_impl.h>

  extern int cpr_is_supported(void);
+ extern void reset_leaves(void);
  extern struct mod_ops mod_miscops;

  static struct modlmisc modlmisc = {
          &mod_miscops, "checkpoint resume"
------------------------------------------------------------------------
*** 167,179 ****
--- 168,187 ----
          if (fcn == AD_CPR_TESTZ || fcn == AD_CPR_TESTNOZ) {
                  mdboot(0, AD_BOOT, "");
                  /* NOTREACHED */
          }
+
          /*
           * If cpr_power_down() succeeds, it'll not return.
+          * Reset devices prior to power down; in particular,
+          * devo_reset op function is used to flush the IDE disk
+          * cache before powering down the disk. The devo_reset
+          * entry point was previously unused and deemed not to
+          * be used as per Solaris DDI spec".
           */
+         reset_leaves();
          if (fcn != AD_CPR_TESTHALT)
                  cpr_power_down();
          halt("Done. Please Switch Off");
          /* NOTREACHED */

*** /home/scua/ws/bug4380416/26/webrev/usr/src/uts/sun4u/io/autoconf.c-  Mon Jun 18 12:19:14 2001
--- autoconf.c  Fri May 25 14:32:43 2001
------------------------------------------------------------------------
*** 454,466 ****
  static int
  reset_leaf_device(dev_info_t *dev, void *arg)
  {
          struct dev_ops *ops;

-         if (DEVI(dev)->devi_nodeid == DEVI_PSEUDO_NODEID)
-                 return (DDI_WALK_PRUNECHILD);
-
          if ((ops = DEVI(dev)->devi_ops) != (struct dev_ops *)0 &&
              ops->devo_cb_ops != 0 && ops->devo_reset != nodev) {
                  CPRINTF2("resetting %s%d\n", ddi_get_name(dev),
                          ddi_get_instance(dev));
                  (void) devi_reset(dev, DDI_RESET_FORCE);
--- 454,463 ----

State triggers:
Accepted: yes
Evaluated: yes

Evaluation:

The fix is in reset_leaf_device.
Remove the following lines:

        if (DEVI(dev)->devi_nodeid == DEVI_PSEUDO_NODEID)
                return (DDI_WALK_PRUNECHILD);

This bug is related to bug 4337637, in which write data still in the disk
cache is not flushed at shutdown. This problem is only seen on IDE disks
since Sun doesn't support disk write-caching for SCSI drives. The solution
involves writing an entry point in the IDE driver (dad) to explicitly issue a
disk cache flush command (devo_reset dev_ops). This also requires changes in
the kernel to call this entry point upon shutdown, which should be done in
the following routine:

/*ARGSUSED1*/
static int
reset_leaf_device(dev_info_t *dev, void *arg)
{
        struct dev_ops *ops;

        if (DEVI(dev)->devi_nodeid == DEVI_PSEUDO_NODEID)
                return (DDI_WALK_PRUNECHILD);

        if ((ops = DEVI(dev)->devi_ops) != (struct dev_ops *)0 &&
            ops->devo_cb_ops != 0 && ops->devo_reset != nodev) {
                CPRINTF2("resetting %s%d\n", ddi_get_name(dev),
                        ddi_get_instance(dev));
                (void) devi_reset(dev, DDI_RESET_FORCE);
        }

        return (DDI_WALK_CONTINUE);
}

Since the kernel classifies the IDE driver as DEVI_PSEUDO_NODEID, we needed
to remove that "if" statement. This is not a problem for other devices since
that entry point is not supported, as seen in the dev_ops man pages:

        devo_reset    Reset device. (Not supported in this release.)
                      Set this to nodev.

Note also that in 2.8+, this "if" check has already been removed.

[dp@eng 2001-05-01]
If the evaluation is correct, this doesn't look like a kernel/boot bug.
Could you move it to the correct cat/subcat? Thanks!
Commit to fix in releases: 5.5.1, 5.6, 5.7
Fixed in releases: 5.5.1
Integrated in releases:
Verified in releases:
Closed because:
Incomplete because:
Duplicate of:
Introduced in Release:
Root cause:
Program management:
Fix affects documentation: no
Exempt from dev rel: no
Fix affects L10N: no
Interest list: fs@central, jan.wester@sweden, thomast@sweden
Patch id: 103640-38

Comments:

thomas.tornblom@Sweden 2000-10-19

I have been working with Jan on this case. We did a simple test tonight
where we booted the system with kadb and set a breakpoint at
"prom_power_off". We then ran "init 5" and when the system stopped at
"prom_power_off", all system chores are done and the remaining issue is to
remove power. We had the system continue into "prom_power_off" and power was
removed. No fsck was run when the system subsequently booted. We tried this
about half a dozen times and at no time was the system fsck:ed, even though
the pause at the breakpoint only lasted a few seconds, so it seems the disk
flushes its cache relatively quickly.

One feasible workaround, which I assume the cust could be made to accept, is
if we can add a simple nvram patch that delays "power-off" a few seconds. We
tried something like:

---
: power-off
   " Waiting for power off" cr type
   03000 ms
   power-off
;
---

as an nvramrc script, and while it works fine when called manually from the
"ok" prompt, the system did not call this function from prom_power_off. The
original definition of "power-off" was called. I'm definitely no Forth or OBP
hacker so someone with better knowledge in this field might come up with
something.

[ Eric.Taylor@West 01/09/01 15:08 PST ]
If I don't hear any objections, I'll close this bug in a week since it only
happens on 2.5.1/2.6 and has not been escalated.

Summary:

This bug pertains to systems with IDE drives. Unlike SCSI disks, IDE drives
have their internal write cache enabled.
Whenever the system is powered down, the data in the disk cache is not
flushed, causing possible data corruption. The fix requires writing a new
entry point in the IDE driver that will send a disk flush command before
powering down the disk. This entry point uses the devo_reset function of
dev_ops, which has never been used and is deemed to be unsupported based on
the Solaris ddi specs. To avail of this entry point, reset_leaf_device
(sun4u/io/autoconf.c) has been modified as well as cpr
(common/cpr/cpr_mod.c). The x86 version of autoconf.c already has the
modification.

The fix will be available for 2.5.1 & 2.6. There will be another RTI for 2.7.
The 3 deliverables (kernel, IDE driver, cpr) have to be present for the fix
to be complete, as flushing the disk cache occurs both when the system is
powered down and during suspend/resume. This is problematic for 2.5.1 as cpr
is unbundled. The review team deems the limited nature of the fix on 2.5.1
acceptable due to the expense of generating an additional cpr patch for
2.5.1. A patch with the cpr fixes could be generated for 2.5.1 if escalated
in the future. Therefore, this rti should generate 2 patches (kernel & cpr)
for 2.6 and just a kernel patch for 2.5.1. There will be another driver patch
from the IDE folks for each of the Solaris versions.

Related bug is 4435428, which is being addressed by the IDE fix.
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

Received on Wed Nov 28 00:52:19 2001