Thanks for all your suggestions, there were some good points in there. Thanks to Ric Anderson, John Leadeham, Tobias Nutt, Joe Fletcher, Bhaskar G, Pawel Osiczko, and Grzegorz Bakalarski for their prompt replies, and apologies for the late summary. It's only today that I'm completely happy, and getting there has involved moving a bunch of data onto the SAN (it needed doing anyway).

Suggestions were:

Having a UFS filesystem 95% full is a bad idea in the first place, because of the overhead of looking for free inodes / free data blocks. Also, UFS doesn't do so well on filesystems with millions of files. Hence the move of a chunk of data to the SAN.

Fragmentation can apparently still be an issue, and the only real cure for that would be a dump | restore. A messy option when you're talking about 500 GB of data.

If a controller has actually failed, this can trigger the array to switch to write-through mode, clobbering performance. In my case `show cache-param' still showed `mode: write-back', but it's definitely worth checking.

UFS can throttle writes in the case of high write rates, and this is tweakable.

A failed / failing drive can hurt performance. All my drives were good.

UFS journalling is important, and was turned on.

The optimisation mode can make a big difference, so think before you create a volume, because you can't change it later! I have mine optimised for random access, which seems about right for a mail spool.

There were also a couple of comments that the 3510 isn't a great performer in the first place, and suggestions to check for bad memory and to make sure the firmware's up to date. I'm a happy bunny at the moment, and firmware upgrades mean more downtime, so I'm going to schedule that for Christmas.

Anyway, I finally seem to have got it sorted, and it appears to have been due to the controllers being in a dodgy state, i.e.
this

sccli> show redundancy-mode
Primary controller serial number: 8040592
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Failed
Secondary controller serial number: 8009331
sccli>

On the suggestion of a guy from Sun, I tried

sccli> unfail

The Redundancy status changed to Scanning, and then to Detected, and then I lost one of my LUNs. Bugger. Then he suggested

sccli> reset controller

and the machine panicked and came back to single-user because of loss of metadb quorum. Bugger, bugger. I should have known better than that; I would have known that would happen if I hadn't been panicking myself.

Anyway, I shut the machine down, power-cycled the array, waited for the array to look healthy, and brought the machine back up. Redundancy status is now "Enabled", asvc_t is a tenth of what it was, throughput (kw/s) is 2-3 times what it was, and all's well with the world again.

Thanks again everybody,

Rob

The original problem:

Rob McMahon wrote:
> I've got a machine here which has recently (over the last few weeks)
> degenerated into being unusable at times. It's a V890 running Solaris
> 10, cyrus-imap (2.2.13) and squirrelmail. The mail partitions are on a
> 3510 FC, 500GB apiece, and RAID 5. The filesystems are UFS, and the
> problematic one is 95% full. When it becomes unusable, iostat shows the
> asvc_t times hitting 1000, 2000 or more. %b is pinned at 100% all the
> time. %w hits 60% on the one partition. At quiet times I don't seem to
> get better than:
>
>   r/s   w/s  kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>  71.8 258.8 618.7 4334.6  0.0 26.0    0.0   78.7   0 100 c6t600C0FF0000000000855613BE6F2D900d0
>  35.6 129.6 322.3 2068.4  0.0  0.0    0.0    0.0   0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp1
>  36.2 129.2 296.5 2266.2  0.0  0.0    0.0    0.0   0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp3
>
> which is lower throughput than I'd expect. Truss shows creates, renames
> and fdsyncs (which cyrus-imap seems to like using a lot) taking seconds.
>
> sccli does show
>
> sccli> show redundancy-mode
> Primary controller serial number: 8040592
> Primary controller location: Lower
> Redundancy mode: Active-Active
> Redundancy status: Failed
> Secondary controller serial number: 8009331
> sccli>
>
> and I have a call in about that with Sun, although they seem to be
> arguing about maintenance levels as normal.
>
> Really, I'm a bit desperate out here, and I'd like to hear any
> suggestions or pointers to things I might not have thought about.
>
> Any input gratefully received.
>
> Thanks,
>
> Rob

--
E-Mail: Rob.McMahon@warwick.ac.uk PHONE: +44 24 7652 3037
Rob McMahon, IT Services, Warwick University, Coventry, CV4 7AL, England
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Thu Nov 29 10:03:58 2007
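
For anyone wanting to keep an eye on the two symptoms described in this thread (a "Redundancy status" other than Enabled, and high asvc_t in iostat), here is a rough Python sketch. It only parses text that has already been captured by some other means; it does not run sccli or iostat itself. The function names, the 50 ms threshold and the idea that "Enabled" is the only healthy status are my assumptions for illustration, based purely on the output shown above, so treat it as a starting point and not a tested monitoring tool.

```python
# Sketch only: parses captured sccli / iostat text of the shape shown in this
# thread. All names and the threshold below are hypothetical.

SCCLI_SAMPLE = """\
sccli> show redundancy-mode
Primary controller serial number: 8040592
Primary controller location: Lower
Redundancy mode: Active-Active
Redundancy status: Failed
Secondary controller serial number: 8009331
sccli>
"""

IOSTAT_SAMPLE = """\
  r/s   w/s  kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 71.8 258.8 618.7 4334.6  0.0 26.0    0.0   78.7   0 100 c6t600C0FF0000000000855613BE6F2D900d0
 35.6 129.6 322.3 2068.4  0.0  0.0    0.0    0.0   0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp1
 36.2 129.2 296.5 2266.2  0.0  0.0    0.0    0.0   0   0 c6t600C0FF0000000000855613BE6F2D900d0.fp3
"""


def parse_fields(text):
    """Collect 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def redundancy_ok(text):
    """Assumption: 'Enabled' is the healthy state, as reported after the fix."""
    return parse_fields(text).get("Redundancy status") == "Enabled"


def slow_devices(iostat_text, asvc_limit=50.0):
    """Return (device, asvc_t) pairs whose asvc_t exceeds asvc_limit.

    Expects the 11-column layout shown in the thread, with asvc_t in the
    eighth column and the device name last. The limit is an arbitrary example.
    """
    slow = []
    for line in iostat_text.splitlines():
        cols = line.split()
        if len(cols) != 11:
            continue
        try:
            asvc_t = float(cols[7])
        except ValueError:
            continue  # header row, not a data row
        if asvc_t > asvc_limit:
            slow.append((cols[10], asvc_t))
    return slow


if __name__ == "__main__":
    print("redundancy ok:", redundancy_ok(SCCLI_SAMPLE))
    for device, asvc_t in slow_devices(IOSTAT_SAMPLE):
        print("slow:", device, asvc_t)
```

Fed the output captured before the fix, the redundancy check reports a problem, which is the sort of thing I'd have liked to be told about before performance fell off a cliff.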