Hi,

It's been a while coming, so apologies for the delay. We've not been able to pin down a definitive explanation, but all our testing confirms our suspicion that SATA is the cause of the problems. We ran some tests with zones cloned onto SAS disks and saw an immediate improvement in the stalling behaviour. Based on that we converted one of the DL380s to use external EMC SAN storage, moved the zones from the SATA drives to the SAN, and now we can thrash several zones to death simultaneously and nothing chokes. Peak throughput is down on a per-zone basis, so individual jobs run slower, but we can now run multiple jobs in parallel with no freezing, so overall it's a win.

Incidentally, I heard from a couple of people using x4500 boxes who reported similar issues on SATA-based configurations. As with our setup, things are fine up to a point, beyond which there is a sharp drop in usability.

Cheers
Joe

====================== originally.. ===================================

Looking for some insights on a performance issue.

Platform is an HP DL380 G6, dual quad-core, 64GB RAM, with 2x 146GB disks hardware-mirrored for the system disk plus 4x 1TB SATA drives as an additional logical drive, also hardware-mirrored. The controller is a P400i with 512MB BBWC. The server houses 4 zones. The 2TB volume forms the basis of a zpool, and each zone sits in a ZFS directory on that pool. The zones will run a BI application (SAS) which is numerically and I/O intensive.

What we're seeing is that when one SAS job gets busy, the whole system locks up while it is doing its disk work. For example, we have zones A through D. A kicks off a job. Anyone else, either in the global zone or in one of the other child zones, who runs anything (w, ls, date) can wait up to 30s to get output and a prompt back. Trying to run jobs in 2 zones simultaneously causes run times to extend markedly.

The disks are pushing 200Mb/s+ sustained once they get busy (peak observed at a shade under 300Mb/s). %busy is 100%, blocking is 0 and service times are around 25ms. Drivers are the latest and greatest. The read/write cache ratio on the controller is 25%:75%. Overall CPU usage is under 15%. Things get even worse if we try to do some network transfers at the same time (e.g. scp). Machines are on 1000Base-T Ethernet.

I've run some tools like the rather funky zilstat.ksh, which indicates that ZFS itself isn't struggling. I'm aware, obviously, that the arrangement I've built means there are common disks, controllers and so on servicing all the zones. What does seem unusual is the way everything seems to block, even things in the global zone which ought not to be causing any significant I/O contention. Essentially it looks like we can thrash the disks from a single thread and get nothing else done while it's doing it.

I'm in the process of building a comparative system using HBAs/SAN instead of internal RAID, and also comparing ZFS and Veritas, to see if we can isolate a specific element as being the problem. Will update with the results. Anyone got any suggestions in the meantime?
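
For clarity, the layout described above amounts to roughly the following (purely illustrative; the device, pool and zone names are placeholders rather than our actual config):

    # the 2TB mirrored logical drive presented by the P400i becomes a
    # single-device pool
    zpool create tank c0t1d0

    # one dataset per zone, used as that zone's zonepath
    zfs create tank/zoneA
    chmod 700 /tank/zoneA
    zonecfg -z zoneA "create; set zonepath=/tank/zoneA; commit"
    zoneadm -z zoneA install
    # ...and likewise for zones B through D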
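
For reference, the standard Solaris tools for watching this kind of behaviour from the global zone while a job runs ('tank' again a placeholder for the pool name):

    # per-device utilisation and service times; the %b and asvc_t columns
    # correspond to the %busy and service-time figures quoted above
    iostat -xnz 5

    # the same traffic seen from the ZFS side, broken out per vdev
    zpool iostat -v tank 5

    # per-zone CPU/memory summary, confirming the box isn't CPU bound
    prstat -Z 5

    # ZIL activity (the zilstat.ksh mentioned above); the argument is the
    # sampling interval in seconds
    ./zilstat.ksh 5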