Hi,

*** Problem ***

I've recently upgraded VxVM / VxFS in my environment from 3.2 / 3.4 to 4.1 MP1. We also necessarily upgraded Solaris 8 to Solaris 9 in the same exercise. I did both through brand new installation rather than upgrading the old software, and both are fully patched from last month's Sun Gold CD.

root@myhost # pkginfo -l VRTSvxfs
   PKGINST:  VRTSvxfs
      NAME:  VERITAS File System
  CATEGORY:  system,utilities
      ARCH:  sparc
   VERSION:  4.1,REV=4.1B18_sol_GA_s10b74L2a
   BASEDIR:  /
    VENDOR:  VERITAS Software
      DESC:  Commercial File System
    PSTAMP:  VERITAS-FS-4.1.1.0-2005-09-30-4.1MP1=119301-02
  INSTDATE:  Sep 18 2006 20:36
   HOTLINE:  (800) 342-0652
     EMAIL:  support@veritas.com
    STATUS:  completely installed
     FILES:    234 installed pathnames
                32 shared pathnames
                 6 linked files
                47 directories
                76 executables
                 5 setuid/setgid executables
             58086 blocks used (approx)

root@myhost # pkginfo -l VRTSvxvm
   PKGINST:  VRTSvxvm
      NAME:  VERITAS Volume Manager, Binaries
  CATEGORY:  system
      ARCH:  sparc
   VERSION:  4.1,REV=02.17.2005.21.28
   BASEDIR:  /
    VENDOR:  VERITAS Software
      DESC:  Virtual Disk Subsystem
    PSTAMP:  VERITAS-4.1_p3.1:2005-10-24
  INSTDATE:  Sep 18 2006 20:23
   HOTLINE:  800-342-0652
     EMAIL:  support@veritas.com
    STATUS:  completely installed
     FILES:    828 installed pathnames
                23 shared pathnames
                18 linked files
                98 directories
               413 executables
            294561 blocks used (approx)

root@myhost #
root@myhost # uname -a
SunOS myhost 5.9 Generic_118558-28 sun4u sparc SUNW,Sun-Fire
root@myhost #

We have found severe performance degradation when copying large amounts of data between VxVM-controlled VxFS filesystems: formerly copying 90GB took about 1 hour, but it now takes about 3 hours between the same filesystems. I have also proven that I can copy outwith Vx control, between UFS filesystems, in about half of the time taken to copy between the VxVM-controlled VxFS filesystems. I have not upgraded the disk groups from their original version (90) to the latest version (120?) or the disks from version 2.2, and I suspect that may have something to do with our problem.
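For anyone wanting to run a similar comparison, here is a rough sketch of the kind of timed copies involved -- note the mount points below (/vxfs_src, /vxfs_dst, /ufs_src, /ufs_dst) are placeholders rather than our real filesystems, and this is not a record of the exact commands we ran:

    # timex cp -rp /vxfs_src/bigdata /vxfs_dst/      <- VxFS to VxFS, under VxVM control
    # timex cp -rp /ufs_src/bigdata /ufs_dst/        <- UFS to UFS, outwith Vx control
    # vxdg list mydg | grep version                  <- confirm the disk group version (90 in our case)

The same data set and an otherwise quiet system are obviously needed for the timings to mean anything.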
[[Added to summary -- In reality, the performance of our ERP and Oracle DB dropped significantly as well, but I did not detail the scope of the issue in the original post, both for clarity and for commercial reasons, and because I was confident we were looking at one root cause for both issues, given the high I/O figures exhibited for both copying and processing in general.]]

root@myhost # vxdg list mydg
Group:     mydg
dgid:      1022708368.1258.myhost
import-id: 1024.104
flags:
version:   90
alignment: 512 (bytes)
ssb:            off
detach-policy:  global
dg-fail-policy: invalid
copies:    nconfig=default nlog=default
config:    seqno=0.4714 permlen=1486 free=1442 templen=29 loglen=225
[snip]

root@myhost # vxdisk list Disk_13
Device:    Disk_13
devicetag: Disk_13
type:      auto
hostid:    myhost
disk:      name= id=1068375953.1534.gla1c102
group:     name=movextest id=1068375404.1528.gla1c102
info:      format=sliced,privoffset=1,pubslice=4,privslice=3
flags:     online ready private autoconfig autoimport
pubpaths:  block=/dev/vx/dmp/Disk_13s4 char=/dev/vx/rdmp/Disk_13s4
privpaths: block=/dev/vx/dmp/Disk_13s3 char=/dev/vx/rdmp/Disk_13s3
version:   2.2
iosize:    min=512 (bytes) max=2048 (blocks)
public:    slice=4 offset=0 len=75489280 disk_offset=4096
private:   slice=3 offset=1 len=2047 disk_offset=0
update:    time=1159263160 seqno=0.467
ssb:       actual_seqno=0.0
headers:   0 248
configs:   count=1 len=1486
logs:      count=1 len=225
Defined regions:
 config   priv 000017-000247[000231]: copy=01 offset=000000 enabled
 config   priv 000249-001503[001255]: copy=01 offset=000231 enabled
 log      priv 001504-001728[000225]: copy=01 offset=000000 enabled
Multipathing information:
numpaths:   1
c6t2900006022mypath30303135d0s2 state=enabled
root@myhost #

Veritas Support are on the case and I'm testing everything I can think of, but I would appreciate any information that anyone can provide in terms of experience or advice on this upgrade and, most importantly, on the potential root cause of our problem. I already understand that root-causing performance problems can be a tricky business and that many factors can play a part in a problem like this, but my test results give a strong indication that the Veritas upgrade software/method is a contributor to the root cause. I will summarise any responses asap.

*** Assistance Received ***

I'll start by thanking Dave Foster and Kevin Starling for their capable assistance.

*********************************************************
Try using vxtunefs to tune VxFS, which will override the I/O blocksize employed by the OS or 3rd party application. These settings can be made permanent using /etc/vx/tunefstab. A likely setting to use is

    /dev/vx/dsk/...  read_pref_io=256k

vxtunefs can change this on the fly so testing is simple.

Dave Foster

[[See the sketch at the end of this section for how a tunable like this can be tested and then made permanent.]]
*********************************************************
My hunch is that the I/O degradation is caused by a Veritas DMP/storage configuration conflict. I have experienced this problem with other Sun StorEdge arrays in the past, e.g. 6120s, 6230s, 6920s. For these arrays Veritas released an Array Support Library (essentially a package that assists DMP in managing active/active arrays). Unfortunately, I can't see an ASL available for this product. Sorry I can't offer more help.

Kevin
*********************************************************

I didn't receive that many responses on this issue, which is a little surprising as I would have thought this problem would be quite common in the SA community.
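To illustrate Dave's suggestion above, a minimal sketch of trying a tunable on the fly and then persisting it -- /u08 is one of our real mount points, but mydg/vol01 is a placeholder volume name, the value is given to vxtunefs in bytes rather than as "256k", and the exact tunefstab syntax should be checked against tunefstab(4) on your release:

    # vxtunefs -p /u08                          <- print the current VxFS tunables for the mount point
    # vxtunefs -o read_pref_io=262144 /u08      <- set 256KB on the fly; no unmount/remount needed

and, to make it permanent, an /etc/vx/tunefstab line of the form:

    /dev/vx/dsk/mydg/vol01  read_pref_io=262144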
*** Solution ***

Ultimately, Veritas Support came up with the actual solution to the problem, which was to run vxtunefs to change the value of the VxFS tunable parameter discovered_direct_iosz from 256KB to 10MB, i.e.

# vxtunefs -o discovered_direct_iosz=10485760 /u08

The above was carried out for all our Veritas filesystems, and the performance of the copy and of everything else on our machine (including Oracle and our ERP) improved significantly and more or less instantaneously. I now plan to change this permanently in /etc/vx/tunefstab for all vxfs filesystems (a sketch of such an entry is shown at the end of this section).

The explanation of this solution is as follows:

- The software we replaced was Solaris 8 and vxfs 3.2, in which the default chunk size of data processed by the cp command was 256KB.
- In Veritas 3.2, the default size of discovered_direct_iosz was 256KB as well, meaning that any data chunks of >256KB would be processed via direct I/O, as explained at http://docs.hp.com/en/B2355-90692/vxtunefs.1M.html:

    discovered_direct_iosz
    Any file I/O requests larger than the discovered_direct_iosz are handled as
    discovered direct I/O. A discovered direct I/O is unbuffered like direct I/O,
    but it does not require a synchronous commit of the inode when the file is
    extended or blocks are allocated. For larger I/O requests, the CPU time for
    copying the data into the buffer cache and the cost of using memory to buffer
    the I/O becomes more expensive than the cost of doing the disk I/O. For these
    I/O requests, using discovered direct I/O is more efficient than regular I/O.
    The default value of this parameter is 256K.

Therefore:

- cp was processed via buffered I/O prior to the upgrade (performance = OK)
- cp was processed via direct I/O after the upgrade (performance = not OK)

You should be thinking: "But direct I/O should be faster than buffered I/O"!! According to the explanation above, that is true; however, as with many performance issues, there was a further layer of complexity: iostat reported the disk I/O rate down on normal, with busy% and average values higher than normal - effectively we were giving the disks more to do than they could cope with, whereas previously the disk I/O was being throttled by the vxfs buffering.

- The new default chunk size for the cp command (not sure if this is a feature of cp or of Solaris) was 8MB - this meant that, suddenly, the copy chunks (controlled by vxfs) were processed via direct I/O rather than being buffered as before. This put an immediate additional load on our disk arrays, which caused them to be too busy (the effect of this was much more severe than the effect of the vxfs throttling that previously happened while the chunks were being buffered).
- The reality is that we were not cp'ing all over our storage array, but much other activity (including that of Oracle, the ERP, etc.) was now processing chunks of data via direct I/O rather than through vxfs buffering, and so the problem was much more general.
- Increasing the value of discovered_direct_iosz from 256KB to 10MB took the throttling off the disks and put it back onto memory/vxfs (for example, 8MB < 10MB for cp) - disk I/O dropped to normal levels and performance improved to normal levels.

*** The morals of the story ***

The "bottleneck" is now the vxfs buffering, but the performance is better with that bottleneck than with our disks/controllers.

Stress testing is very important.

The 10MB value may not be ideal, so we will likely test further (perhaps higher) values; however, don't say I didn't warn you about http://www.sun.com/blueprints/0400/ram-vxfs.pdf#search=%22vxtunefs%20trenches%22.
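For completeness, a sketch of how the fix might be made persistent across remounts -- the first vxtunefs command is the one we ran, but mydg/vol01 is again a placeholder for the real volume behind /u08, and tunefstab(4) should be checked for the exact syntax on your release:

    # vxtunefs -o discovered_direct_iosz=10485760 /u08         <- apply on the fly (10MB, in bytes)
    # vxtunefs -p /u08 | grep discovered_direct_iosz           <- verify the running value

and in /etc/vx/tunefstab so the value survives a remount:

    /dev/vx/dsk/mydg/vol01  discovered_direct_iosz=10485760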
# iostat -xnz    is a great way of looking at your busy LUNs/disks
# vxdisk -e list is a great way of seeing the storage VxVM sees, complete with WWNs

May the force be with you,

Barry