SUMMARY: Continual I/O Wait, but little disk activity.

From: Aaron Dokey <adokey_at_reidtool.com>
Date: Wed Mar 27 2002 - 11:54:59 EST

First off, thank you all for your very helpful replies.  Most of you
helped me prove that the problem is not at the OS level but rather at the
Oracle or application level (which I've always believed).  However, I am
still unable to identify the source of my high I/O wait time.

mpstat verifies what top is telling me, so top is not reporting false
statistics as was suggested by many replies.

It was also suggested that the source of my iowait may be network related,
but netstat -i doesn't show anything out of the ordinary.

Another reply suggested running vxstat to see the statistics for my VxFS
file system.  The output shows an average read time of around 5.5 ms and a
write time of 0.7 ms (yes, less than 1 ms).

So, I guess I really haven't solved anything, but I have managed to narrow
down the field.  My next steps will be into Oracle itself.

Thank you all,
-Aaron

--------------------
Aaron Dokey - MIS
Reid Tool Supply
2265 Black Creek Rd.
Muskegon, MI   49444 
(231) 777-3951
--------------------

Replies:

Erwin Fritz [efritz@glja.com]
I'd take a look at the Oracle side of things, using either the
utlbstat/utlestat scripts, or the newer perfstat utility. Chances are it's
Oracle itself that's the culprit, either through poorly-written queries,
missing indexes, or a misconfigured instance.

Kerekes, Ed [Ed_Kerekes@steris.com]
Are you running Hitachi Graph-Track?  If you are, take a look at "total
I/O rate" and "% write pending".

Casper Dik [Casper.Dik@Sun.COM]

>However in TOP there is a TON of iowait, and there is never any free CPU
>time while oracle is running:
>
>last pid:  5601;  load averages:  0.56,  0.73,  0.73          09:54:53
>103 processes: 99 sleeping, 4 on cpu
>CPU states:  0.0% idle, 17.1% user,  9.2% kernel, 73.6% iowait,  0.0% swap
>Memory: 4096M real, 2887M free, 1063M swap in use, 5716M swap free

Strange question perhaps but have you tried recompiling top?

If vmstat says the CPU is idle, it really is.

Top gets its information from a kernel data structure and might be
wrong if the binary doesn't exactly match your kernel.

Casper

Jeff Kennedy [jlkennedy@amcc.com]
Is there any tool that will give you the FC-AL statistics?  Sun doesn't
really like JNI FC-AL cards, even though I think they are superior to
QLogic.  I would start looking from the FC-AL out.

We had a similar problem with an EDA tool; it kept going to sleep after
a few minutes and would pick up again after a few more minutes.  It turned
out to be an NFS locking problem with the tool.  My system showed no
problems either, but also didn't show a high wait.  Maybe not related, but
that's where I would start looking.

The other, slight, possibility is the filesystem itself.  Who configured
it?  Is it possible the block/stripe/depth sizes are all off?

~JK

Kevin Buterbaugh [Kevin.Buterbaugh@lifeway.com]

Aaron,

     Don't rely on top.  It's not a Sun tool; they don't support it.  I
have personally seen it give incorrect information.  Run mpstat instead and
see what it says for the I/O wait.  As an aside, Sun includes a top-like
tool called prstat in Solaris 8.

     If mpstat doesn't agree with top, believe mpstat.  Ask your app vendor
to produce evidence about the "slow I/O."  If mpstat agrees with top, then
you'll need to do some more digging, obviously.

     One thing I did notice in your iostat output is that the load is not
evenly spread across all the disks.  While those that do show activity are
not very busy, they could be "bursty," i.e. there could be brief spikes of
activity that cause things to slow down, but that don't last long enough
to show up in your stats.  What's the interval you're running iostat at?
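
The "bursty" point can be checked with a little arithmetic. The following is only a sketch with invented numbers (none of them come from Aaron's system): a disk that is saturated for 2 seconds out of every 30 looks nearly idle when averaged over a 30-second interval. The function name `average_busy` is hypothetical.

```python
# Hypothetical per-second "percent busy" figures for one disk. The numbers
# are invented for illustration and are not measurements from this thread.
per_second_busy = [100.0] * 2 + [1.0] * 28   # a 2-second burst in a 30-second window


def average_busy(samples, interval):
    """Average per-second samples over consecutive windows of `interval` seconds."""
    windows = []
    for start in range(0, len(samples), interval):
        window = samples[start:start + interval]
        windows.append(sum(window) / len(window))
    return windows


coarse = average_busy(per_second_busy, 30)   # what a 30-second interval would report
fine = average_busy(per_second_busy, 1)      # what a 1-second interval would report

print("30-second average:", [round(v, 1) for v in coarse])
print("1-second peak:", max(fine))
```

With these made-up figures the long interval averages the burst away, which is why a shorter iostat interval is worth trying before ruling the disks out.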

     Another thing to look at is fsflush if your databases are in
filesystems (as I believe you indicate they are).  You may want to increase
the interval at which it runs to prevent brief bursts of activity.  HTH...

Kevin Buterbaugh
LifeWay

"Anyone can build a fast CPU.  The trick is to build a fast system." -
Seymour Cray

Brett Lanham [blanham@cleartrack.com]
I am sorry that I do not have the answer for you but I wanted to make sure
you followed up with a summary of what you learned from the list or could
pass on what you found out directly to me.  I have seen somewhat the same
thing you are seeing.  I am running Oracle 8.1.6 on Solaris 8 and our
database resides partly on the local drives and partly on external storage
(some EMC SAN storage device) connected via FC adapter.  I have seen a lot
of CPU time consumed by Oracle, and top also reports a fair amount of iowait.
I have spent quite a bit of time looking into it but I am not extremely
experienced with this type of thing.  I eventually passed it on to my DBA
and asked him to fix his queries. :-)  BTW I've got version 3.5beta12 of
top.  What version are you running?

Brett 

Greg Gallagher [ggallag@foc.com]
Hi Aaron,

   Yeah, that seems a little funny.  I had a similar problem a few
months ago, and it turned out that we were causing a large amount of
I/O but with very small amounts of data (i.e. several hundred I/Os a
second on a particular device, but with just 1k in each I/O).  It turned
out to be a developer flippantly running a flush() every time they
wrote a line to a log file.
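
To make the flush() pattern concrete, here is a minimal sketch in Python (the language of the original application is not stated, so this is an assumption). `CountingRaw` and `log_lines` are hypothetical names; the in-memory object stands in for the log file and simply counts how many writes are pushed down to it.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory stand-in for a log file that counts the writes reaching it."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0
        self.bytes_written = 0

    def writable(self):
        return True

    def write(self, data):
        # Each call here corresponds to one write pushed below the buffer.
        self.write_calls += 1
        self.bytes_written += len(data)
        return len(data)


def log_lines(n_lines, flush_every_line):
    """Write n_lines through a 64 KB buffer, optionally flushing after each line."""
    raw = CountingRaw()
    log = io.BufferedWriter(raw, buffer_size=64 * 1024)
    for i in range(n_lines):
        log.write(b"line %d of the log\n" % i)
        if flush_every_line:
            log.flush()          # the pattern described above
    log.flush()                  # push out whatever is still buffered
    return raw.write_calls, raw.bytes_written


flushed = log_lines(1000, flush_every_line=True)
buffered = log_lines(1000, flush_every_line=False)
print("flush per line: %d writes, %d bytes" % flushed)
print("buffered:       %d writes, %d bytes" % buffered)
```

The same bytes reach the "file" in both runs; only the number of separate writes differs, which matches the symptom of many tiny I/Os on one device.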

   Anywho, the only thing I see in your case is that NFS should be
looked into.  The service time is just a little high.  Check out the
NFS/NIS Tuning guide and look into nfsstat numbers.

   Also, since you're using VxFS, you may want to look at the stats
that way.  For example:

lancelot:/root/burns# vxstat
                        OPERATIONS           BLOCKS        AVG TIME(ms)
TYP NAME              READ     WRITE      READ     WRITE   READ  WRITE
vol opt             957886    982686  37495760   7304136    5.7   12.1
vol rootvol          98621    180169   2879102    297291    6.6   13.2
vol swapvol          39605     12115    633680   3086432    9.9   83.4
vol usr             477696    516392  14468454    822445    6.1   16.5
vol var             404089   1462950  20596707  16080130    5.7    8.0
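
If you want to scan output like this for slow volumes, a small script can do it. This is only a sketch: it assumes the eight-column layout of the sample above (type, name, read ops, write ops, read blocks, write blocks, average read ms, average write ms), and the 20 ms threshold and the names `parse_vxstat` and `WRITE_MS_THRESHOLD` are made up for illustration.

```python
# Data rows copied from the vxstat sample above. The 8-column layout is
# assumed from that sample, not taken from any vxstat documentation.
SAMPLE = """\
vol opt             957886    982686  37495760   7304136    5.7   12.1
vol rootvol          98621    180169   2879102    297291    6.6   13.2
vol swapvol          39605     12115    633680   3086432    9.9   83.4
vol usr             477696    516392  14468454    822445    6.1   16.5
vol var             404089   1462950  20596707  16080130    5.7    8.0
"""

WRITE_MS_THRESHOLD = 20.0  # hypothetical cut-off, chosen only for this example


def parse_vxstat(text):
    """Return {volume name: (avg read ms, avg write ms)} for each 'vol' row."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[0] != "vol":
            continue  # skip headers, blank lines, and anything unexpected
        times[fields[1]] = (float(fields[6]), float(fields[7]))
    return times


times = parse_vxstat(SAMPLE)
slow = sorted(name for name, (_, write_ms) in times.items()
              if write_ms > WRITE_MS_THRESHOLD)
print("volumes over the write threshold:", slow)
```
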

Hope this helps!

cheers,

Rakthet, Jay [Jay.Rakthet@caltech.edu]
Aaron,

You have an interesting problem.  I suggest you look at your network
bandwidth with 'netstat'; that could be a source of I/O wait.  Let me know if
you figure it out.

Jay Rakthet
Unix Systems Administrator
Administrative Technology Center, Caltech
626-395-3518
jay.rakthet@caltech.edu

Tim Chipman [chipman@ecopiabio.com]
Sounds like Oracle is thrashing, i.e. tons of I/O from your Oracle processes
hitting the disks. (I'm assuming you have no other services running on this
box that would be generating I/O?)

I'll be very interested to hear in your summary if other people believe this
to be true.

Certainly I've noted similar behaviour on our Oracle server here - an E450
with 2 gigs RAM, 4 x 400MHz CPUs, an A1000 as the direct-attached storage.
Typically when people complain of slow response time for Oracle, top (or
iostat) indicates iowaits > 50% even though the CPU usage of the Oracle
processes is never absurdly high. I get the feeling that top's per-process
load reporting is ignoring I/O waits generated by a given process, i.e.
treating them as a separate issue from the actual CPU load reported for
that process (?)

Clearly I'm not absolutely positive here though (hence my interest in the
pending summary :-)

--Tim Chipman

Dave Weis [djweis@sjdjweis.com]

One place you can look for more Oracle information is here:
http://www.tusc.com/oracle/books/overbook.html
The Oracle Performance Tuning book has lots of great stuff in it.

dave