Special thanks to Nasser Manesh.

Sorry about the delayed summary.

Answer 1 from Nasser:
"System time" in top, uptime or time refers to time spent in system calls in general. There are a few processes that dive into kernel mode and never return, hence commonly known as kernel processes, but I doubt they are the source of the problem for you. The scheduler, the page daemon and the filesystem update daemon are the three main ones (sched, pageout and fsflush on Solaris, PIDs 0, 2 and 3). nfsd (the NFS server process) also needs to run in kernel mode (either as multiple processes or as multiple LWPs, depending on your Solaris version), but you are talking about an NFS client. So check the output of vmstat and vmstat -i for system calls and interrupts: network interrupts because of a lot of connections (e.g. a system with a high volume of short-lived TCP connections, such as a web server or proxy server), serial line interrupts (bad I/O board?), etc. Run prstat to see who is usually on top, then truss those processes to see who is making a lot of system calls. If you do not mind sharing the outputs, I can take a look and tell you if I see something out of whack.

Answer 2 from Nasser:
If you are running 2.6, things could be a bit cloudy because of the way the filesystem buffer cache consumes the whole memory; in that case the output of /usr/ucb/ps axu can be a close replacement for prstat. Truss traces the system calls a specific process issues (optionally including its children), and is basically the SVR4 replacement for the good old "trace" or "ktrace". A good starting point (assuming you get a suspicious PID from ps) is:

  # truss -o /tmp/truss.out -f -p <PID>

This writes the output to a file, follows forks (and reports children) and attaches to the process identified by <PID>. You may want to try it without -o <file> first, just to see on screen how fast your process makes system calls. Constantly making system calls is not good; usually there is no reason for that.
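Nasser's suggestion to check vmstat for system call and interrupt rates can be sketched with a small awk filter. The sample line below is made up (chosen to mirror the ~73% kernel time from the question); on a live Solaris box you would pipe the output of `vmstat 5` through the same filter instead of the canned string.

```shell
# A captured vmstat line stands in here for live output; the numbers are
# hypothetical. Classic Solaris vmstat columns group as:
#   kthr(3) memory(2) page(7) disk(4) faults(3) cpu(3)
# so fields 17-19 are in/sy/cs (interrupts, syscalls, context switches)
# and field 21 is the %sys CPU figure.
sample='0 0 0 1639000 479000 0 0 0 0 0 0 0 0 0 0 0 4200 98000 1500 20 73 7'
echo "$sample" | awk '{
    printf "interrupts/s=%s syscalls/s=%s kernel%%=%s\n", $17, $18, $21
}'
# prints: interrupts/s=4200 syscalls/s=98000 kernel%=73
```

A persistently high syscall rate here is the cue to move on to prstat and truss, as the answer describes.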
Truss will stay attached as long as the process runs, so since you are presumably trussing a daemon, you will have to kill truss a few seconds after starting it. Just hit CTRL-C (or whatever your terminal interrupt is) and truss will die. It will not harm the process and is safe to run.

Answer 3 from Nasser:
It is also a good idea to check your console, dmesg and the /var/adm/messages file, to see whether you are getting excessive interrupts because of a hardware failure (as I said, the serial port is a famous one, for which you may see errors from zs0).

Answer 4 from Wolfgang Kandek:
There is a Veritas vxstat command that can give you more detailed information on a per-logical-volume basis (also RAID-5 statistics, in case you are using that), which might give you some more information. It looks to me as if you have some heavy I/O on these disks; NFS also has a tendency to increase the time spent in kernel (system) mode. There is also nfsstat, which could give you further clues about the types of operations used most frequently: nfsstat -z zeroes the counters, then after some time (one minute?) run it again to get an idea of how frequently reads/writes and directory lookups occur.

Answer 5 from William Hathaway:
The nfsXX devices are NFS mount points, not local disks; you are probably better off using nfsstat -c to troubleshoot them.

Solution from myself:
By using "iostat -xpn", I found the mounted file system which had the problem. After adding two more CPUs, the situation has improved.

Original question:
Hi, we are using Solaris 2.6. I have a Sun system (an NFS client) having a CPU problem: the system/kernel processes have consumed 50% to 70% of the CPUs. Here is the info from top:

  CPU states: 7.0% idle, 20.0% user, 72.9% kernel, 0.1% iowait, 0.0% swap
  Memory: 1024M real, 479M free, 152M swap in use, 1639M swap free

The user processes seem fine in either 'top' or '/usr/ucb/ps -aux', and they do not consume much CPU.
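The kind of check I did with iostat can be sketched as an awk filter that flags any device whose %busy (the last column of `iostat -x` output) crosses a threshold. The 20% threshold is an arbitrary illustration, and the sample lines are taken from the output in my question below; on a live system you would pipe `iostat -xn 5` through the same filter.

```shell
# Flag devices whose %b (last "iostat -x" column) exceeds 20%.
# Columns: device r/s w/s kr/s kw/s wait actv svc_t %w %b.
# The heredoc reuses sample lines from the question; the 20% cutoff
# is an arbitrary choice for illustration.
awk '$NF + 0 > 20 {
    printf "%s: %s%% busy, %s KB/s read, %s KB/s written\n", $1, $NF, $4, $5
}' <<'EOF'
nfs29     0.0    0.0     0.0     0.0  0.0  0.0   0.0   0   0
nfs30   277.9    0.8   199.9     5.8  0.0  0.3   1.0   3  25
nfs31    47.1   35.9  1413.1  1144.1  0.0  1.0  12.4   2  32
nfs32     0.0    0.0     0.0     0.0  0.0  0.0   0.0   0   0
EOF
```

Run against the sample data, this singles out nfs30 and nfs31, which matches what I found by eyeballing "iostat -xpn".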
My new question: From the following output, it seems that disks 'nfs30' and 'nfs31' have some I/O distribution problems. Is that correct? How can I prove it? The system uses Veritas Volume Manager.

  # iostat -xc 5
  .................
  nfs28     0.0    0.0     0.0     0.0  0.0  0.0   0.0   0   0
  nfs29     0.0    0.0     0.0     0.0  0.0  0.0   0.0   0   0
  nfs30   277.9    0.8   199.9     5.8  0.0  0.3   1.0   3  25
  nfs31    47.1   35.9  1413.1  1144.1  0.0  1.0  12.4   2  32
  nfs32     0.0    0.0     0.0     0.0  0.0  0.0   0.0   0   0
  nfs33     0.0    0.0     0.0     0.0  0.0  0.0   0.0   0   0
  nfs34     0.0    0.0     0.0     0.0  0.0  0.0   0.0   0   0
  .................

Thank you all!
John

_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Thu Jun 20 11:22:36 2002
This archive was generated by hypermail 2.1.8 : Thu Mar 03 2016 - 06:42:47 EST