A few days ago I posted the following plea for help:
> I have a problem that seems to pop up every few days. On a set of Suns running
> SunOS 4.0.3c with a 4/280 server and several diskless 4/110 and Sparcstation1
> clients, I occasionally get a process like
>
> USER PID %CPU %MEM SZ RSS TT STAT START TIME COMMAND
> taylor 2155 38.7 4.2 104 312 co D 11:03 317:15 Mail -N -B -f /tmp/MTda0
>
> chewing up the Ethernet bandwidth. On the file server, the (eight) nfsd
> processes are all very active, getting several % of the CPU each. /tmp on
> the diskless client with the problem is NFS-mounted on the server with the
> /etc/fstab entry
>
> hydrogen.utah.edu:/tmp /tmp nfs bg,rw,hard 0 0
>
> At the time that this problem occurred this morning, there were no MT* files in
> /tmp. As soon as I killed pid 2155, Ethernet traffic and nfsd activity returned
> to normal.
>
> Do any of you know how to prevent this from happening (short of removing
> the Mail image :-) )?
>
I don't have a final answer for this problem, but I received several helpful
replies that may contain the solution. Several people who have the same problem
also asked me to pass along any replies. Here are the suggestions I got.
1. From Aydin Edguer, CES Department, Case Western Reserve University
(edguer@curie.ces.cwru.EDU):
> I do not know for sure what is happening, but your problem may be that you
> are not allowing interrupts on the nfs partition in question. Any partition
> which is hard mounted should also have the intr option. This permits the
> client to break off an nfs request if the server does not respond.
I haven't made this change yet, but probably will when I go to separate /tmp
files for each client (see below).
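As a rough sketch of what this change might look like for the fstab entry quoted above: the helper name add_intr is made up for illustration, a POSIX awk is assumed, and the function only prints the modified lines rather than touching /etc/fstab itself.

```sh
# add_intr: hypothetical helper, not part of SunOS.
# Reads fstab-style lines and appends "intr" to the options of any
# NFS entry that is hard mounted and does not already list it.
# Modified lines come out with single spaces between fields.
add_intr() {
    awk '
        $3 == "nfs" {
            opts = "," $4 ","
            if (index(opts, ",hard,") && !index(opts, ",intr,"))
                $4 = $4 ",intr"
        }
        { print }
    ' "$@"
}

# Example:
#   echo "hydrogen.utah.edu:/tmp /tmp nfs bg,rw,hard 0 0" | add_intr
# prints:
#   hydrogen.utah.edu:/tmp /tmp nfs bg,rw,hard,intr 0 0
```

Soft-mounted entries and entries that already carry intr pass through unchanged.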
2. From Jay Williamson at Clemson University (jaysun@cs.clemson.EDU):
> We have been having the same problem with our Sun 3/50's. It seems to
> happen when SunView ends abnormally; in our case there is a (known) bug in
> SunView that causes it to core dump and this will cause the above problem.
> We have tried to help the problem by having the faculty members make sure
> these runaway processes get killed if SunView dies on them.
This doesn't appear to be the problem here. No users have reported SunView
dying on them (but we all know how good users are about reporting problems :-)).
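For anyone who wants to spot these runaway processes without waiting for a user report, here is a minimal sketch. The name busy_mail and the default threshold of 20% CPU are arbitrary choices for illustration, and a POSIX awk (with -v) is assumed.

```sh
# busy_mail: hypothetical filter. Reads "ps aux" style output on stdin
# and prints the PID of each Mail process at or above a %CPU threshold.
# Assumes the 11-column layout shown in the original post, with the
# command name in field 11; a START value containing a space would
# shift the columns and defeat this simple check.
busy_mail() {
    awk -v min="${1:-20}" '
        NR > 1 && $11 == "Mail" && $3 + 0 >= min + 0 { print $2 }
    '
}

# Example: ps aux | busy_mail 30
```

It only lists PIDs; deciding whether to kill a given process is left to the administrator.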
3. From Michael Baumann, Radiation Research Lab, Loma Linda University
Medical Center (proton!muon!baumann@ucrmath.ucr.EDU):
> Yes, see the previous comments on this list referring to the evils of
> sharing /tmp, /usr/tmp.
>
> In a word (3 words ?) _DON'T_DO_IT_ it is a Bad Thing, causing All Manner
> of Strange Problems :-) :-) :-)
>
> If you want to share tmp like that, create sub-dirs for each machine
> you have on the net
> eg in tmp on hydrogen.utah.edu, assuming clients foo and bar -
> /tmp
> /foo /bar
>
> Then in foo's fstab
> hydrogen.utah.edu:/tmp/foo /tmp nfs bg,rw,hard 0 0
>
> and similarly on bar:
> hydrogen.utah.edu:/tmp/bar /tmp nfs bg,rw,hard 0 0
>
> A kludge, but it prevents temp files from walking on each other.
I will be implementing this suggestion shortly.
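A minimal sketch of the server-side setup Michael describes, using his example clients foo and bar. The helper name make_client_tmp and the 1777 (world-writable, sticky) mode are my assumptions, chosen to mirror the usual /tmp permissions; the server would presumably also have to export the new directories to the clients.

```sh
# make_client_tmp: hypothetical helper, to be run on the server.
# $1 is the shared directory (e.g. /tmp), $2 the server name as the
# clients see it; the remaining arguments are client host names.
# Creates one subdirectory per client and prints the fstab line that
# client would use.
make_client_tmp() {
    base=$1
    server=$2
    shift 2
    for client in "$@"; do
        mkdir -p "$base/$client" || return 1
        chmod 1777 "$base/$client" || return 1   # assumed mode, as for /tmp
        echo "$server:$base/$client /tmp nfs bg,rw,hard 0 0"
    done
}

# Example: make_client_tmp /tmp hydrogen.utah.edu foo bar
```

The printed lines use the same mount options as the fstab entry in the original post.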
4. From Skip Schaller, Steward Observatory, University of Arizona
(skip@as.arizona.EDU):
> You should find an error message somewhere (/var/adm/messages ?) about an
> NFS error - stale file handle (client thought he had a file open, but
> the server says it was deleted). The problem arises when someone on a
> diskless station stays logged in over night, leaving his mailtool running
> (open or iconic). You probably have a crontab job which runs in the wee
> hours of the morning to delete files in /tmp (like the MT..... files that
> Mail has open). I don't know of a good workaround. We tell our users
> to log out when they go home. We also have modified the crontab entry to
> delete files out of /tmp only if they are more than 24 hours old (in case
> of someone working at 5am).
I haven't seen any errors like this. This problem has also occurred for
someone who was only logged in for a few minutes, so I don't think this is it.
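For sites that want to try Skip's two checks, here is a hedged sketch. Both function names are hypothetical, the exact wording of the NFS error message is an assumption (so the search is deliberately loose), and the cleanup keys on modification time, so a file that Mail holds open but has not written for over 24 hours would still be removed.

```sh
# clean_old_tmp: hypothetical helper; $1 is the directory to clean.
# Removes regular files last modified more than 24 hours ago.
# With POSIX find, "-mtime +0" matches an age of at least one full
# 24-hour period, whereas "-mtime +1" would wait for two.
clean_old_tmp() {
    find "$1" -type f -mtime +0 -exec rm -f {} \;
}

# stale_handle_msgs: hypothetical helper; $1 is a log file such as
# /var/adm/messages. Prints every line mentioning "stale", in any
# letter case, since the exact message text is not known here.
stale_handle_msgs() {
    grep -i 'stale' "$1"
}

# A crontab entry along these lines (the 4am run time is only an example):
#   0 4 * * * find /tmp -type f -mtime +0 -exec rm -f {} \;
```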
5. From Andy (capmkt!bandy@uunet.UU.NET):
> My bet is that normal users can't unlink their /usr/spool/mail/$USER
> files. Yet, if you don't have "set keep" in /usr/lib/Mail.rc, if a
> user deletes *all* of his mail, then the Mail [/usr/ucb/[Mm]ail]
> program will try to unlink it. Over and over and over again. And it
> eats CPU like it's going out of style while it's trying to do this.
This is also a possibility, although the last time the problem occurred
the user was sure he hadn't tried to delete all his mail.
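If this turns out to be the cause, the fix Andy implies is to set keep, which makes Mail truncate an emptied mailbox to zero length instead of removing it. A minimal sketch follows; the helper name ensure_keep is hypothetical and a POSIX awk is assumed.

```sh
# ensure_keep: hypothetical helper; $1 is a Mail startup file such as
# /usr/lib/Mail.rc. Appends "set keep" unless some "set" line in the
# file already lists the keep option.
ensure_keep() {
    rc=$1
    if ! awk '
            $1 == "set" { for (i = 2; i <= NF; i++) if ($i == "keep") found = 1 }
            END { exit !found }
         ' "$rc" 2>/dev/null; then
        echo 'set keep' >> "$rc"
    fi
}

# Example: ensure_keep /usr/lib/Mail.rc
```

Running it a second time leaves the file unchanged.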
6. From Steve Rumsby (steve%maths.warwick.ac.uk@NSFnet-Relay.AC.UK):
> Reminds me of a bug I fixed in the mail system on a 4.3BSD machine, in
> /bin/mail I think. Something to do with very long To: addresses causing it
> to get very confused and loop forever - fixed sized buffer I think.
>
> This is all from memory though, so I could have some of the details wrong.
> It might have been 4.1BSD, fer instance.
>
> I've never seen this on a Sun. Look at the address of the message it's trying
> to deliver and see if it is very long.
Not the problem here, but thanks anyway.
My thanks to all those who responded.
Sam Cole Chemistry Computer Center University of Utah
Internet: cole@chemistry.utah.edu Bitnet: sjcole@utahcca