My original question was how to fix NFS timeouts I was getting during
moderate to heavy loads on an NFS server with a 100 Mbps SUNSwift card
plugged into a 100 Mbps port on a BayNetworks Ethernet Workgroup Switch.
Once the timeouts start, only a soft reset of the switch seems to make
them go away -- even decreasing the load by 90% doesn't help.
Thanks to the several folks who replied and gave me some ndd commands
that improved throughput even further, plus some kernel tweaks to try
(included below). I tried 'em all, but I'm still having trouble.
I've found that if the traffic is all small packets (i.e. bing's default
108-byte packets), everything works great. But when I increase the packet
size to something larger (causing lots of continuation packets -- or
is it frames?), I *immediately* start seeing timeouts and getting really
lousy performance.
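A note for anyone trying to reproduce this: with UDP, any payload larger
than the 1500-byte Ethernet MTU gets split into IP fragments, which is
what snoop reports as continuation packets. If you don't have bing handy,
Solaris ping in statistics mode can force fragments -- the sizes and
hostname below are just examples:

    # 64-byte payloads -- no fragmentation, everything is happy
    ping -s nfsserver 64 100
    # payloads well over the 1500-byte MTU force IP fragmentation
    ping -s nfsserver 4096 100
    ping -s nfsserver 8192 100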
A SUN tech rep suggested tweaking the kernel to increase the inter-packet
gap (the pause between frames -- see InfoDoc 14273). This decreased the
overall bandwidth quite a bit, but I *still* see timeouts on traffic with
large packet sizes once the load gets heavy.
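For anyone who wants to try the same tweak: the knobs involved are, as far
as I can tell, the hme driver's inter-packet-gap parameters, settable with
ndd on the fly or in /etc/system to survive reboots. The values below are
only illustrative -- larger numbers mean a longer gap and less bandwidth:

    # on the fly (the defaults are ipg1=8, ipg2=4)
    ndd -set /dev/hme ipg1 16
    ndd -set /dev/hme ipg2 8

    # or permanently, in /etc/system:
    set hme:ipg1 = 16
    set hme:ipg2 = 8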
At this point I'm baffled, so I'm calling SUN again and placing calls
with our BayNetworks vendor (and with BayNetworks themselves) to see if I
can find some way to fix this. Thanks again to all who responded,
especially those whose messages are included below...
Brent Bice
bbice@persistence.com
> From: Tony Jago <tony@fit.qut.edu.au>
>
> I think someone should add this question to the FAQ list, as it seems
> everybody asks it but nobody ever summarises. People have been having
> trouble with the Bay switches, but with a few tweaks we have had good
> results. First of all, set your switch to full duplex if you can. In any
> case, decide which way you want to go, as the problem with the Bay gear
> is one of two things:
>
> 1. The BaySwitch 301 (which I am testing at the moment) does not do full
> duplex, so if you try to pump it full duplex it works, but it works
> slowly.
>
> 2. The 28115 (and others) don't support auto-detection of speed and
> duplex, so if the card on the other end is expecting it, you'll have a
> royal screwup and again run very slowly.
>
> So once you have your switch set to full duplex, you have to set the Sun
> to full duplex and disable that pesky auto-detection crap. I use this:
>
> ndd -set /dev/tcp tcp_old_urp_interpretation 1
> ndd -set /dev/hme adv_autoneg_cap 0
> ndd -set /dev/hme adv_100fdx_cap 1
> ndd -set /dev/tcp tcp_xmit_hiwat 65535
> ndd -set /dev/tcp tcp_recv_hiwat 65535
> ndd -set /dev/tcp tcp_cwnd_max 65534
> ndd -set /dev/tcp tcp_conn_req_max 64
>
> Let me know how this works, and if you are still having problems, try
> running snoop on the port (e.g. "snoop -d hme0") and have a look at
> what's happening.
>
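One caveat about Tony's list: ndd settings don't survive a reboot, so I
wrapped the interface settings in a little rc script, roughly like the
sketch below (the script name and run level are just what I picked):

    #!/bin/sh
    # /etc/rc2.d/S99nddhme -- reapply network tuning at boot.
    # Force hme to 100 Mbps full duplex with autonegotiation off,
    # then bump the TCP windows.
    ndd -set /dev/hme adv_autoneg_cap 0
    ndd -set /dev/hme adv_100fdx_cap 1
    ndd -set /dev/tcp tcp_xmit_hiwat 65535
    ndd -set /dev/tcp tcp_recv_hiwat 65535
    ndd -set /dev/tcp tcp_cwnd_max 65534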
> From: Steve Phelps <steve@epic.co.uk>
>
> I don't know the answer to your problem,
> but below is a set of questions that will
> help track it down.
>
> Is this a 28115 or variant thereof? We use a Synoptics 28115 with
> a SPARC20 running Solaris 2.5 and NFS without any problems.
>
Nope. It's a BayNetworks Ethernet Workgroup Switch.
> Do you have the be/hme patch for 2.4? Unfortunately I can't remember the
> patch number.
>
Unfortunately, there doesn't appear to be one. I have a sneaking
suspicion this is where the problem lies, but then again, it seems odd
that a soft reset of the switch would fix the problem ('til the next
heavy load) if it were a software problem (shrug w/confused expression).
> what kind of NFS clients are you running? (PCs, other Solaris machines etc..?)
>
> Is the SPARC20 acting as NFS server or NFS client? If it is acting as NFS
> client, what is the NFS server? Is the NFS server a 10Mbit machine?
>
It's a SPARC20 acting as an NFS server. I've also got a SPARC 4 plugged
into the switch (a DB server), and the other switch ports are occupied by
hubs. The rest of the clients are a mixed bunch of PCs and various UNIX
platforms (though for my most recent tests I only needed a handful
of the SUNs to cause the symptoms). All machines have 10 Mbps interfaces
except the server, which has only the 100 Mbps interface enabled (the
other two 10 Mbps interfaces aren't even plumbed).
> What does the following command output:
>
> egrep -i hme /var/adm/messages*
Only the messages about the interface going down or up from when I
unplugged it from one port and plugged it into another. To get us through
the day (albeit slowly) I just swapped the ports the two SPARCs were
plugged into -- everything then ran at 10 Mbps (the switch and the
SUNswift both autosense).
> From: Kent Clarstroem <kent@istiden.pp.se>
>
> Do you have the same MAC address on both cards? (ifconfig -a) If so,
> that's what's confusing your switch.
Nope. I've only got one interface from each machine plugged into the
switch. And actually, now I've got no machine on the net with two
interfaces, so it's not a duplicate MAC address thing.
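Worth knowing if you ever do hang two interfaces off the same switch,
though: by default a SUN uses one system-wide MAC address for every
interface. If I remember right, the OBP variable below makes each
interface use its own burned-in address instead:

    eeprom 'local-mac-address?=true'
    # takes effect at the next reboot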
> What I didn't quite understand was if you still want the 10MBit/s
> interface to be connected along with the 100MBit/s, or?
> If you do - change the MAC on one of them. If not - reboot your switch
> with the 100MBit/s interface connected the way you want it to.
I don't think I was very clear in my first post. I was sorta harried
at the time I wrote it (grin). In desperation, I unplugged the server's
SUNSwift card from the switch and connected it to a 10 Mbit port on one
of our hubs (which is, in turn, connected to the switch). I did a soft
reset of the switch, and all was well -- but slow. Even under the
worst loads, the problem doesn't occur this way.
It *does* seem odd to me, though, that I could ping the switch this way,
but not when I had the server plugged directly into the 100 Mbit port on
the switch... (sigh) Could I be fighting more than one problem here?
> From: russell@mds.lmco.com (Russell David) (who doesn't believe in using
> the return key - grin)
>
> I had problems with timeouts when I started running NFS V3. The problems
> had to do with dropped packets: the routers could not handle the
> data coming from the FDDI servers going to an Ethernet. I was able to see
> that packets were lost by running snoop on a host on the same subnet as
> the client that was getting the timeouts. See if all of the packets are
> received.
A friend of mine brought over a sniffer during the holidays. Both the
sniffer and snoop showed that when there were continuation frames and a
moderate to heavy load, the server wasn't responding to packets. Running
snoop on the server, I see lots of packets going to the server, so it *is*
receiving at least *some* traffic. It's not entirely clear (to this
somewhat unlearned observer, anyway) if it's receiving *all* the packets
it should be -- I don't think so, just looking at the volume of output
from snoop.
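A way to check this more rigorously, along the lines Russell suggests,
would be to capture on both ends at once and compare the counts.
Filtering on the NFS port keeps the traces manageable (hostnames and
interface names below are placeholders):

    # on the server
    snoop -d hme0 -o /tmp/server.cap host nfsclient and port 2049
    # on (or beside) the client
    snoop -d le0 -o /tmp/client.cap host nfsserver and port 2049
    # afterwards, compare what each side saw
    snoop -i /tmp/server.cap | wc -l
    snoop -i /tmp/client.cap | wc -l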
> From: Mike.Phillips@cambridge.simoco.com (Mike R. Phillips 5788)
>
> We looked at the 7-port workgroup switch but had horrific performance
> problems, so we got another 28K switch instead.
>
> Please summarise, as I am interested in getting the best out of our 28K
> switch and still have some performance problems between MS NT 3.51
> at 100BaseT and our UltraSPARCs. BTW, what were your clients?
The ndd settings that Tony gave me (see above) helped performance
tremendously (it nearly doubled). Going full duplex nearly doubled it
again. It's *really* fast now, but I don't appear to be seeing all of the
packets/frames I should be. Whether this is a problem with the switch,
the Ethernet interface, the interface driver, or the kernel is unclear
to me yet...
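Since the trouble seems to track fragmented traffic, one stopgap that
might be worth trying is capping the NFS transfer size on the clients so
each request fits in a single Ethernet frame -- slower, but it sidesteps
fragment loss entirely. On a Solaris client that would look something
like:

    # 1K reads/writes fit in one frame; no IP fragmentation
    mount -o rsize=1024,wsize=1024 nfsserver:/export /mnt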
Brent