Slow disk write speeds over network

Terry Lambert tlambert2 at mindspring.com
Wed Jun 11 03:23:11 PDT 2003


Eric Anderson wrote:
> Good news, but not done yet.. Keep reading:

Sean Chittenden also had a couple of good pieces of advice;
read his posting too.


> > You haven't said if you were using UDP or TCP for the mounts;
> > you should definitely use TCP with FreeBSD NFS servers; it's
> > also just generally a good idea, since UDP frags act as a fixed
> > non-sliding window: NFS over UDP sucks.
> 
> Most clients are TCP, but some are still UDP (due to bugs in unmentioned
> Linux distros' NFS clients).

These will be able to starve each other out.  There is a nifty
DOS against the UDP reassembly code that works by sending all
but one of the fragments of an overly large datagram.
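
If you can get the stubborn clients onto TCP at all, a mount
along these lines is worth trying (the server name, export path,
and transfer sizes here are made up; check your client's mount
man pages for the exact option spellings):

        # FreeBSD client:
        mount_nfs -T -3 -r 32768 -w 32768 bigserver:/export /mnt

        # Linux client (option names vary by distro and kernel):
        mount -t nfs -o tcp,nfsvers=3,rsize=32768,wsize=32768 \
            bigserver:/export /mnt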


> > Also, you haven't said whether you are using aliases on your
> > network cards; aliases and NFS tend to interact badly.
> 
> Nope, no aliases.. I have one card on each network, with one IP per
> card.  I have full subnets (/24) full of P4's trying to slam the NFS
> server for data all the time..

It's good that you have no aliases; the aliasing code is not
efficient with a large number of aliases.  Also, the in_pcbhash
code could use a rewrite to handle INADDR_ANY sockets better.
Neither is a problem at your load level or with your configuration.


> > Finally, you probably want to tweak some sysctl's, e.g.
> >
> >       net.inet.ip.check_interface=0
> >       net.inet.tcp.inflight_enable=1
> >       net.inet.tcp.inflight_debug=0
> >       net.inet.tcp.msl=3000
> >       net.inet.tcp.inflight_min=6100
> >       net.isr.enable=1
> 
> Ok - done.. some were defaults, and I couldn't find net.isr.enable..
> Did I need to config something on my kernel for it to show up?

You have to set a compile option; look in /usr/src/sys/net; grep
for "netisr_dispatch" or just "dispatch".

> Also, can you explain any of those tweaks?

Setting check_interface to 0 makes FreeBSD stop caring whether
a response comes in on the same interface the corresponding
request went out on.  I told you to set that one in case your
network topology was at fault.

The inflight_enable allows inflight processing, which uses an
expedited processing path.  The debug option is (or was) on by
default whenever inflight was enabled, and it adds overhead, so
it should be turned off.  Together these implement about a third
of a receiver livelock solution.

Setting the MSL down decreases your relative bandwidth delay
product; since you are using GigE, this should be relatively
low.  If you had non-local users on a VPN over a slow link,
this would probably be a bad thing, but on local GigE it's
desirable.

The net.isr.enable=1 will save you about 10ms per packet,
minimum, and more if you have high interrupt overhead that
livelocks you out of running NETISR.  What it does is turn on
direct processing of packets by IP and TCP as they come in on
the interface and you take the interrupt.  Combined with soft
interrupt coalescing and polling, it should give you another
third of the receiver livelock fixup.  The final third isn't
available unless you are willing to hack network stack and
scheduler code, since FreeBSD doesn't include LRP or Weighted
Fair Share Queuing.
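
For what it's worth, these can be flipped at runtime with
sysctl(8), and the keepers made permanent in /etc/sysctl.conf so
they survive a reboot (the values below are just the ones from
the list above):

        # at runtime:
        sysctl -w net.inet.tcp.inflight_enable=1
        sysctl -w net.inet.tcp.inflight_debug=0

        # in /etc/sysctl.conf:
        net.inet.ip.check_interface=0
        net.inet.tcp.msl=3000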


> > Given your overloading of your bus, that last one is probably
> > the most important one: it enables direct dispatch.
> >
> > You'll also want to enable DEVICE_POLLING in your kernel
> > config file (assuming you have a good ethernet card whose
> > driver supports it):
> >
> >       options DEVICE_POLLING
> >       options HZ=2000
> 
> Well, the LINT file says only a few cards support it - not sure if I
> should trust that or not, but I have Intel PRO/1000T Server Adapters -
> which should be good enough cards to support it.. I've also put 100Mbit
> cards in place of the gige's for now to make sure I wasn't hitting a
> GigE problem or negotiation problem..

You should grep for DEVICE_POLLING in the network device
drivers you are interested in using to see if they have the
support.  Also, you can get up to 15% by adding soft interrupt
coalescing code, if the driver doesn't already support it (I
added it for a couple of drivers, and it was committed after
the benchmarks showed it was good, but it's not everywhere);
the basic idea is you take the interrupt, run rx_eof(), and
call ether_input().  Then repeat the process until you hit
some count limit, or until there's no more data.  The direct
dispatch (net.isr.enable) combined with that will process most
packet trains to completion at interrupt, saving you 10ms up
and 10ms back down per packet exchange (NETISR only runs on
exit from spl or at the HZ time, which is default every 10ms).
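
A quick way to check a given driver for the polling support (the
em(4) path below assumes the Intel gigabit driver lives in the
usual place in your tree):

        grep -l DEVICE_POLLING /usr/src/sys/dev/em/*.c
        # or sweep all the drivers:
        grep -rl DEVICE_POLLING /usr/src/sys/dev /usr/src/sys/pci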


> > ...and yet more sysctl's for this:
> >
> >       kern.polling.enable=1
> >       kern.polling.user_frac=50       # 0..100; whatever works best
> >
> > If you've got a really terrible Gigabit Ethernet card, then
> > you may be copying all your packets over again (e.g. m_pullup()),
> > and that could be eating your bus, too.
> 
> Ok, so the end result is that after playing around with sysctl's, I've
> found that the tcp transfers are doing 20MB/s over FTP, but my NFS is
> around 1-2MB/s - still slow.. So we've cleared up some tcp issues, but
> yet still NFS is stinky..
> 
> Any more ideas?

If you have a choice on the disks, go SCSI; you probably won't
have a choice, though, if you've already bought them.

The tagged command queuing in ATAPI can't disconnect during a
write, only during a read, so writes serialize and reads don't.
On SCSI, neither writes nor reads serialize (at least until you
hit your tag queue depth).

Standard advice about MBUFS/NMBCLUSTERS applies; see the NOTES
files for these config options.  Also, I would make sure maxusers
is non-zero, to disable the auto-tuning: it's generally not going
to give you an optimal mix for a dedicated server, no matter
what the server is dedicated to doing.
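
For example, in the kernel config file (the numbers here are only
placeholders; size NMBCLUSTERS for your connection count and
socket buffer sizes, as NOTES describes):

        maxusers        512
        options         NMBCLUSTERS=32768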

Then there are Sean's suggestions... I don't recommend some of
them, for data integrity reasons (see my comments in response to
his post), but others are very good.  If you can get your Intel
cards to play nice with your switch, going to 8K packets
(jumbograms) will help.  In my experience, Intel doesn't play
nice with other card vendors, and there's no real standard for
MTU negotiation, so you have to futz with a lot of equipment to
get it set up (manually locking the MTU).  Also, many switches
(e.g. Alpine) don't really have enough memory in them to deal
with this.  Some GigE cards also have too little memory to do
this and offload the TCP checksum processing at the same time.
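
If the cards and the switch do cooperate, locking the MTU by hand
looks something like this (em0 and 8192 are just examples; both
ends and the switch all have to agree):

        ifconfig em0 mtu 8192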

That reminds me: make sure your checksums are being done by your
cards, if you can; checksum calculations in software are brutal
on your performance.  It is (or was) an ifconfig option.
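
On drivers that support it, that looks something like this (em0
assumed again; check ifconfig(8) and the driver's man page for
which flags it actually honors):

        ifconfig em0 rxcsum txcsum
        ifconfig em0    # look for RXCSUM/TXCSUM in the options line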

Just for grins (not for production!) you may want to mount your
FS async, and set the NFS async option Sean wrote about.
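
For the quick test only, something along these lines (device and
mount point are made up; async risks losing metadata on a crash):

        # remount the exported filesystem async -- NOT for production
        mount -u -o async /dev/da0s1e /export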

I would probably disable SYN-caching and SYN-cookie.  I *always*
disable SYN-cookie on any exposed machine (computational DOS
attack is possible); the SYN-cache is a good defense against a
DOS attack, but if this is an interior machine (and it should
be), then your firewall already protects it; SYN-cache adds
some overhead (read: latency) you probably don't want, and the
cookie code will be harmless, but isn't terribly useful unless
you are seeing a huge rate of connection attempts per second.
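
The cookie half is a plain sysctl; what knobs your version has
for the cache half varies, so look at what is actually there
first (the names below are from memory, verify them against your
system):

        sysctl net.inet.tcp | grep -i syn    # see what knobs exist
        sysctl -w net.inet.tcp.syncookies=0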

You may also want to disable slowstart and the Nagle algorithm,
but you will have to look those up (doing it makes you a bad
network citizen, and I would be aiding and abetting ;^)).  It
shouldn't be *too* bad if you're switched rather than bridged
or hubbed all the way through (L4, not L2, so no Alpine GigE).

If you are willing to hack code, PSC at CMU had a nice rate
halving implementation for a slightly older version of the
BSD stack, and both Rice University and Duke University have
an LRP implementation (Duke's is more modern), but you'll
have to know what you're doing in the stack to port any of
these.

You probably don't need to worry about load-shedding until your
machine is spending all its time in interrupt, so there's no use
going into RED-queueing or other programming work.

-- Terry

