Implementing backpressure in the NFS server

Rick Macklem rmacklem at uoguelph.ca
Thu Feb 26 00:27:38 UTC 2015


Alfred Perlstein wrote:
> 
> On 2/25/15 5:08 PM, Garrett Wollman wrote:
> > Here's the scenario:
> >
> > 1) A small number of (Linux) clients run a large number of processes
> > (compute jobs) that read large files sequentially out of an NFS
> > filesystem.  Each process is reading from a different file.
> >
> > 2) The clients are behind a network bottleneck.
> >
> > 3) The Linux NFS client will issue NFS3PROC_READ RPCs (potentially
> > including read-ahead) independently for each process.
> >
> > 4) The network bottleneck does not serve to limit the rate at which
> > read RPCs can be issued, because the requests are small (it's only
> > the responses that are large).
> >
> > 5) Even if the responses are delayed, causing one process to block,
> > there are sufficient other processes that are still runnable to
> > allow more reads to be issued.
> >
> > 6) On the server side, because these are requests for different file
> > handles, they will get steered to different NFS service threads by
> > the generic RPC queueing code.
> >
> > 7) Each service thread will process the read to completion, and then
> > block when the reply is transmitted because the socket buffer is
> > full.
> >
> > 8) As more reads continue to be issued by the clients, more and more
> > service threads are stuck waiting for the socket buffer until all of
> > the nfsd threads are blocked.
> >
> > 9) The server is now almost completely idle.  Incoming requests can
> > only be serviced when one of the nfsd threads finally manages to put
> > its pending reply on the socket send queue, at which point it can
> > return to the RPC code and pick up one request -- which, because the
> > incoming queues are full of pending reads from the problem clients,
> > is likely to get stuck in the same place.  Lather, rinse, repeat.
> >
> > What should happen here?  As an administrator, I can certainly
> > increase the number of NFS service threads until there are sufficient
> > threads available to handle all of the offered load -- but the load
> > varies widely over time, and it's likely that I would run into other
> > resource constraints if I did this without limit.  (Is 1000 threads
> > practical?  What happens when a different mix of RPCs comes in --
> > will it livelock the server?)
> >
As far as I know, even 256 is an arbitrary limit left over from when a server
was typically a single-core i386.
Since they are just kernel threads, the idle ones add very little overhead.
There is a 256 limit wired into the sources, but you can increase it and
recompile. (MAXNFSDCNT in nfsd.c)
I can't think of a reason why 1000 threads wouldn't be practical for server
hardware of the size you run.
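(The cap is just a compile-time constant; something like the following is
all that's involved, although you should check the actual definition in the
nfsd.c in your source tree before patching.)

    /*
     * Sketch only -- usr.sbin/nfsd/nfsd.c wires the thread cap into a
     * constant along these lines.  Raising it and rebuilding nfsd lets
     * "nfsd -n <count>" go beyond 256 threads.
     */
    #define MAXNFSDCNT      1024    /* stock value is 256 */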

> > I'm of the opinion that we need at least one of the following things
> > to mitigate this issue, but I don't have a good enough knowledge of
> > the RPC code to have an idea how feasible this is:
> >
> > a) Admission control.  RPCs should not be removed from the receive
> > queue if the transmit queue is over some high-water mark.  This will
> > ensure that a problem client behind a network bottleneck like this
> > one will eventually feel backpressure via TCP window contraction if
> > nothing else.  This will also make it more likely that other clients
> > will still get their RPCs processed even if most service threads are
> > taken up by the problem clients.
> >
This sounds like a good idea in theory. However, I'm not sure how you
can implement it.
As you mention, Read requests are small. However, Write requests are
fairly large.
--> one 64K Write request will result in about as many bytes in the
    socket's receive queue as something like 700 Read requests
    (a Read request is only on the order of 100 bytes on the wire).
As such, using the socket receive queue's sb_cc isn't going to work.
And since TCP is just a byte stream (the RPC record marks only become
visible as the stream is parsed), there is no way to know how many RPC
requests are in the queue until they are processed.

But, by the time the krpc has parsed an RPC request out of the
socket's receive queue, it is pretty much "too late".

For NFSv3, doing something like what file handle affinity does --
pre-parsing the RPC request -- and then putting it on some other queue
instead of handing it to an nfsd right away might be feasible.
Also, lots of Getattrs and/or Lookups isn't the same as lots of Reads.
Then you'd have to decide how long to delay the RPC. If you use the
TCP send queue length, then what about cases where there are a lot
of TCP connections for the clients on the other side of the
network bottleneck?
(However, this doesn't work for NFSv4, since all NFSv4 RPCs are the
 same -- i.e. Compound -- and it is very hard to determine what an
 NFSv4 Compound does without parsing it completely.  The current code
 only parses it as the Ops in it are executed.)

I think just having lots of nfsd threads and letting a bunch of them
block on the socket send queue is much simpler.
One of the nice things about using TCP transport is that it will apply
backpressure at this level.
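
That said, if someone wanted to experiment with (a), the natural place
for the check would be where the krpc hands a parsed request to an nfsd
thread.  A rough sketch of the idea, looking at the connection's send
socket buffer (svc_sendq_full() and the 3/4-full threshold are made up
for illustration; the real dispatch code would need rather more than this):

    /*
     * Sketch of a send-side high-water check for admission control.
     * "xprt" is the per-connection transport; svc_sendq_full() is a
     * hypothetical helper, not part of the existing krpc.
     */
    static int
    svc_sendq_full(struct socket *so)
    {
            struct sockbuf *sb = &so->so_snd;
            int full;

            SOCKBUF_LOCK(sb);
            /* Defer new requests once the send queue is ~3/4 full. */
            full = (sbused(sb) >= sb->sb_hiwat - (sb->sb_hiwat / 4));
            SOCKBUF_UNLOCK(sb);
            return (full);
    }

    /* In the dispatch path, before handing a request to an nfsd: */
    if (svc_sendq_full(xprt->xp_socket)) {
            /*
             * Leave the request queued on this connection; the TCP
             * receive window will eventually push back on the client.
             */
            return;
    }

The hard part, as noted above, is deciding what to do with the deferred
request and how this behaves when many connections share the same
bottleneck.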

> > b) Fairness scheduling.  There should be some parameter, configurable
> > by the administrator, that restricts the number of nfsd threads any
> > one client can occupy, independent of how many requests it has
> > pending.  A really advanced scheduler would allow bursting over the
> > limit for some small number of requests.
> >
I've thought about this, and so long as you go with "per TCP connection"
instead of per-client (which I think is close to the same thing in practice),
it may be a good idea.
I suspect the main problem with this is that it will negatively impact clients
when most other clients are idle. (The worst case is one client that wants to
do lots of reading when no other client mount is active.)

I thought of something like keeping a running estimate of "active clients" and
then dividing that into the total # of nfsd threads, but estimating "active
clients" will be a heuristic at best.

Also, what about the case of many clients each doing some reads behind the
network bottleneck, instead of a few clients doing lots of reads?
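
If someone did want to try it, the simplest form is probably just a count
of how many requests from each connection are currently being worked on.
A minimal sketch, assuming a hypothetical per-connection counter and
tunable (neither exists in the krpc today):

    /*
     * Sketch of a per-connection fairness cap.  conn->cs_active and
     * nfsd_maxthreads_per_conn are hypothetical names.
     */
    static u_int nfsd_maxthreads_per_conn = 16;     /* tunable knob */

    struct conn_state {
            volatile u_int  cs_active;  /* requests now in nfsd threads */
    };

    /* Before dispatching a request from this connection: */
    if (atomic_fetchadd_int(&conn->cs_active, 1) >= nfsd_maxthreads_per_conn) {
            atomic_subtract_int(&conn->cs_active, 1);
            /* Skip this connection for now; service another one. */
    }

    /* After the nfsd thread has sent its reply: */
    atomic_subtract_int(&conn->cs_active, 1);

Whether the cap should scale with how many connections are currently
active is exactly the "active clients" heuristic problem mentioned above.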

> > Does anyone else have thoughts, or even implementation ideas, on
> > this?
> The default number of threads is insanely low; the only reason I
> didn't bump them to FreeNAS levels (or higher) was because of the
> inevitable bikeshed/cryfest about Alfred touching defaults, so I
> didn't bother.  I kept them really small, because y'know people
> whine, and they are capped at ncpu * 8; it really should be higher
> imo.
> 
Heh, heh. That's why I never change defaults. Also, just fyi, I got
email complaining about this and asking that it be reverted back to "4"
threads total by default. I just suggested they contact you ;-)

> Just increase the number of nfs server threads to something higher.  I
> think we were at 256 threads in FreeNAS and it did us just fine.
> Higher seemed ok, except we lost a bit of performance.
> 
Yep, that's what I'd suggest too. If you can't get this to work well,
then look more closely at implementing one of your other suggestions.
(I'd also recompile nfsd, so that you can go past 256 threads if you need to.)

Good luck with it, rick

> The only problem you might see is on SMALL machines, where people will
> complain.  So you probably want an arch-specific override or perhaps a
> memory-based sliding scale.
> 
> If that could become a FreeBSD default (with overrides for small
> memory machines and arches) that would be even better.
> 
> I think your other suggestions are fine; however, the problem is that:
> 1) they seem complex for an edge case
> 2) turning them on may tank performance for no good reason if the
> heuristic is met but we're not in the bad situation
> 
> That said, if you want to pursue those options, by all means please do.
> 
> -Alfred