Implementing backpressure in the NFS server
Garrett Wollman (wollman at csail.mit.edu)
Wed Feb 25 22:08:25 UTC 2015

Here's the scenario:
1) A small number of (Linux) clients run a large number of processes
(compute jobs) that read large files sequentially out of an NFS
filesystem.  Each process is reading from a different file.
2) The clients are behind a network bottleneck.
3) The Linux NFS client will issue NFSPROC3_READ RPCs (potentially
including read-ahead) independently for each process.
4) The network bottleneck does not serve to limit the rate at which
read RPCs can be issued, because the requests are small (it's only the
responses that are large).
5) Even if the responses are delayed, causing one process to block,
there are sufficient other processes that are still runnable to allow
more reads to be issued.
6) On the server side, because these are requests for different file
handles, they will get steered to different NFS service threads by the
generic RPC queueing code.
7) Each service thread will process the read to completion, and then
block when it tries to transmit the reply, because the socket send
buffer is full.
8) As more reads continue to be issued by the clients, more and more
service threads are stuck waiting for the socket buffer until all of
the nfsd threads are blocked.
9) The server is now almost completely idle.  Incoming requests can
only be serviced when one of the nfsd threads finally manages to put
its pending reply on the socket send queue, at which point it can
return to the RPC code and pick up one request -- which, because the
incoming queues are full of pending reads from the problem clients, is
likely to get stuck in the same place.  Lather, rinse, repeat.
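
To make the failure mode in steps 1-9 concrete, here is a toy model of it
in Python.  It is only a sketch under simplifying assumptions that are
mine, not properties of the real RPC code: every request is a READ, a
read completes instantly, and the bottleneck link frees a fixed number
of blocked threads per tick.  The function name and its parameters are
made up for illustration.

```python
def simulate(n_threads, arrivals_per_tick, drains_per_tick, ticks):
    """Toy model of nfsd thread exhaustion behind a slow link.

    Returns a list of (blocked_threads, queued_requests), one per tick.
    """
    queued = 0    # requests waiting in the RPC receive queue
    blocked = 0   # service threads stuck waiting for socket-buffer space
    history = []
    for _ in range(ticks):
        # Requests are small, so the bottleneck does not slow arrivals.
        queued += arrivals_per_tick
        # The slow link accepts only a few replies per tick, which
        # releases that many blocked threads.
        blocked -= min(blocked, drains_per_tick)
        # Every free thread takes a request, finishes the read, and then
        # blocks trying to send the reply.
        started = min(n_threads - blocked, queued)
        queued -= started
        blocked += started
        history.append((blocked, queued))
    return history


if __name__ == "__main__":
    for tick, (blocked, queued) in enumerate(simulate(8, 4, 1, 5), 1):
        print(f"tick {tick}: blocked={blocked} queued={queued}")
```

With 8 threads, 4 arrivals per tick, and 1 reply drained per tick, every
thread is blocked by the third tick, and from then on the backlog grows
by 3 requests per tick while the server does one read per tick.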

What should happen here?  As an administrator, I can certainly
increase the number of NFS service threads until there are enough
threads to handle all of the offered load -- but the load varies
widely over time, and I would likely run into other resource
constraints if I did this without limit.  (Are 1000 threads
practical?  What happens when a different mix of RPCs comes in -- will
it livelock the server?)

I'm of the opinion that we need at least one of the following things
to mitigate this issue, but I don't know the RPC code well enough to
judge how feasible either one is:

a) Admission control.  RPCs should not be removed from the receive
queue if the transmit queue is over some high-water mark.  This will
ensure that a problem client behind a network bottleneck like this one
will eventually feel backpressure via TCP window contraction if
nothing else.  This will also make it more likely that other clients
will still get their RPCs processed even if most service threads are
taken up by the problem clients.

b) Fairness scheduling.  There should be some parameter, configurable
by the administrator, that restricts the number of nfsd threads any
one client can occupy, independent of how many requests it has
pending.  A really advanced scheduler would allow bursting over the
limit for some small number of requests.
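
As a rough illustration of how (a) and (b) could fit into a single
dispatch decision, here is a sketch in Python.  It is not based on the
actual RPC queueing code; pick_next, HIGH_WATER, MAX_THREADS_PER_CLIENT,
and the three bookkeeping structures are all hypothetical names, and the
numeric limits are arbitrary.  It also treats "the transmit queue" as a
per-client byte count, which is my assumption.

```python
HIGH_WATER = 256 * 1024        # assumed send-queue high-water mark, bytes
MAX_THREADS_PER_CLIENT = 4     # assumed administrator-configured cap


def pick_next(pending, sendq_bytes, in_service):
    """Return the first (client, request) that passes both checks, or None.

    pending     -- list of (client, request) tuples in arrival order
    sendq_bytes -- dict mapping client to bytes queued for transmit
    in_service  -- dict mapping client to threads now working for it
    """
    for i, (client, request) in enumerate(pending):
        # (a) Admission control: leave the request queued while this
        # client's transmit queue is over the high-water mark.
        if sendq_bytes.get(client, 0) > HIGH_WATER:
            continue
        # (b) Fairness: skip clients already occupying their full share
        # of service threads.
        if in_service.get(client, 0) >= MAX_THREADS_PER_CLIENT:
            continue
        pending.pop(i)
        in_service[client] = in_service.get(client, 0) + 1
        return client, request
    return None
```

A service thread would call pick_next() when it becomes free and
decrement in_service[client] once its reply has been queued.  Requests
that are skipped stay in the queue, so a congested client's backlog
builds up on its own connection instead of tying up threads.  The
bursting allowance mentioned in (b) is not modeled here.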

Does anyone else have thoughts, or even implementation ideas, on this?

-GAWollman
    
    