NFS: kernel modules (loading/unloading) and scheduling

Garrett Wollman wollman at hergotha.csail.mit.edu
Wed Feb 25 02:45:04 UTC 2015


In article
<388835013.10159778.1424820357923.JavaMail.root at uoguelph.ca>,
rmacklem at uoguelph.ca writes:

>I tend to think that a bias towards doing Getattr/Lookup over Read/Write
>may help performance (the old "shortest job first" principle), but I'm not
>sure you'll have a big enough queue of outstanding RPCs under normal load
>for this to make a real difference.

I don't think this is a particularly relevant condition here.  There
are lots of ways RPCs can pile up where you really need to do better
work-sharing than the current implementation does.  One example is a
client that issues lots of concurrent reads (e.g., a compute node
running dozens of parallel jobs).  Two such systems on gigabit NICs
can easily issue large reads fast enough to cause all 64 nfsd service
threads to block while waiting for the socket send buffer to drain.
Meanwhile, the file server is completely idle, but unable to respond
to incoming requests, and the other users get angry.  Rather than
assigning new threads to requests from the slow clients, it would be
better to let the requests sit until the send buffer drains, and
process other clients' requests instead of letting the resources get
monopolized by a single user.

Lest you think this is purely hypothetical: we actually experienced
this problem today, and I verified with "procstat -kk" that all of the
nfsd threads were in fact blocked waiting for send buffer space to
open up.  I was able to restore service immediately by increasing the
number of nfsd threads, but I'm unsure to what extent I can do this
without breaking other things or hitting other bottlenecks.[1]  So I
have a user asking me why I haven't enabled fair-share scheduling for
NFS, and I'm going to have to tell him the answer is "no such thing".
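For anyone who wants to check for the same condition, here is roughly
what the diagnosis and the stopgap looked like.  The grep pattern and
the thread count are illustrative, and the exact sysctl names vary by
FreeBSD release, so treat this as a sketch rather than a recipe:

```shell
# Dump kernel stack traces for the nfsd process's threads.  Threads
# stalled waiting for socket send-buffer space sit in sbwait; if
# nearly all of them are there, the service pool is exhausted.
procstat -kk $(pgrep -x nfsd) | grep -c sbwait

# Stopgap: restart nfsd with more service threads (here 256, an
# arbitrary example value) so other clients can still be served.
# The thread count can also be set persistently via
# nfs_server_flags="-n 256" in /etc/rc.conf.
service nfsd stop
service nfsd start
```

Note that this only enlarges the pool; it does nothing to stop one
slow client from eventually soaking up the larger pool too, which is
the real scheduling problem.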

-GAWollman

[1] What would the right number actually be?  We could potentially
have many thousands of threads in a compute cluster all operating
simultaneously on the same filesystem, well within the I/O capacity of
the server, and we'd really like to degrade gracefully rather than
falling over when a single slow client soaks up all of the nfsd worker
threads.

