Re: optimising nfs and nfsd

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Tue, 31 Oct 2023 00:31:50 UTC
On Mon, Oct 30, 2023 at 6:48 AM void <void@f-m.fm> wrote:
>
> Hi Rick, thanks for the info
>
> On Sun, 29 Oct 2023, at 20:28, Rick Macklem wrote:
>
> > In summary, if you are getting near wire speed and you
> > are comfortable with your security situation, then there
> > isn't much else to do.
>
> It seems to depend on the nature of the workload. Sometimes
> wire speed, sometimes half that. And then:
>
> 1. some clients - many reads of small files, hardly any writes
> 2. others - many reads, loads of writes
> 3. same as {1,2} above, huge files
> 4. how many clients access at once
> 5. how many clients of [1] and [2] types access at the same time
Well, here's a couple more things to look at:
- Number of nfsd threads. I prefer to set the min/max to the same
  value (which is what the "-n" option on nfsd does).  Then, after
  the server has been running for a while in production load, I do:
  # ps axHl | fgrep nfsd
  and I look to see how many of the threads have a TIME of
  0:00.00. (These are extra threads that are not needed.)
  If there is a moderate number of these, I consider it aok.
  If there are none of these, more could improve NFS performance.
  If there are lots of these, the number can be decreased, but they
  don't result in much overhead, so I err on the large # side.
  - If you have min set to less than max, the above trick doesn't
    work, but I'd say that if the command shows the max # of threads
    running, the max could be increased.
This number can be configured via options on the nfsd command line.
If you aren't running nfsd in a jail, you can also fiddle with them via
the sysctls:
vfs.nfsd.minthreads
vfs.nfsd.maxthreads
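For example (a sketch only; 64 threads and the -u/-t flags are just
illustrative, pick a count based on the "ps" check above), pinning the
server at a fixed number of threads in /etc/rc.conf looks like:
nfs_server_flags="-u -t -n 64"
or, on a running system outside of a jail:
# sysctl vfs.nfsd.maxthreads=64
# sysctl vfs.nfsd.minthreads=64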

The caveat is that, if the NFS server is also doing other things,
increasing the number of nfsd threads can result in nfsd "hogging"
the system.
--> You might be forced to reduce the number of threads to avoid this.
I prefer to set min/max to the same value for a couple of reasons...
- The above trick for determining if I have enough threads works.
- NFS traffic is very bursty. I want the threads to be sitting there ready
  to handle a burst of RPC requests, instead of the server code spinning
  up threads after it sees the burst of requests.
- Extra threads are not much overhead. An entry in the proc table plus
  a few Kbytes for a kernel stack.
(Others will disagree with this, I suspect;-)

NFSv4 server hash table sizes:
Run "nfsstat -E -s" on the server after it has been up under production
load for a while.
Look at the section near the end called "Server:".
The number under "Clients" should be roughly the number of client
systems that have NFSv4 mounts against the server.
The two tunables:
vfs.nfsd.clienthashsize
vfs.nfsd.sessionhashsize
should be something like 10% of the number of Clients.

Then add the numbers under "Opens", "Locks" and "Delegs":
The two tunables:
vfs.nfsd.fhhashsize
vfs.nfsd.statehashsize
should be something like 5-10% of that total.

If the sizes are a lot less than the above, the nfsd will spend more
CPU rattling down rather long lists of entries, searching for a match.
The above four tunables must be set in /boot/loader.conf and the
NFS server system rebooted for the change to take effect.
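As a rough worked example (all numbers here are made up for illustration):
if "nfsstat -E -s" showed about 500 under Clients and the Opens + Locks +
Delegs total was around 100000, then something like this in
/boot/loader.conf would be in the right ballpark:
vfs.nfsd.clienthashsize="50"
vfs.nfsd.sessionhashsize="50"
vfs.nfsd.fhhashsize="10000"
vfs.nfsd.statehashsize="10000"
followed by a reboot of the server.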

Now, this one is in the "buyer beware" category...
NFS clients can do writes one of two ways (there are actually others
but they aren't worth discussing):
A - Write/unstable, Write/unstable,...,Commit
B - Write/file_sync, Write/file_sync,...
After the Commit for (A) and after every Write for (B), the server is
required to have all data/metadata changes committed to stable
storage, so that a crash immediately after replying to the RPC will
not result in data loss/corruption.

The problem is that this can result in slow write performance for
an NFS server. If you understand that data loss/corruption can
occur after a server crash/reboot and can live with that, an NFS
server can be configured to "cheat" and not commit the data/metadata
to stable storage right away, improving performance.
I'm no ZFS guy, but I think "sync=disabled" does this for ZFS.
You can also set:
vfs.nfsd.async=1
to make the NFS server reply that data has been File_sync'd
so that the client never needs to do a Commit even when it specified Unstable.
*** Do this at your peril. Back when I worked for a living, I did this
    on an NFS server that stored undergrad student home dirs.
    The server was slow but solid, and the undergrads could have survived
    some corruption if the server did crash/reboot (I don't recall that
    it ever did crash).

Again, I'm no ZFS guy, but I think that setting up a ZIL on a dedicated
fast storage device (or a mirrored pair of them) is the better/correct
way to deal with this.
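To make the two approaches concrete (a sketch only; pool, dataset and
device names are made up, and sync=disabled / vfs.nfsd.async=1 carry the
data loss risk described above):
# zfs set sync=disabled tank/export      (the "cheat" on the ZFS side)
# sysctl vfs.nfsd.async=1                (the "cheat" on the NFS side)
versus adding a fast, mirrored log device to the pool instead:
# zpool add tank log mirror nda0 nda1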

NIC performance:
- Most NFS requests/replies are small (100-200 byte) messages that end up
  in their own net packet.
  This implies that a 1Gbps NIC might handle 1000+ messages in each
  direction per second, concurrently.
  --> I strongly suspect that not all 1Gbps NICs/drivers can handle 1000+
      sends and 1000+ receives per second. If yours cannot, that will
      impact NFS performance.
A simple test that will load an NFS server for this is an "ls -lR" of a
large subtree of small directories on the NFS mount.
--> The fix is probably using a different NIC/driver.
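A sketch of that test (the mount point and interface name are made up):
# time ls -lR /mnt/nfs > /dev/null
while watching the per-second packet counts on the server with something
like:
# netstat -w 1 -I ix0
If the packet rate tops out while the clients are still waiting, the
NIC/driver is the likely bottleneck.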

>
> looking for an all-in-one synthetic tester if there's such a thing.
None that I am aware of:
SPEC had (does SPEC still exist?) an NFS server load benchmark,
but it was not a freebie, so I have no access to it.
(If I recall correctly, you/your company had to become a SPEC
 member, agree to the terms under which testing and publication
 of results could be done, etc and so forth.)

rick

>
> Large single client transfers client to server are wire speed.
> Not tested much else, (not sure how), except with dd but that's
> not really a real-world workload. I'll try the things you suggested.
>
> what I can report now, on the server, so before nfs is considered:
>
> dd if=/dev/urandom of=test-128k.bin bs=128k count=64000 status=progress
>   8346009600 bytes (8346 MB, 7959 MiB) transferred 59.001s, 141 MB/s
>
> dd if=test-128k.bin of=/dev/null bs=128k status=progress
>   6550061056 bytes (6550 MB, 6247 MiB) transferred 3.007s, 2178 MB/s
>
> dd if=/dev/urandom of=test-4k.bin bs=4k count=2048000 status=progress
>   8301215744 bytes (8301 MB, 7917 MiB) transferred 78.063s, 106 MB/s
>
> dd if=test-4k.bin of=/dev/null bs=4k status=progress
>   7725998080 bytes (7726 MB, 7368 MiB) transferred 10.002s, 772 MB/s
>
> dd if=/dev/urandom of=test-512b.bin bs=512 count=16384000 status=progress
>   8382560256 bytes (8383 MB, 7994 MiB) transferred 208.019s, 40 MB/s
>
> dd if=test-512b.bin of=/dev/null bs=512 status=progress
>   8304610304 bytes (8305 MB, 7920 MiB) transferred 63.062s, 132 MB/s
>