[Bug 276299] Write performance to NFS share is ~4x slower than on 13.2

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 14 Jan 2024 03:30:30 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276299

--- Comment #10 from Rick Macklem <rmacklem@FreeBSD.org> ---
By network fabric I mean everything
from the TCP stack down, at both ends.

A problem can easily manifest itself
only during writing. Writing to an NFS server
generates very different traffic than
reading from an NFS server.
I am not saying that it is a network fabric
problem, just that good read performance does
not imply it is not a network fabric problem.

I once saw a case where everything worked fine
over NFS (where I worked as a sysadmin) until
one specific NFS RPC was done. That NFS RPC
(and only that NFS RPC) would fail.
It turned out to be a hardware bug in a
network switch. Move the machine to a port
on another switch and the problem went away.
Move it onto the problem switch and the issue
showed up again. There were no other detectable
problems with this switch and the manufacturer
returned it after a maintenance cycle claiming
it was fixed. It still had the problem, so it
went in the trash. (It probably had a memory
problem that flipped a bit for this specific case
or some such.)

Here are two examples of how a network problem might
affect NFS write performance, but not read performance.
Write requests are the only large RPC messages
sent from client->server. With a 1Mbyte write size,
each write results in about 700 1500-byte TCP segments
(for an ordinary ethernet packet size).
-> If the burst of 700 packets causes one to be dropped
   on the server (receive) end sometimes...
   (Found by seeing an improvement with a smaller wsize.)
-> If the client/sender has a TSO bug (the most common problem
   is mishandling a TSO segment that is slightly less than 64Kbytes).
   (Found by disabling TSO in the client. Disabling TSO also
    changes the timing of the TCP segments and this can sometimes
    avoid bugs.)
Have you tried a smaller rsize/wsize yet, as I suggested?
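
As a rough illustration of the burst sizes involved, here is a small
Python sketch of the segment arithmetic. The function name is made up
for this example, and the 52 bytes of per-packet overhead (20 IP + 20
TCP + 12 for the TCP timestamp option) is an assumption; the small RPC
header on each write is ignored.

```python
import math

def segments_per_write(wsize, mtu=1500, ip_tcp_overhead=52):
    """Estimate how many TCP segments one NFS write of wsize bytes needs.

    Assumes every segment carries a full MSS of payload except the last,
    and ignores the RPC/NFS header that precedes the write data.
    """
    mss = mtu - ip_tcp_overhead  # payload bytes per ethernet packet
    return math.ceil(wsize / mss)

if __name__ == "__main__":
    # Compare the burst size for a 1Mbyte wsize against smaller values.
    for wsize in (1024 * 1024, 256 * 1024, 64 * 1024, 32 * 1024):
        print(f"wsize={wsize:>8} bytes -> {segments_per_write(wsize)} segments")
```

Under these assumptions a 1Mbyte write is about 725 segments, while a
64Kbyte write is under 50, which is why a smaller wsize can hide a
problem that only shows up with large bursts.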

NFS traffic is also very different than typical
TCP traffic. For example, both 13.0 and 13.1 shipped
with bugs in the TCP stack that affected the NFS
server (intermittent hangs in these cases).

If it isn't a network fabric problem it is probably
something related to ZFS. I know nothing about ZFS,
so I can't even suggest anything beyond "sync=disabled".

Since an NFS server uses both storage (hardware + ZFS)
and networking, any breakage anywhere in these can
cause a big performance hit.
NFS itself just translates between the NFS RPC message
and VFS/VOP calls. It is conceivable that some change
in the NFS server is causing this, but these changes
are few and others have not reported similar write
performance problems for 14.0, so it seems unlikely.
