Issues with NFS RPC

Reply: Rick Macklem : "Re: Issues with NFS RPC"

From: Adam Stylinski <kungfujesus06_at_gmail.com>
Date: Tue, 06 Jul 2021 13:48:21 UTC

Hello,

So this may be something somewhat specific to my configuration, but it's
starting to smell like a bug somewhere in NFS's RPC handling (either the
Linux client or the FreeBSD rpcbind).

I have two machines, connected via a 40 Gbps direct-attached link, with
static IPs.  They are leveraging jumbo frames (9000 byte MTU).  The storage
is backed by a healthy zpool.  I can reliably reproduce this issue, but it
takes a long time (I had collected 40 GB of packet capture before I gave
up, and only then did the issue finally reappear).
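
With a 9000-byte MTU on both ends, a default-sized ping does not exercise
jumbo frames.  A minimal sketch of the arithmetic for the largest ICMP echo
payload that fits in one unfragmented frame, assuming IPv4 with no IP
options (the function and constant names are illustrative, not from any
tool):

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in a single frame of this MTU."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - overhead


print(max_icmp_payload(9000))  # payload size for the 9000-byte MTU used here
```

A ping of that payload size with fragmentation disallowed would confirm
that full-size frames actually cross the link.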

After enough time running against an NFSv3 export, VirtualBox hangs my VM,
whose disks are backed by that share.  The rsize and wsize are 128k to
match the maximum stripe size of the pool, and I'm just using plain old
sec=sys, no Kerberos involved.  The error I get from
rpcdebug on the Linux client looks as follows:
https://pastebin.com/rCv2ZTri
Error 110, which I looked up, is a generic timeout (ETIMEDOUT on Linux).
During this time, when the
server seems to be going deaf to these xids, I can ping the server over the
interface the connection is over.  Traffic flows fine, the NICs are
basically unutilized.  There are no visible errors on any of the
interfaces.  The NICs are ConnectX-3's, running in en (Ethernet) mode.  I
tried switching to NFSv4, and eventually had the same problem, but with the
added bonus that it never seems to retransmit successfully and hangs in
perpetuity (NFSv3 eventually recovers, after what is likely the 600-second
timeout).
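
For reference, errno numbers are platform-specific, so the value is best
looked up on the machine that logged it.  A minimal sketch using Python's
standard errno module (describe_errno is an illustrative helper name):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME (message)' for an errno value on the running platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


# ETIMEDOUT is 110 on Linux but 60 on FreeBSD, so the raw number alone
# is ambiguous when comparing logs from the client and the server.
print(errno.ETIMEDOUT, describe_errno(errno.ETIMEDOUT))
```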

These seem to be fairly reliable NICs, and I don't see anything on the
server or client to indicate that it's a network hardware issue.  Is there
anything I can do to diagnose this on the FreeBSD server end?  The Linux
kernel's rpcdebug facilities seem to produce mostly noise.

I did manage to run Wireshark on the client during this stall period, and I
noticed some TCP packets that were classified as duplicate ACKs when
the NFS traffic finally started flowing again.