Re: Sorry to mail you directly with an NFS question...

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Fri, 26 May 2023 23:43:59 UTC
I am reposting this to the mailing list, as permitted
by Terry Kennedy, since others might be able to
provide useful input and the discussion gets archived.

rick

On Fri, May 26, 2023 at 4:34 PM Terry Kennedy <terry-freebsd@glaver.org> wrote:
>
> On 2023-05-26 18:47, Rick Macklem wrote:
> > Btw, in general it is better to post to a freebsd mailing
> > list (I read freebsd-current, freebsd-stable, freebsd-fs
> > plus assorted others). That way replies get archived so
> > that others can find them...
> > (If you agree, I will repost this to freebsd-stable@.)
>
>    That's fine. I expect I'll get a couple of "that version is
> unsupported!" replies...
>
>    I'll answer this here (feel free to repost) but I will
> continue further discussion on freebsd-stable@.
>
> > Almost all of these intermittent hangs are the result
> > of network fabric issues. Also, NFSv3 has changed
> > very little in recent years.
>
>    Both servers are on the same switch which has not logged
> any errors. Swapping drives to go back to FreeBSD 12.4 on
> the same hardware makes the problem go away.
>
> > On Fri, May 26, 2023 at 12:55 AM Terry Kennedy
> > <terry-freebsd@glaver.org> wrote:
> >>
> >>    ... but before I open a PR on something that's possibly a
> >> silly mistake of mine, I figured I'd ask you since you probably
> >> know the answer off the top of your head.
> >>
> >>    You probably don't remember me - way back in the day we
> >> worked together to solve some BSD/OS NFS issues. Back then
> >> I was terry@spcvxa.spc.edu.
> >>
> >>    I upgraded a system from the latest 12-STABLE to the latest
> >> 13-STABLE and I'm seeing random hangs when doing I/O to a
> >> filesystem exported from 10.4 (yes, I know it is unsupported,
> >> that's why I'm updating 8-). This all worked fine when the
> >> client was running 12-STABLE.
> >>
> >>    The filesystem is exported from 10.4 with:
> >>
> >> /usr/local/src  -maproot=root 192.168.1.166
> >>
> >>    The filesystem is mounted from 13.2 with:
> >> gate:/usr/local/src /usr/local/src      nfs     noinet6,rw,tcp,bg,intr
> >>
> >>    Both the 10.4 server and the 13.2 client have a very vanilla
> >> NFS config in /etc/rc.conf:
> >>
> >> nfs_client_enable="YES"
> >> nfs_server_enable="YES"
> >>
> >>    Processes hang in a D+ state on the client:
> >>
> >> (0:13) 165h:/sysprog/terry# ps -ax | grep D+
> >> 4222  0  D+     0:00.01 edt root.cshrc
> >> 4183  1  D+     0:00.03 _su (tcsh)
> >> 4229  2  D+     0:00.00 ls -F
> >> 4258  3  D+     0:00.00 umount -f /usr/local/src
> >>
> > For this to be useful, you need to do "ps axHl" and include
> > all processes/threads including nfsiod threads. The most
> > important info is the "wchan", which tells us what the process/thread
> > is sleeping on. (All D means is that it is sleeping without PCATCH
> > and, as such, cannot be interrupted by a signal.)
> >
> > Having said the above, I suspect one or more threads are waiting
> > for RPC replies and the others are waiting for resources (like buffer
> > or vnode locks) held by the ones waiting for an RPC reply.
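Once the "ps axHl" output is captured, it can help to group the D-state threads by wait channel to see which resource most of them are blocked on. Below is a minimal sketch of a hypothetical helper (tally_wchans is not an existing FreeBSD tool). It assumes the header names the wait-channel column MWCHAN and the state column STAT, as FreeBSD's "ps -l" format does, and that COMMAND is the last column.

```python
from collections import Counter


def tally_wchans(ps_output: str) -> Counter:
    """Count D-state (uninterruptible) threads per wait channel.

    Hypothetical helper. Assumes the first non-blank line is a ps header
    containing MWCHAN and STAT columns and that COMMAND is the last column,
    as in FreeBSD's "ps -l" format.
    """
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    header = lines[0].split()
    wchan_col = header.index("MWCHAN")
    stat_col = header.index("STAT")
    counts = Counter()
    for line in lines[1:]:
        # COMMAND may contain spaces, so stop splitting once it is reached.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # skip truncated or malformed rows
        if fields[stat_col].startswith("D"):
            counts[fields[wchan_col]] += 1
    return counts
```

For example, after saving the output with "ps axHl > ps.txt" on the hung client, tally_wchans(open("ps.txt").read()).most_common() would list the busiest wait channels first.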
>
>    I'll trigger the fault and do that.
>
> >>    Neither the 13.2 client nor the 10.4 server logs any errors.
> >>
> >>    Once one process gets into this state, any further attempts
> >> at I/O to the filesystem also hang.
> >>
> >>    A "umount -f /usr/local/src" just chews CPU in [nfs] state:
> > The command to unmount a hung NFS mount is:
> > # umount -N /usr/local/src
> > --> It can take a couple of minutes to complete, depending upon
> >      the situation. (Look at "man umount".)
>
>    I figure that after 30 minutes it just isn't going to happen.
>
>    The 10.4 server continues providing the export to other clients,
> so I'm pretty sure this is on the 13.2 side.
>
> >> (1:3) 165h:/sysprog/terry# umount -f /usr/local/src
> >> load: 0.02  cmd: umount 4258 [nfs] 5.29r 0.00u 0.00s 0% 2324k
> >> load: 0.01  cmd: umount 4258 [nfs] 574.56r 0.00u 0.00s 0% 2324k
> >> load: 0.01  cmd: umount 4258 [nfs] 1400.94r 0.00u 0.00s 0% 2324k
> >> load: 0.06  cmd: umount 4258 [nfs] 2595.05r 0.00u 0.00s 0% 2324k
> >>
> >>    Oddly, lsof gets stuck in a [vfs_busy] state without producing
> >> any output. It also spins and chews CPU:
> >>
> >> (0:17) 165h:/# lsof
> >> load: 0.00  cmd: lsof 4633 [vfs_busy] 2.00r 0.00u 0.00s 0% 3804k
> >> load: 0.00  cmd: lsof 4633 [vfs_busy] 47.67r 0.00u 0.00s 0% 3804k
> >> load: 0.00  cmd: lsof 4633 [vfs_busy] 146.17r 0.00u 0.00s 0% 3804k
> >> load: 0.01  cmd: lsof 4633 [vfs_busy] 1022.97r 0.00u 0.00s 0% 3804k
> >>
> >>    Neither of these comes anywhere close to consuming a CPU core -
> >> a "top -SH 5000 | grep nfs" produces:
> >>
> >> (0:11) 165h:/sysprog/terry# top -SH 5000 | grep nfs
> >>    PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
> >>   2471 root         52    0    74M    17M select   0   0:00   0.00% nfsd
> >>   4183 root         20    0    16M  4168K nfs      8   0:00   0.00% tcsh
> >>   2472 root         52    0    12M  5280K rpcsvc  23   0:00   0.00% nfsd{nfsd: master}
> >>   4258 root         22    0    12M  2340K nfs     16   0:00   0.00% umount
> >>   4229 root         20    0    13M  2768K nfs     20   0:00   0.00% ls
> >>   4691 root        -16    -     0B    16K -       16   0:00   0.00% newnfs 0
> >>   2472 root         52    0    12M  5280K rpcsvc   5   0:00   0.00% nfsd{nfsd: service}
> >>   2472 root         52    0    12M  5280K rpcsvc  23   0:00   0.00% nfsd{nfsd: service}
> >>   [... 189 more nfsd{nfsd: service} threads of PID 2472, all in
> >>   rpcsvc state with 0:00 TIME and 0.00% WCPU ...]
> >>
> >>    The network configuration is rather byzantine (FIBs on VLANs
> >> on a LAGG group with MTU 9000) but is exactly the same config
> >> that worked on the same hardware when the client was running
> >> 12.4 instead of 13.2.
> > A 9K MTU is asking for trouble. If the NIC's driver uses 9K jumbo
> > mbuf clusters, that fragments the mbuf cluster pool and this can
> > cause a hang. Many have said 9K jumbo mbuf clusters should not
> > exist, but they do...
> > - Either go to ordinary 1500-byte MTU packets all around or, if that
> >   is impossible, consider increasing kern.ipc.nmbclusters a lot,
> >   which might keep the fragmentation from becoming catastrophic.
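One common explanation for why 9K clusters are fragile is that each one must come from physically contiguous memory: a 9K jumbo cluster is 9216 bytes, so with 4 KiB pages it spans three adjacent pages. The toy model below is not FreeBSD's actual allocator; the page size and the free-page map are assumptions for illustration. It shows how plenty of free memory can still yield no 9K clusters once the free pages are scattered.

```python
PAGE_SIZE = 4096        # assumed 4 KiB pages, as on amd64
CLUSTER_9K = 9 * 1024   # 9216 bytes, the size of a 9K jumbo mbuf cluster


def pages_needed(size: int, page: int = PAGE_SIZE) -> int:
    """Pages spanned by a physically contiguous buffer (rounded up)."""
    return -(-size // page)


def allocatable_clusters(free_map: list, size: int = CLUSTER_9K) -> int:
    """Toy model: count clusters that fit in runs of adjacent free pages.

    Each True in free_map is a free physical page. A cluster must occupy
    consecutive free pages, so scattered free pages are useless to it.
    """
    need = pages_needed(size)
    total = run = 0
    for free in list(free_map) + [False]:  # sentinel flushes the last run
        if free:
            run += 1
        else:
            total += run // need
            run = 0
    return total
```

For example, twelve pages in a repeating free/free/used pattern give eight free pages but zero 9K clusters, while the same eight free pages in a single run give two.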
>
>    I lost the 9K MTU war on the wider Internet (while working at
> several ISPs over the last 3 decades, trying to get peers/upstreams
> to fix their problems and either forward the packets or at least
> respond with an ICMP "fragmentation needed" message).
>
>    But on the local LAN everything has been working well (various
> FreeBSD versions up to 12.4-STABLE, Windows 10 systems, and so on).
> Again, this all works fine everywhere except the 13.2 system. I
> get 700+ MByte/sec transfers all around.
>
>    kern.ipc.nmbclusters is currently at the default of 8162241 on
> the client that hangs.
>
> > # sysctl -a | fgrep vm.uma.mbuf
> > - Will give you some stats on mbuf cluster allocation that might be
> >   useful.
>
>    Added to the list of things to look at on a hang.
>
> > Other net things to try:
> > - set net.inet.tcp.sack.enable=0. This was a workaround for a
> >   13.1 TCP stack bug that caused intermittent hangs. (This bug
> >   should be fixed in 13.2, but??)
> > - set net.inet.tcp.tso=0. Many intermittent NFS hangs in the past
> >   have been caused by buggy TSO NICs/drivers.
> > - switch to a different kind of NIC/driver.
> >   - If this is not possible, look for an alternate driver for your NIC
> >     on a vendor download site. The vendor's drivers tend to be
> >     quite different from what is distributed with FreeBSD.
> > - Try disabling assorted things for your NIC/driver via "ifconfig",
> >   such as lro, rxcsum, txcsum.
> > - Look at any stats the NIC's driver generates, which might give
> >   you a hint w.r.t. what is broken in the network fabric.
>
>    The NIC is an Intel 82599ES on a motherboard mezzanine card, so
> not too easy to change. I just ordered a Broadcom 57840S-based
> mezzanine card and will try swapping it if nothing else helps.
>
>    Since it is Intel and not some off-brand part and the same hardware
> works with 12.4, I think a newly-introduced driver bug is rather
> unlikely.
>
> > Also, when hung:
> > # netstat -a
> > - To see what state the TCP connection for the mount is in
> >   and to see if its Send-Q or Recv-Q is non-empty
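If the "netstat" output is long, a small filter can pull out the row for the NFS mount. This is a hypothetical sketch (nfs_connections is not an existing tool), assuming numeric output from "netstat -an" (so addresses appear as a.b.c.d.port), the column order Proto, Recv-Q, Send-Q, Local Address, Foreign Address, (state), and a server on the standard NFS port 2049.

```python
def nfs_connections(netstat_output: str, port: str = "2049") -> list:
    """Return (recv_q, send_q, foreign_addr, state) for TCP rows to `port`.

    Hypothetical filter. Assumes numeric "netstat -an" output, where the
    port follows the address after a final dot (e.g. 192.0.2.1.2049), and
    the column order Proto, Recv-Q, Send-Q, Local Address, Foreign Address,
    (state).
    """
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP/unix sockets and anything too short to parse.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, send_q, _local, foreign, state = fields[:6]
        if foreign.rsplit(".", 1)[-1] == port:
            rows.append((int(recv_q), int(send_q), foreign, state))
    return rows
```

A Send-Q that stays non-zero across several runs would suggest the client has data queued that the server is not acknowledging.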
>
>    Added to the list of things to check.
>
> > # ping <nfs-server>
> > - To see if basic network connectivity is still in place.
>
>    It is. In fact, another NFS export from the same 10.4 system
> to the same 13.2 system continues working after the other mount
> is hung.
>
>    I would expect that, after sitting for an hour with the mount
> 'stuck', I would get an "NFS server not responding" message on
> the 13.2 client.
>
> >
> >>
> >>    The obvious answer is "it isn't guaranteed to work / don't
> >> do that". I'm hoping there's a simple fix so I can proceed with
> >> my upgrades. If I have to, I can move the contents of the mount
> >> to a 12.4 system and export it from there, but that involves
> >> changing /etc/fstab on a few dozen systems as well as new backup
> >> procedures.
> > No obvious simple fix. Getting rid of the 9K jumbo frames would be
> > my first item to try.
> >
> > rick
> > ps: There were known TCP stack bugs in both 13.0 and 13.1 that
> >        caused intermittent NFS hangs, although these were observed
> >        at the NFS server end and should be fixed for 13.2. There is no
> >         known TCP bug that causes NFS hangs at this time.
> >>
> >>    Any advice would be greatly appreciated!
> >>
> >>          Thanks!
> >>          Terry Kennedy
> >>
>
>          Thanks!
>          Terry Kennedy