FreeBSD 13.2 NFS client mount hangs

From: J David <j.david.lists_at_gmail.com>
Date: Fri, 29 Sep 2023 22:55:00 UTC

I have noticed a new (to me) hang on FreeBSD NFS client machines
running 13.2-RELEASE-p2.

It's happened twice this week to Apache processes.  In each case it is
the httpd process running with EUID root, and it appears to happen
while the process is starting up or reconfiguring, i.e., while it's
reading the configs.

The configs are not on NFS storage, but the vhost document roots are.

The ps output for the process looks like this:

    0 19557 19548  3  25  5  25248 12036 nfstry   DN    -      0:12.85 /usr/local/apache/2.4/bin/httpd -D FOREGROUND -f /usr/local/apache/2.4/conf/httpd.conf

The procstat -kk output looks like this:

  PID    TID COMM                TDNAME              KSTACK
19557 100341 httpd               -                   mi_switch+0xc2
sleepq_timedwait+0x2f _sleep+0x1ce clnt_vc_call+0x866
clnt_reconnect_call+0x626 newnfs_request+0xc36 nfscl_request+0x5a
nfsrpc_getattr+0xbb nfs_close+0x489 vop_sigdefer+0x2b
VOP_CLOSE_APV+0x1c vn_close1+0x16a vn_closefile+0x3d _fdrop+0x11
closef+0x24b closefp_impl+0x69 amd64_syscall+0x10c
fast_syscall_common+0xf8

The process slowly gains CPU time (a few hundredths of a second per
minute) but is immune to kill -9, so it doesn't seem to be coming out
of the kernel at any point.

I tried running procstat -kk every few seconds to see if I would get
anything different to show what it's doing. Most are the same as
above, but I also got this:

19557 100341 httpd               -                   mi_switch+0xc2
sleepq_timedwait+0x2f _sleep+0x1ce nfs_catnap+0x47
newnfs_request+0x14b3 nfscl_request+0x5a nfsrpc_getattr+0xbb
nfs_close+0x489 vop_sigdefer+0x2b VOP_CLOSE_APV+0x1c vn_close1+0x16a
vn_closefile+0x3d _fdrop+0x11 closef+0x24b closefp_impl+0x69
amd64_syscall+0x10c fast_syscall_common+0xf8

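(Both traces are identical from nfscl_request+0x5a outward.  They
differ from newnfs_request inward: here it is at newnfs_request+0x14b3
calling nfs_catnap, rather than at newnfs_request+0xc36 calling
clnt_reconnect_call.)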
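
The point where the two traces diverge can also be checked
mechanically.  This is only a minimal sketch: it assumes the two KSTACK
strings are copied from the procstat -kk output in this message with
the line wraps removed, and parse_kstack and shared_outer_frames are
hypothetical helper names, not part of any FreeBSD tool.

```python
def parse_kstack(kstack):
    """Split a KSTACK string into (function, offset) frames, innermost first."""
    frames = []
    for token in kstack.split():
        func, _, offset = token.partition("+")
        frames.append((func, offset))
    return frames


def shared_outer_frames(a, b):
    """Count identical frames, walking from the outermost (last) frame inward."""
    shared = 0
    for frame_a, frame_b in zip(reversed(a), reversed(b)):
        if frame_a != frame_b:
            break
        shared += 1
    return shared


# KSTACK columns copied from the two procstat -kk samples in this message.
TRACE_1 = (
    "mi_switch+0xc2 sleepq_timedwait+0x2f _sleep+0x1ce clnt_vc_call+0x866 "
    "clnt_reconnect_call+0x626 newnfs_request+0xc36 nfscl_request+0x5a "
    "nfsrpc_getattr+0xbb nfs_close+0x489 vop_sigdefer+0x2b "
    "VOP_CLOSE_APV+0x1c vn_close1+0x16a vn_closefile+0x3d _fdrop+0x11 "
    "closef+0x24b closefp_impl+0x69 amd64_syscall+0x10c "
    "fast_syscall_common+0xf8"
)
TRACE_2 = (
    "mi_switch+0xc2 sleepq_timedwait+0x2f _sleep+0x1ce nfs_catnap+0x47 "
    "newnfs_request+0x14b3 nfscl_request+0x5a nfsrpc_getattr+0xbb "
    "nfs_close+0x489 vop_sigdefer+0x2b VOP_CLOSE_APV+0x1c vn_close1+0x16a "
    "vn_closefile+0x3d _fdrop+0x11 closef+0x24b closefp_impl+0x69 "
    "amd64_syscall+0x10c fast_syscall_common+0xf8"
)

first = parse_kstack(TRACE_1)
second = parse_kstack(TRACE_2)
shared = shared_outer_frames(first, second)

# Everything before the shared outer frames is the part that can differ.
print("shared outer frames:", shared)
print("inner frames, first trace: ", first[:len(first) - shared])
print("inner frames, second trace:", second[:len(second) - shared])
```

Comparing from the outermost frame inward keeps the syscall entry path
aligned even though the two stacks have different depths.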

I started unmounting NFS filesystems until I hit one where umount
hung.  An ls on that filesystem also hung. However, an ls of that
filesystem from another client machine worked fine, so it does appear
to be a client-side issue rather than a server problem.  umount -f
also hung.  umount -N did unmount it very quickly and that caused all
the hanging umounts and the httpd process to exit immediately.

I didn't find anything useful in the syslog or dmesg.  The only
NFS-related entries are a handful of "nfsv4 err=10068" messages that
look like they date from around when the system booted (about 5 days
ago).

The mount flags are:

nfsv4,minorversion=2,oneopenown,tcp,resvport,nconnect=1,hard,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2147483647
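
For anyone who wants to compare these flags against another client,
the option string can be split into a dictionary.  This is only a
sketch: parse_mount_options is a hypothetical helper, and it only
splits the text; it does not interpret what each option means to
mount_nfs.

```python
def parse_mount_options(opts):
    """Split a comma-separated option string into a dict.

    Options of the form key=value map to the value as a string;
    bare flags such as "hard" map to True.
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


# The mount flags exactly as quoted in this message.
MOUNT_FLAGS = (
    "nfsv4,minorversion=2,oneopenown,tcp,resvport,nconnect=1,hard,cto,"
    "sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,"
    "negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,"
    "readahead=1,wcommitsize=16777216,timeout=120,retrans=2147483647"
)

options = parse_mount_options(MOUNT_FLAGS)

# Print the options most relevant to retry behaviour.
for name in ("hard", "timeout", "retrans", "nconnect"):
    print(name, "=", options.get(name))

# retrans is set to the largest signed 32-bit integer.
print("retrans is 2**31 - 1:", int(options["retrans"]) == 2**31 - 1)
```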

Is there any other information I could provide or try to catch next
time that would help debug this?

Thanks!