Re: 13-stable NFS server hang

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Wed, 28 Feb 2024 14:57:03 UTC
On Tue, Feb 27, 2024 at 9:30 PM Garrett Wollman <wollman@bimajority.org> wrote:
>
> Hi, all,
>
> We've had some complaints of NFS hanging at unpredictable intervals.
> Our NFS servers are running a 13-stable from last December, and
> tonight I sat in front of the monitor watching `nfsstat -dW`.  I was
> able to clearly see that there were periods when NFS activity would
> drop *instantly* from 30,000 ops/s to flat zero, which would last
> for about 25 seconds before resuming exactly as it was before.
>
> I wrote a little awk script to watch for this happening and run
> `procstat -k` on the nfsd process, and I saw that all but two of the
> service threads were idle.  The three nfsd threads that had non-idle
> kstacks were:
>
>   PID    TID COMM                TDNAME              KSTACK
>   997 108481 nfsd                nfsd: master        mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common
My guess is that "master" is just a glitch here. The RPCs are all serviced
by kernel threads. The "server" process is just the process these threads
would normally be associated with. The "master" process only enters the
kernel to add a new TCP socket for servicing. (ie. I don't know why procstat
thought the kernel thread was associated with "master" instead of "server",
but it shouldn't matter.)

>   997 960918 nfsd                nfsd: service       mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
This one is a fresh mount (note the exchangeid), and it needs an exclusive
lock on the NFSv4 state. Put another way, it is trying to stop all other
nfsd thread activity so that it can add the new client.

>   997 962232 nfsd                nfsd: service       mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
Copy file range has some code in the generic function that limits the copy
time to one second for a large copy. I had hoped this was sufficient to keep
Copy from "hogging" the server. My guess is that the 1 second limit is not
working, perhaps because the ZFS function does not return for a long time.
When I wrote the code, ZFS did not have its own VOP_COPY_FILE_RANGE().
(There is a special "kernel only" flag that needs to be set in the argument
to vn_generic_copy_file_range() that ZFS might not be setting. I'll look
later today. However, it will also need to be fixed for the case where ZFS
does not call vn_generic_copy_file_range() at all.)
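
Roughly, the mechanism in the generic function is something like the sketch
below. This is only an illustration of the idea, not the actual
vn_generic_copy_file_range() code; copy_one_block() is a made-up placeholder,
and the real function does the I/O and flag handling quite differently:

#include <sys/param.h>
#include <sys/time.h>

static int copy_one_block(size_t off, size_t len);	/* placeholder, not a real VFS call */

/*
 * Sketch of the one-second cap: note the start time, copy one block at a
 * time, and stop issuing further I/O once roughly a second has elapsed,
 * returning a short (partial) copy so the nfsd thread can drop its locks.
 */
static int
copy_with_time_cap(size_t *lenp, size_t blksize)
{
	struct timespec start, now;
	size_t done = 0;
	int error = 0;

	getnanouptime(&start);
	while (done < *lenp) {
		size_t chunk = MIN(blksize, *lenp - done);

		error = copy_one_block(done, chunk);
		if (error != 0)
			break;
		done += chunk;

		getnanouptime(&now);
		if (now.tv_sec - start.tv_sec >= 1)
			break;		/* ~1 second is enough; stop early */
	}
	*lenp = done;			/* report the partial length copied */
	return (error);
}

The point is that the cap only helps if the copy is actually done in this
loop; a filesystem that does the whole copy in its own VOP before returning
never hits the check.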

When I first did the code, I limited the time spent in VOP_COPY_FILE_RANGE()
by clipping the size of the Copy, but the problem is that it is hard to
estimate how long a transfer of a given size will take without knowing any
details about the storage. I might have to go back to this approach.
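
For contrast, the size-clipping approach looks roughly like the sketch below.
The 16 MB ceiling is an arbitrary placeholder, and picking a good value is
exactly the estimation problem mentioned above:

#include <sys/types.h>

/*
 * Sketch of the "clip the size" approach: clamp each Copy request to a
 * fixed ceiling so a single VOP_COPY_FILE_RANGE() call cannot run for too
 * long.  The ceiling is a placeholder; a sensible value depends on how fast
 * the underlying storage is, which the server does not know.
 */
#define	COPY_CLIP_LEN	(16 * 1024 * 1024)	/* placeholder ceiling */

static uint64_t
clip_copy_len(uint64_t len)
{

	return (len > COPY_CLIP_LEN ? COPY_CLIP_LEN : len);
}

A short copy is acceptable for the NFSv4.2 Copy operation, since the reply
carries the number of bytes actually copied and the client can continue with
the remaining range.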

>
> I'm suspicious of two things: first, the copy_file_range RPC; second,
> the "master" nfsd thread is actually servicing an RPC which requires
> obtaining a lock.  The "master" getting stuck while performing client
> RPCs is, I believe, the reason NFS service grinds to a halt when a
> client tries to write into a near-full filesystem, so this problem
> would be more evidence that the dispatching function should not be
> mixed with actual operations.  I don't know what the clients are
> doing, but is it possible that nfsrvd_copy_file_range is holding a
> lock that is needed by one or both of the other two threads?
Sort of. My guess is that ZFS's VOP_COPY_FILE_RANGE() is taking a long time
doing a large copy while holding the shared lock on the nfsd. Meanwhile, the
second thread is trying to acquire an exclusive lock on the nfsd so that it
can add the new client. This means the other nfsd threads will gradually get
blocked behind the exclusive request until all of the current shared-lock
holders have released the lock.
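
A userland analogy of that blocking pattern, just to make the sequence
concrete (the real code uses the nfsd's own nfsv4_lock primitive, not
pthreads, and these function names are made up):

#include <pthread.h>

/*
 * Analogy only: treat the NFSv4 state lock as a reader/writer lock.  A
 * long-running Copy keeps a shared (read) hold, ExchangeID wants an
 * exclusive (write) hold, and once the exclusive request is pending, newly
 * arriving RPCs queue up behind it, so service flatlines until the Copy
 * finally drops its shared hold.
 */
static pthread_rwlock_t state_lock = PTHREAD_RWLOCK_INITIALIZER;

void
service_rpc(void)			/* ordinary RPC, e.g. the Copy */
{
	pthread_rwlock_rdlock(&state_lock);
	/* ... a multi-gigabyte copy here holds this for many seconds ... */
	pthread_rwlock_unlock(&state_lock);
}

void
exchangeid_rpc(void)			/* new client mount */
{
	pthread_rwlock_wrlock(&state_lock);	/* waits for all readers */
	/* ... add the new clientid ... */
	pthread_rwlock_unlock(&state_lock);
}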

>
> Near-term I could change nfsrvd_copy_file_range to just
> unconditionally return NFSERR_NOTSUP and force the clients to fall
> back, but I figured I would ask if anyone else has seen this.
Yes, I think that is what you will need to do to avoid this.
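Untested, but the shape of that change is just an early bail-out at the top
of nfsrvd_copy_file_range(); check the spelling of the error constant (I
believe it is NFSERR_NOTSUPP in the headers) and how the rest of the
function sets the reply status before using it:

	/*
	 * Near-term workaround sketch: reject Copy so clients fall back
	 * to ordinary READ/WRITE.  Adjust to match the function's usual
	 * nd_repstat handling.
	 */
	nd->nd_repstat = NFSERR_NOTSUPP;
	return (0);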

Thanks for reporting it. I have some work to do; I need to think about how
ZFS's VOP_COPY_FILE_RANGE() can be limited so that it does not "hog" the
server.

rick

>
> -GAWollman
>
>