[Bug 178231] [nfs] 8.3 nfsv4 client reports "nfsv4 client/server protocol prob err=10026"

From: <bugzilla-noreply_at_freebsd.org>
Date: Sat, 02 Oct 2021 01:39:48 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=178231

Chris Stephan <chris.stephan@live.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |chris.stephan@live.com

--- Comment #5 from Chris Stephan <chris.stephan@live.com> ---
Seeing same logs as OP: nfsv4 client/server protocol prob err=10026

In preparation for migrating our next deployment of production servers to
NFSv4 over TLS, we have built a development environment on 13.0-RELEASE. All
hosts were built from scratch using the 13.0-RELEASE txz's, expanded into
custom ZFS datasets on Supermicro X9 series Intel hardware. NFS clients are
diskless Dell workstations, booting via PXE from /srv/tftpboot, with root on
MFS and the server's NFS shares mounted on /net. MFS includes most of base.txz.
Minor bits that are better suited to tmpfs or NFS have been removed, mainly
from /usr/share/ and /boot. /usr/bin and /usr/sbin are untouched. The build
method is identical to what was done for our 12.2 deployment, apart from the
source now being the 13.0-RELEASE tarball and the use of NFSv4 instead of NFSv3.

## /etc/fstab entries tried:
192.168.10.101:/   /net    nfs    nfsv4,rw,hard,tcp             0 0
192.168.10.101:/   /net    nfs    nfsv4,rw,soft,retrycnt=0,tcp  0 0

All works fine until one of the testers starts Chromium, which, after even
minor browsing, causes the entire window system to freeze. Originally we
thought this was a bug in Chrome or the X server. After trying to isolate it
with DTrace for the last three days, we found we can reproduce the hang by
waiting for the window system to freeze, attaching dtrace to the window manager
(fluxbox), and right-clicking on the desktop to open a menu, which triggers the
open() call to read the menu file from the user's home directory in:

`/net/home/<user>/.fluxbox/menu`
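
For anyone trying to reproduce this, a probe along these lines can be sketched
as below. The exact probes used are not recorded in this comment, so treat this
as an illustration only: the function name trace_opens is hypothetical, it
assumes dtrace is run as root, and it watches both open and openat because
libc may route open() through either syscall.

```shell
# Hypothetical helper: print every path the named process tries to open.
# $1 is the process name to watch; it defaults to fluxbox, the window
# manager named in this report.
trace_opens() {
    prog="${1:-fluxbox}"
    # open(2) carries the path in arg0, openat(2) carries it in arg1.
    dtrace -q \
        -n "syscall::open:entry /execname == \"$prog\"/ { printf(\"open(%s)\\n\", copyinstr(arg0)); }" \
        -n "syscall::openat:entry /execname == \"$prog\"/ { printf(\"openat(%s)\\n\", copyinstr(arg1)); }"
}
```

Running `trace_opens fluxbox` on a spare VT and then right-clicking the desktop
should print the menu path at syscall entry, even when the call itself never
returns.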

NFS never fulfills this request. It locks up the window manager, but I can
switch to another VT and troubleshoot from there. On the NFS server,
/usr/sbin/nfsdumpstate shows there are locks for each of the clients running
Chrome. On the clients, /usr/bin/nfsstat shows thousands of timeouts and
retries, but they have stopped incrementing by the time everything has locked
up. As long as Chrome is not started, the counters in question stay at 0, even
after hours of running. It seems apparent that the NFS client has seized at
this point and cannot recover. Rebooting a client does not clear the locks on
the server. clear_locks does not appear to resolve the server side either (but
I'm not sure clear_locks works with NFSv4).
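
The "stopped incrementing" check on a client can be scripted roughly as
follows. This is a sketch under stated assumptions, not the procedure actually
used here: the function name and the five-second default are arbitrary, and it
relies on the FreeBSD nfsstat flags -e (extended statistics, which include the
NFSv4 counts) and -c (client side only).

```shell
# Hypothetical helper: take two client-side nfsstat snapshots and report
# whether any counter changed in between.
# $1 is the number of seconds to wait between snapshots (default 5).
nfs_counters_moving() {
    interval="${1:-5}"
    before="$(nfsstat -e -c)"
    sleep "$interval"
    after="$(nfsstat -e -c)"
    if [ "$before" = "$after" ]; then
        echo "client counters unchanged over ${interval}s"
    else
        echo "client counters still changing"
    fi
}
```

Calling `nfs_counters_moving 10` from a spare VT after the freeze gives a quick
yes/no answer. Unchanged counters only mean the client issued no new RPCs in
that window; they do not say why.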

Any application, CLI or GUI, that accesses the NFS mount locks up and never
returns. If I try to umount [-f] /net, the command locks the VT. If I try to
read the subtree of /net, that command locks the VT as well.

Our previous setup uses NFSv3+KRB5i/p on 12.2-RELEASE-p10 and works flawlessly. 

We also have tried connecting a lab client from our 12.2 cluster as an NFSv4
client to the 13.0 server, and the same thing happens. I am not willing to
attempt to connect a client via NFSv4 to our 12.2 cluster because I really
don't want to cause some further issue in the event we lock up the server with
active testing going on.

When we change the client mounts to NFSv3, all is well. So this definitely
feels like a bug in the NFSv4 client or server.

Anyway, I'm at the point where I feel like there are smarter folks than I who
might be interested in looking at this. I have a relatively idle cluster ready
for all the testing anyone wants to throw at it.
