Re: nfs stalls client: nfsrv_cache_session: no session

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Sat, 16 Jul 2022 13:43:11 UTC
Peter <pmc@citylink.dinoex.sub.org> wrote:
> Hija,
>  I have a problem with NFSv4:
>
> The configuration:
>   Server Rel. 13.1-RC2
>     nfs_server_enable="YES"
>     nfs_server_flags="-u -t --minthreads 2 --maxthreads 20 -h ..."
Allowing it to go down to 2 threads is very low. I've never even
tried to run a server with fewer than 4 threads. Since kernel threads
don't generate much overhead, I'd suggest replacing the
minthreads/maxthreads with "-n 32" for a very small server.
(I didn't write the code that allows the number of threads to vary,
 and I never use it either.)
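
For illustration, the rc.conf line with that change might look like the
sketch below. It assumes the other flags from the quoted configuration
stay as they are; the "-h ..." argument is elided in the original and is
kept elided here.

```shell
# /etc/rc.conf (sketch): a fixed pool of 32 nfsd threads replaces
# "--minthreads 2 --maxthreads 20"; the "-h ..." value stays elided
nfs_server_flags="-u -t -n 32 -h ..."
```

nfsd has to be restarted before the new flags take effect.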

>     mountd_enable="YES"
>     mountd_flags="-S -p 803 -h ..."
>     rpc_lockd_enable="YES"
>     rpc_lockd_flags="-h ..."
>     rpc_statd_enable="YES"
>     rpc_statd_flags="-h ..."
>     rpcbind_enable="YES"
>     rpcbind_flags="-h ..."
>     nfsv4_server_enable="YES"
>     sysctl vfs.nfs.enable_uidtostring=1
>     sysctl vfs.nfsd.enable_stringtouid=1
> 
>   Client bhyve Rel. 13.1-RELEASE on the same system
>     nfs_client_enable="YES"
>     nfs_access_cache="600"
>     nfs_bufpackets="32"
>     nfscbd_enable="YES"
> 
>   Mount-options: nfsv4,readahead=1,rw,async
I would expect the behaviour you are seeing for "intr" and/or "soft"
mounts, but since you are not using those, I don't know how the
session got broken. (10052 is NFSERR_BADSESSION.)
You might want to do "nfsstat -m" on the client to see what options
were actually negotiated for the mount and then check that neither
"soft" nor "intr" is there.
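
One way to automate that check is sketched below. The function name
check_mount_opts is made up for this example, and the sketch assumes
that "nfsstat -m" lists the mount options as a comma-separated line,
so treat the pattern as a starting point rather than a definitive test.

```shell
# check_mount_opts (hypothetical helper): reads "nfsstat -m" output on
# stdin and reports whether "soft" or "intr" appears as a whole option.
check_mount_opts() {
    # Match soft/intr only when delimited by line start/end, space or comma.
    if grep -Eq '(^|[ ,])(soft|intr)([ ,]|$)'; then
        echo "soft and/or intr in use"
    else
        echo "no soft/intr"
    fi
}
```

On the client it would be run as: nfsstat -m | check_mount_opts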

I also suspect that the recovery thread in the client (called "nfscl")
is somehow wedged and cannot recover from the bad session.
A "ps axHl" on the client would be useful to see what the
processes/threads are up to on the client when it is hung.

If increasing the number of nfsd threads in the server doesn't resolve
the problem, I'd guess it is some network weirdness caused by how
the bhyve instance is networked to its host. (I always use bridging
for bhyve instances and do NFS mounts, but I don't work those
mounts hard.)

Btw, "umount -N <mnt_path>" on the client will normally get rid
of a hung mount, although it can take a couple of minutes to complete.

rick


Access to the share suddenly stalled. Server reports this in messages,
every second:
   nfsrv_cache_session: no session IPaddr=192.168...

Restarting nfsd and mountd didn't help, except that the client now
also started to report in messages, every second:
   nfs server 192.168...:/var/sysup/mnt/tmp.6.56160: is alive again

Mounting the same share anew to a different place works fine.

The network babble is this, every second:
   NFS request xid 1678997001 212 getattr fh 0,6/2
   NFS reply xid 1678997001 reply ok 52 getattr ERROR: unk 10052

Forensics: I tried to build OpenOffice on that share a couple of
   times. So there was a bit of traffic, and some things may have
   overflowed.

There seems to be no way to recover, short of crashing the client.