Re: nfs hang
- In reply to: Ronald Klop : "Re: nfs hang"
Date: Thu, 13 Nov 2025 23:06:46 UTC
On Thu, Nov 13, 2025 at 9:19 AM Ronald Klop <ronald@freebsd.org> wrote:
>
> On 13-11-2025 at 14:06, Rick Macklem wrote:
> > On Thu, Nov 13, 2025 at 2:45 AM Ronald Klop <ronald@freebsd.org> wrote:
> >>
> >> On 13-11-2025 at 11:41, Ronald Klop wrote:
> >>> Hi,
> >>>
> >>> I have set up nfsd in a jail. It exports ZFS filesystems. The kernel is 16-CURRENT/aarch64. The jails are 14.3-RELEASE.
> >>> $ cat /data/jails/pkg/_root/etc/exports
> >>> V4: / -sec=sys
> >>>
> >>> /usr/local/poudriere/data/logs/bulk -sec=sys -maproot=root
> >>> /usr/local/poudriere/data/packages -sec=sys -maproot=root
> >>>
> >>> /usr/ports -sec=sys
> >>>
> >>> The clients run poudriere in jails.
> >>>
> >>> Now and then I get hanging processes and "unresponsive NFS server" messages.
> >>>
> >>> All NFS threads are in this state:
> >>> [root@rpi4 ~]# procstat -kk 5973
> >>>   PID    TID COMM    TDNAME          KSTACK
> >>>  5973 100541 nfsd    nfsd: master    mi_switch+0x100 sleepq_catch_signals+0x3e4 sleepq_timedwait_sig+0x18 _sleep+0x1a0 clnt_vc_call+0x814 clnt_reconnect_call+0x960 newnfs_request+0xacc nfsrpc_closerpc+0xfc nfscl_tryclose+0x58 nfsrpc_doclose+0x294 nfscl_doclose+0x390 nfsrpc_close+0x28 ncl_inactive+0x14c vop_sigdefer+0x34 vinactivef+0xb8 vput_final+0x1f4 null_reclaim+0x1a0 VOP_RECLAIM_APV+0x20
> >>>  5973 100784 nfsd    nfsd: service   mi_switch+0x100 sleepq_catch_signals+0x3e4 sleepq_timedwait_sig+0x18 _sleep+0x1a0 clnt_vc_call+0x814 clnt_reconnect_call+0x960 newnfs_request+0xacc nfsrpc_closerpc+0xfc nfscl_tryclose+0x58 nfsrpc_doclose+0x294 nfscl_doclose+0x390 nfsrpc_close+0x28 ncl_inactive+0x14c vop_sigdefer+0x34 vinactivef+0xb8 vput_final+0x1f4 null_reclaim+0x1a0 VOP_RECLAIM_APV+0x20
> >>>  5973 100785 nfsd    nfsd: service   mi_switch+0x100 sleepq_catch_signals+0x3e4 sleepq_timedwait_sig+0x18 _sleep+0x1a0 clnt_vc_call+0x814 clnt_reconnect_call+0x960 newnfs_request+0xacc nfsrpc_closerpc+0xfc nfscl_tryclose+0x58 nfsrpc_doclose+0x294 nfscl_doclose+0x390 nfsrpc_close+0x28 ncl_inactive+0x14c vop_sigdefer+0x34 vinactivef+0xb8 vput_final+0x1f4 null_reclaim+0x1a0 VOP_RECLAIM_APV+0x20
> >>>  5973 100786 nfsd    nfsd: service   mi_switch+0x100 sleepq_catch_signals+0x3e4 sleepq_timedwait_sig+0x18 _sleep+0x1a0 clnt_vc_call+0x814 clnt_reconnect_call+0x960 newnfs_request+0xacc nfsrpc_closerpc+0xfc nfscl_tryclose+0x58 nfsrpc_doclose+0x294 nfscl_doclose+0x390 nfsrpc_close+0x28 ncl_inactive+0x14c vop_sigdefer+0x34 vinactivef+0xb8 vput_final+0x1f4 null_reclaim+0x1a0 VOP_RECLAIM_APV+0x20
> >>> ... and a couple more similar lines ...
> >>>
> >>> In rc.conf:
> >>> nfs_server_enable=YES
> >>> mountd_enable=YES
> >>> nfsv4_server_only=YES
> >>> nfs_server_flags="-t"
> >>>
> >>> The filesystems are ZFS legacy mounts in the jail:
> >>> # grep zfs /data/jails/pkg/fstab
> >>> zrpi4/data/poudriere-logs-bulk /data/jails/pkg/_root/usr/local/poudriere/data/logs/bulk zfs rw 0 0
> >>> zrpi4/data/poudriere-packages /data/jails/pkg/_root/usr/local/poudriere/data/packages zfs rw 0 0
> >>> zdata4/ports /data/jails/pkg/_root/usr/ports zfs rw 0 0
> >>>
> >>> Interestingly, I also have a bash process hanging that should not be accessing NFS at the moment:
> >>> # procstat -kk 83175
> >>>   PID    TID COMM    TDNAME          KSTACK
> >>> 83175 111203 bash    -               mi_switch+0x100 sleeplk+0xf8 lockmgr_slock_hard+0x29c _vn_lock+0x50 vget_finish+0x28 cache_fplookup_final_child+0x54 cache_fplookup+0x538 namei+0xd8 kern_statat+0xd4 sys_fstatat+0x2c do_el0_sync+0x6b4 handle_el0_sync+0x4c
> >>>
> >>> Any thoughts?
> >>>
> >>> Regards,
> >>> Ronald.
> >>
> >> Just noticed this on the console:
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid b7c1283e:5166a2de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid 3750dc87:289259de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid 263bc0f2:d39c94de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid b7c1283e:5166a2de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid 3750dc87:289259de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid 263bc0f2:d39c94de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid b7c1283e:5166a2de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid 3750dc87:289259de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid b7c1283e:5166a2de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: server 'pkg.thuis.klop.ws' error: fileid changed. fsid 3750dc87:289259de: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
> >> newnfs: Logged 10 times about fileid corruption; going quiet to avoid spamming logs excessively. (Limit is: 10).
> > Do you have more than one client mounting the file system?
> > If you do, make sure they all have different /etc/hostids.
> > (Cloning a system disk without deleting /etc/hostid can
> > result in multiple clients with the same /etc/hostid. That
> > means they are "the same client" to the NFSv4 server,
> > and that can cause the above.)
> >
> > If this is not the problem, I don't know why you'd see the
> > above, but I suspect it explains the hang.
> >
> > rick
> >
>
> Two clients. Both have different /etc/hostids.
>
> I noticed that the procstat stacks start with "null_reclaim", and poudriere null-mounts the NFS mounts in the poudriere jails.
>
> Could nfs+nullfs give some trouble?
Yes, although I cannot say why. A null_reclaim would mean that the nullfs
vnode is being reclaimed. However, if the underlying NFS vnode is still
Open via some file descriptor, it cannot be NFSv4 Closed at that point.
(An NFSv4 Close cannot happen until all opens by all file descriptors for
the file have been closed, because the NFS client below the VOP_XXX()
layer cannot know which NFSv4 Opens, done by different open_owners that
refer to processes on the client, are being closed.)

You might try the "oneopenown" NFSv4 mount option. That makes the
NFSv4.1/4.2 client use a single NFSv4 Open for all opens of the same
file. (It cannot be done for NFSv4.0, but you are probably using
NFSv4.1/4.2 mounts.)

rick

> Or maybe it is just nullfs that hangs everything, and the NFS stuff is just a result of it.
>
> At the same moment I had git hanging on a non-NFS mount. See the attachment for the procstat output, which also includes NFS calls.
>
> Regards,
> Ronald.
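
For anyone following the /etc/hostid advice quoted above, a minimal check on each
client might look like the following. The reset step assumes the stock rc.d/hostid
script; the safe fallback is to delete the file and reboot.

    # Compare these values across all clients; every client must differ:
    $ cat /etc/hostid
    $ sysctl kern.hostuuid
    # If two clients match (e.g. after cloning a disk), regenerate on one of them:
    # service hostid reset      (or: rm /etc/hostid, then reboot)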
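For reference, "oneopenown" is a client-side option documented in mount_nfs(8). A
sketch of what the clients' mounts might look like, reusing the server name and the
/usr/ports export quoted above (the minor version chosen here is an assumption;
oneopenown only applies to NFSv4.1/4.2):

    # One-off mount on a client:
    # mount -t nfs -o nfsv4,minorversion=2,oneopenown pkg.thuis.klop.ws:/usr/ports /usr/ports
    # Or the equivalent /etc/fstab entry:
    pkg.thuis.klop.ws:/usr/ports /usr/ports nfs rw,nfsv4,minorversion=2,oneopenown 0 0

With the option in effect, "nfsstat -E -c" on the client should show the OpenOwner
count staying at one per mount rather than growing with each opening process.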