[Bug 260011] Unresponsive NFS mount on AWS EFS

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 24 Nov 2021 15:44:42 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260011

--- Comment #1 from Rick Macklem <rmacklem@FreeBSD.org> ---
Assorted comments:
- FreeBSD 13.0 shipped with a bug in the TCP stack which could
  result in a missed socket receive upcall.
  --> This normally causes problems for the server, but could
      cause problems for the client side as well.
  If you hit this bug, a "netstat -a" on the hung system shows the
  TCP connection for the mount as ESTABLISHED with a non-zero Recv-Q.
  --> The fix is to upgrade to stable/13.
  There is more info on this in PR#254590.
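  For example, something like this (a rough sketch; it assumes the
  server is on the standard NFS port 2049, and the grep pattern is
  just illustrative):
    netstat -an -p tcp | grep 2049
  A stuck connection shows up as ESTABLISHED with a non-zero value
  in the Recv-Q column.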

- AWS uses a small, fixed number of open_owners, which is why the
  "oneopenown" mount option is needed. If it also uses a small, fixed
  number of lock_owners (for byte-range locking, as opposed to Windows
  Open locks), then you could run out of these.
  --> If so, you are SOL (Shit Outa Luck) and my only suggestion
      would be to try an NFSv3 mount with the "nolockd" mount option
      (sketched below).
  "nfsstat -E -c" should show you how many lock_owners are allocated
  at the time of the hang.
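  As a sketch (the EFS hostname and mount point here are made up,
  and whether the server accepts NFSv3 at all is an assumption):
    # current NFSv4.1 mount, one open_owner for the whole mount:
    mount -t nfs -o nfsv4,minorversion=1,oneopenown \
        fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
    # fallback if lock_owners run out: NFSv3, byte-range locks kept local
    mount -t nfs -o nfsv3,nolockd \
        fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs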

Other than that, if AWS has now added support for delegations, there
could be assorted breakage. Not running the nfscbd(8) daemon should
avoid issuing of delegations, if that is the case.
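To check that (a minimal sketch; nfscbd_enable is the standard
rc.conf knob for the daemon):
  # confirm the callback daemon is not running on the client
  service nfscbd onestatus
  # keep it disabled across reboots
  sysrc nfscbd_enable="NO"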

If a mount using the "soft" option fails a syscall, the NFSv4.1
session slot is left in a broken state, and that makes the mount
fail in weird ways.
"umount -N <mnt_path>" is a much better way to deal with hung mounts.

As a starting point, posting the output of:
ps axHl
procstat -kk
netstat -a
nfsstat -E -c
on the client when hung will give us more information.
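Something like this collects it all for attaching to the PR (the
file names are just a suggestion):
  ps axHl > /tmp/ps.out
  procstat -kk > /tmp/procstat.out
  netstat -a > /tmp/netstat.out
  nfsstat -E -c > /tmp/nfsstat.out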

Good luck with it, rick

-- 
You are receiving this mail because:
You are the assignee for the bug.