[Bug 260011] Unresponsive NFS mount on AWS EFS

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 24 Nov 2021 09:08:04 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260011

            Bug ID: 260011
           Summary: Unresponsive NFS mount on AWS EFS
           Product: Base System
           Version: 13.0-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: ale@FreeBSD.org

I'm experiencing annoying issues with an AWS EFS mountpoint on FreeBSD 13 EC2
instances. The filesystem is mounted by 3 instances (2 with the same access
patterns, 1 with a different one).

Initially I had the /etc/fstab entry configured with: 

`rw,nosuid,noatime,bg,nfsv4,minorversion=1,rsize=1048576,wsize=1048576,timeo=600,oneopenown`
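
For reference, the complete fstab entry looked roughly like this (I've replaced the local mountpoint and the EFS hostname with placeholders):

```
# Sketch of the original /etc/fstab entry; /mnt/efs and the EFS hostname are placeholders
fs-xxx.efs.us-east-1.amazonaws.com:/ /mnt/efs nfs rw,nosuid,noatime,bg,nfsv4,minorversion=1,rsize=1048576,wsize=1048576,timeo=600,oneopenown 0 0
```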

After a few days this led my Java application to have all of its threads blocked
in `stat64` kernel calls that never returned, without even the ability to kill -9
the process.

After some digging, this seems to be the normal behavior for hard mount points,
even though I fail to understand why one would prefer to have the system
completely frozen when the NFS mount point is not responding.
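
If it's useful for debugging, I believe the kernel stacks of the stuck threads can be dumped with procstat, along these lines (`<pid>` being the hung Java process):

```
# Dump the kernel stack of every thread of the hung process (replace <pid>);
# threads blocked in the NFS client should show up waiting in the newnfs/RPC code.
procstat -kk <pid>
```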

So I later changed the configuration to:

`rw,nosuid,noatime,bg,nfsv4,minorversion=1,intr,soft,retrans=2,rsize=1048576,wsize=1048576,timeo=600,oneopenown`

by adding `intr,soft,retrans=2`.

By the way, I think there is a typo in mount_nfs(8): it says to set `retrycnt`
instead of `retrans` for the `soft` option. Can you confirm?

After the change `nfsstat -m` reports:
`nfsv4,minorversion=1,oneopenown,tcp,resvport,soft,intr,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2`

I wonder why the timeo, rsize, and wsize values seem to have been ignored, but
this is irrelevant to the issue.
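
My guess, and it is only a guess, is that the client caps rsize/wsize at the vfs.maxbcachebuf tunable (64 KiB by default), which would explain the 65536; if so, something like this in loader.conf should raise the cap, but again it's a side note:

```
# /boot/loader.conf -- guess: raise the NFS client's maximum I/O size (default 64 KiB)
vfs.maxbcachebuf="1048576"
```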

After a few days, though, the application on the two similar EC2 instances
stopped working again. Any command accessing the mounted EFS filesystem (ls, df,
umount, etc.) did not complete in a reasonable time, although this time I could
kill the processes. The only way to recover was to reboot the instances.
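
A forced dismount might have been an alternative to rebooting; something along these lines, although I haven't tried it in this situation and /mnt/efs is just a placeholder for the real mountpoint:

```
# Forcibly dismount the hung NFS filesystem (placeholder path);
# -N is the forced dismount aimed at unresponsive NFS servers, -f the generic forced unmount.
umount -N /mnt/efs || umount -f /mnt/efs
```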

On one of them I saw the following kernel messages, but they were generated only
hours later, when I tried to debug the issue, and only on that one instance, so
I'm not sure whether they are relevant or helpful:

```
kernel: newnfs: server 'fs-xxx.efs.us-east-1.amazonaws.com' error: fileid changed. fsid 0:0: expected fileid 0x4d2369b89a58a920, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
kernel: nfs server fs-xxx.efs.us-east-1.amazonaws.com:/: not responding
```

The third EC2 instance survived and was still able to access the filesystem, but
I think it wasn't actively accessing it when the network/NFS issue that affected
the other two occurred.
