[Bug 260011] Unresponsive NFS mount on AWS EFS

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 27 May 2022 00:04:23 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260011

Rick Macklem <rmacklem@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|bugs@FreeBSD.org            |rmacklem@FreeBSD.org

--- Comment #17 from Rick Macklem <rmacklem@FreeBSD.org> ---
Created attachment 234241
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=234241&action=edit
handle bogus slot# replies for the Sequence op

cpercival@ emailed me some diagnostics (which I had not realized
were missing from 13.0) that indicate the Amazon EFS server is
pretty badly broken.
It sometimes (I don't know how frequently) returns the wrong
slotid for a session. (The RFC requires the slotid in the reply
to be the same as the one in the request.)

Once this happens, there is no way to know which slot# the server
actually used.

This patch (which is rather large and, unfortunately, will not apply
to 13.0, but should apply to stable/13 and 13.1, I think?) marks both
of the slots (the one in the request and the one in the reply) bad,
so they will no longer be used.

When all slots get marked "bad", it does a DestroySession operation,
which should make subsequent uses of the session fail with
NFSERR_BADSESSION.
An NFSERR_BADSESSION reply should, in turn, start a recovery cycle
which should create a new session that can be used.
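
As a rough illustration of the logic described above (this is not the
patch itself, which is kernel C code), the slot handling can be modeled
in a few lines of Python. The Session class, its field names and the
handle_sequence_reply() method are all hypothetical, and ignoring an
out-of-range reply slot is an assumption made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy model of a client-side session slot table (hypothetical names)."""
    nslots: int
    bad: set = field(default_factory=set)
    destroyed: bool = False

    def usable_slots(self):
        # Slots that have not been marked bad and may still be used.
        return [s for s in range(self.nslots) if s not in self.bad]

    def handle_sequence_reply(self, req_slot, reply_slot):
        """Return True if the reply's slot matches the request's slot."""
        if reply_slot == req_slot:
            return True
        # Mismatch: there is no way to know which slot the server
        # actually used, so retire both of them.
        self.bad.add(req_slot)
        if 0 <= reply_slot < self.nslots:  # assumption: ignore out-of-range values
            self.bad.add(reply_slot)
        if not self.usable_slots():
            # Stand-in for the DestroySession operation. Later use of the
            # session is then expected to fail with NFSERR_BADSESSION,
            # which should start recovery and create a new session.
            self.destroyed = True
        return False
```

With a 4-slot session, a reply carrying slot 3 for a request on slot 1
retires both slots and leaves slots 0 and 2 in use; the session is only
destroyed once no usable slot remains.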

This patch has been tested against a hacked FreeBSD nfsd that replies
with a bogus slot# once every 100 RPCs and seems to work ok.
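
The fault injection used for that test can be pictured with a small
Python stand-in. This is only a sketch of the idea, not the hacked nfsd:
make_fake_server() and its parameters are hypothetical, and picking
"the next slot" as the bogus value is an arbitrary choice for the
example.

```python
def make_fake_server(every=100, nslots=8):
    """Return a function mapping a request slot to the slot in the reply.

    Every `every`-th reply carries a deliberately wrong slot number.
    """
    if nslots < 2:
        raise ValueError("need at least 2 slots to produce a wrong one")
    count = 0

    def reply_slot(req_slot):
        nonlocal count
        count += 1
        if count % every == 0:
            # Bogus reply: any slot other than the one requested.
            return (req_slot + 1) % nslots
        return req_slot

    return reply_slot


# Drive 1000 simulated RPCs and count how many replies have a bad slot.
server = make_fake_server()
mismatches = sum(1 for i in range(1000) if server(i % 8) != i % 8)
```

A client under test would see one mismatched reply per 100 RPCs, which
is enough to exercise the bad-slot path repeatedly in a short run.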

I have no idea if the Amazon EFS server will behave the same way, but
I am hoping cpercival@ will be able to test it.

I believe this serious bug in the Amazon EFS server would explain
your hangs.
