Re: NFS 4.2 "RPC struct is bad" revisited (with much more detail)

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Sat, 07 Dec 2024 22:42:05 UTC

On Mon, Dec 2, 2024 at 12:23 PM J David <j.david.lists@gmail.com> wrote:
>
> On Sun, Dec 1, 2024 at 8:03 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > Well, this indicates the Debian server is broken. A bitmap and associated
> > attribute values are required for a GETATTR reply of NFS4_OK.
> > This clearly says they are not there.
> >
> > That would result in the client saying the RPC is bad.
>
> Even if the response to that isn't "A problem that occurs only with
> FreeBSD clients is a FreeBSD client problem; it shouldn't do the thing
> that causes that to happen," it could take quite some time for any
> change made by the linux-nfs crowd to filter through to reaching a
> production Debian release.
>
> Is there a reasonable way to apply Postel's law here and modify the
> client to warn on but accept this behavior rather than erroring out in
> a way that renders the file structure unusable indefinitely?
Probably not.

First is the question of what failure went on-the-wire:
(A) - The record mark length for the message was correct, but the
         message did not have any GETATTR reply data.
or
(B) - The record mark length was wrong and the GETATTR reply data
         came after the end-of-record as indicated by the record mark
         that precedes each RPC message.
If it is (B), the TCP connection is screwed up, since there is no way
to re-synchronize to the start of the next RPC message. All a client
can do in this case is create a new TCP connection and retry all
outstanding RPCs. (Your initial post suggested that this might be
happening?)
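
To illustrate why (B) cannot be recovered from, here is a minimal Python
sketch of RPC record marking over TCP: each fragment is preceded by a
4-byte big-endian mark whose high bit flags the last fragment and whose
low 31 bits give the fragment length. The names split_records and frame
are made up for this sketch; it is not the FreeBSD client code, and the
payloads are placeholder bytes, not real NFS messages.

```python
import struct

LAST_FRAGMENT = 0x80000000  # high bit of the 4-byte record mark


def frame(payload: bytes, claimed_len=None) -> bytes:
    """Prefix payload with a record mark (hypothetical helper).

    claimed_len lets the sketch simulate a server that sends a wrong length.
    """
    length = len(payload) if claimed_len is None else claimed_len
    return struct.pack(">I", LAST_FRAGMENT | length) + payload


def split_records(stream: bytes):
    """Split a TCP byte stream into RPC messages using the record marks.

    Returns (complete_records, unparsed_tail).
    """
    records, fragments = [], []
    offset = record_start = 0
    while offset + 4 <= len(stream):
        (mark,) = struct.unpack_from(">I", stream, offset)
        length = mark & 0x7FFFFFFF
        start = offset + 4
        if start + length > len(stream):
            break  # fragment not fully received (or the mark is garbage)
        fragments.append(stream[start:start + length])
        offset = start + length
        if mark & LAST_FRAGMENT:
            records.append(b"".join(fragments))
            fragments = []
            record_start = offset
    return records, stream[record_start:]


# Correct marks: both messages are recovered.
good = frame(b"AAAAAAAA") + frame(b"BBBB")
good_records, good_tail = split_records(good)

# Case (B): the mark claims 8 bytes but 12 were sent. The 4 extra bytes
# are then read as the next record mark, and framing is lost for good.
bad = frame(b"AAAAAAAA" + b"XXXX", claimed_len=8) + frame(b"BBBB")
bad_records, bad_tail = split_records(bad)
```

In the second case the receiver treats "XXXX" as a record mark with a
huge length, so the following message is never seen as a message. Nothing
in the byte stream marks where the next RPC really starts.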

If it is (A), then for the specific case of GETATTR not receiving
valid data after a READDIR, it might be ok to ignore the failure.
However, GETATTRs happen a lot and there are many places
where no reply data is a serious problem. For example, the
client might not even know what type of file object (regular file,
directory,...) the object is.
--> The GETATTR replies are all processed in the same place
    and, as such, it is not known that this reply comes after a
    READDIR.
If there was one reproducible case where a widely used Linux
server was known to fail, it might be possible to come up with a
workaround hack. However, you are the only one reporting this
problem as far as I can recall and it appears to be intermittent.
(i.e., it could be that GETATTRs fail to reply with proper data in
other cases too, but this is the case you captured packets for.)
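
For reference, a successful GETATTR result carries an fattr4, which in
XDR is a bitmap4 (a counted array of 32-bit words) followed by an
attrlist4 (counted opaque bytes). The following is a minimal Python
sketch of decoding that pair and of what an empty reply looks like to a
decoder. decode_fattr4 is a hypothetical name and this is not how the
FreeBSD client is written; the sample bytes are hand-built, not taken
from the captured packets.

```python
import struct


def decode_fattr4(body: bytes):
    """Decode (attrmask words, attr_vals bytes) from XDR-encoded fattr4."""

    def u32(off: int) -> int:
        if off + 4 > len(body):
            raise ValueError("truncated XDR data")
        return struct.unpack_from(">I", body, off)[0]

    count = u32(0)               # number of 32-bit bitmap words
    off = 4
    bitmap = []
    for _ in range(count):
        bitmap.append(u32(off))
        off += 4
    vals_len = u32(off)          # byte length of the attribute values
    off += 4
    padded = (vals_len + 3) & ~3  # XDR opaque data is padded to 4 bytes
    if off + padded > len(body):
        raise ValueError("truncated attribute values")
    return bitmap, body[off:off + vals_len]


# A reply asking only for the object type: bit 1 of word 0 (FATTR4_TYPE),
# with the value 2 (NF4DIR, a directory).
ok_body = (struct.pack(">I", 1) + struct.pack(">I", 0x2)
           + struct.pack(">I", 4) + struct.pack(">I", 2))
bitmap, vals = decode_fattr4(ok_body)

# Case (A): status NFS4_OK but no bitmap or values follow.
try:
    decode_fattr4(b"")
    empty_result = "decoded"
except ValueError as err:
    empty_result = str(err)
```

With the empty body there is no bitmap to say which attributes are
present, so the decoder has nothing to hand back, not even the file type.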

Finally, why would you assume that putting a fix in the FreeBSD
client is somehow easier and less logistically time-consuming
than fixing a Linux server?

Note that I hinted at how you might isolate why/how the Linux
server is broken. In doing so, I did not intend to suggest that
it was even a software issue. I simply do not know.
(For example, have you looked hard for any evidence that there
is a hardware issue w.r.t. that server?)

rick

>
> Even refusing to cache this response if it is unusable would probably
> be an improvement.
>
> Thanks!