Re: NFS 4.2 "RPC struct is bad" revisited (with much more detail)

In reply to: Rick Macklem : "Re: NFS 4.2 "RPC struct is bad" revisited (with much more detail)"
From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Sat, 07 Dec 2024 22:44:22 UTC
On Sat, Dec 7, 2024 at 2:42 PM Rick Macklem <rick.macklem@gmail.com> wrote:
>
> On Mon, Dec 2, 2024 at 12:23 PM J David <j.david.lists@gmail.com> wrote:
> >
> > On Sun, Dec 1, 2024 at 8:03 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > > Well, this indicates the Debian server is broken. A bitmap and associated
> > > attribute values are required for a GETATTR reply of NFS4_OK.
> > > This clearly says they are not there.
> > >
> > > That would result in the client saying the RPC is bad.
> >
> > Even if the response to that isn't "A problem that occurs only with
> > FreeBSD clients is a FreeBSD client problem; it shouldn't do the thing
> > that causes that to happen," it could take quite some time for any
> > change made by the linux-nfs crowd to filter through to reaching a
> > production Debian release.
> >
> > Is there a reasonable way to apply Postel's law here and modify the
> > client to warn on but accept this behavior rather than erroring out in
> > a way that renders the file structure unusable indefinitely?
> Probably not.
>
> First is the question of what failure went on-the-wire:
> (A) - The record mark length for the message was correct, but the
>          message did not have any GETATTR reply data.
> or
> (B) - The record mark length was wrong and the GETATTR reply data
>          came after the end-of-record as indicated by the record mark
>          that precedes each RPC message.
Oh, and although it is not easy for the client to tell whether the
failure is (A) or (B), it can be determined by looking at the packet
trace in Wireshark, as I described.

If you do not want to do this but are willing to provide the pcap file to me,
I can take a look and quickly determine whether it is (A) or (B).
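
For anyone following along, here is a rough Python sketch of how the
record mark delimits RPC messages on a TCP stream, and why case (B)
cannot be recovered from. It is only an illustration of the record
marking described in RFC 5531, section 11 (a 4-byte big-endian header
whose high bit flags the last fragment and whose low 31 bits give the
fragment length). It is not FreeBSD client code, and the names
split_rpc_records and frame are made up for this example.

```python
import struct

LAST_FRAGMENT = 0x80000000  # high bit of the 4-byte record mark


def frame(payload: bytes) -> bytes:
    """Hypothetical helper: wrap a payload as one single-fragment record."""
    return struct.pack(">I", LAST_FRAGMENT | len(payload)) + payload


def split_rpc_records(stream: bytes):
    """Split a reassembled TCP byte stream into RPC records.

    Returns (records, leftover): the complete records found, and the
    trailing bytes that do not (yet) form a complete record.
    """
    records = []
    fragments = []
    pos = 0
    record_start = 0
    while pos + 4 <= len(stream):
        (mark,) = struct.unpack_from(">I", stream, pos)
        length = mark & 0x7FFFFFFF
        if pos + 4 + length > len(stream):
            break  # the fragment claims more bytes than have arrived
        fragments.append(stream[pos + 4 : pos + 4 + length])
        pos += 4 + length
        if mark & LAST_FRAGMENT:
            records.append(b"".join(fragments))
            fragments = []
            record_start = pos
    return records, stream[record_start:]


# Correct record marks: both messages are recovered.
good = frame(b"A" * 8) + frame(b"B" * 12)
print(split_rpc_records(good))

# Case (B): the mark says 8 bytes, but 16 bytes of message follow.
# The reader then treats message bytes as the next record mark, and
# the real second message is never found.
bad = struct.pack(">I", LAST_FRAGMENT | 8) + b"A" * 16 + frame(b"B" * 12)
print(split_rpc_records(bad))
```

In the second call the reader has no way to find where the next real
record mark is, which is why all a client can do is drop the connection.
In case (A) the framing stays intact and only the contents of the one
message are short.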

rick

> If it is (B), the TCP connection is screwed up, since there is no way
> to re-synchronize to the start of the next RPC message. All a client
> can do in this case is create a new TCP connection and retry all
> outstanding RPCs. (Your initial post suggested that this might be
> happening?)
>
> If it is (A), then for the specific case of GETATTR not receiving
> valid data after a READDIR, it might be ok to ignore the failure.
> However, GETATTRs happen a lot and there are many places
> where no reply data is a serious problem. For example, the
> client might not even know what type of file object (regular file,
> directory,...) the object is.
> --> The GETATTR replies are all processed in the same place
>       and, as such, it is not known there that this reply comes
>       after a READDIR.
> If there was one reproducible case where a widely used Linux
> server was known to fail, it might be possible to come up with a
> workaround hack. However, you are the only one reporting this
> problem as far as I can recall and it appears to be intermittent.
> (i.e., it could be that GETATTRs fail to reply with proper data in
> other cases, but it is this case that you captured packets for.)
>
> Finally, why would you assume that putting a fix in the FreeBSD
> client is somehow easier and less logistically time consuming
> than fixing a Linux server?
>
> Note that I hinted at how you might isolate why/how the Linux
> server is broken. In doing so, I did not intend to suggest that
> it was even a software issue. I simply do not know.
> (For example, have you looked hard for any evidence that there
> is a hardware issue w.r.t. that server?)
>
> rick
>
> >
> > Even refusing to cache this response if it is unusable would probably
> > be an improvement.
> >
> > Thanks!