Documentation and debugging for NFSv4

Sat May 23 16:22:51 UTC 2020

Doug and Remy, hello.

Thanks for your additional observations.

On 22 May 2020, at 19:26, Doug McIntyre wrote:

> On Fri, May 22, 2020 at 03:15:01PM +0100, Norman Gray wrote:
>> I'm having difficulty finding consistent documentation and debugging
>> tools for NFSv4.  Is there some handbook-like source that I'm missing?
>> Or some layer of documentation for configuration or debugging that
>> I've failed to find?
>
> I think in general, that NFSv4 is not widely deployed outside of
> heterogeneous linux environments. Given the state of things, I'd imagine
> it is downgraded to NFSv3 more often than not in other use cases of
> mixed OSes.

That doesn't seem to be the case with me.  Even an Ubuntu12 client seems 
happy to make an NFSv4 mount from this server (though I think it's 4.0), 
and I have a CentOS 7.8 machine similarly happy with 4.1.

But it's another CentOS 7.8 client which refuses to make the connection.
(There's an aha below...)

It also turns out that a FreeBSD 11.3 client can't mount this unless 
-overs=4 is explicitly provided on the mount command (this is actually 
explained in the mount_nfs(8) manpage (!), which says that the default 
strategy is to try 3 then 2).

Succeeding:

     ubuntu12# mount -tnfs server:/astro/home /mnt

     centos78@a# mount -tnfs server:/astro/home /mnt

     freebsd113# mount -tnfs -overs=4 server:/astro/norman /mnt
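
For anyone repeating this, here is a minimal sketch of the difference
between the clients.  It only prints the command, it does not run it.
nfs4_mount_cmd is a made-up helper name, and the Linux branch simply
reflects what the Ubuntu and CentOS clients above did without any
version option; I am not claiming that holds for every Linux release.

```shell
#!/bin/sh
# Print (do not run) a mount command that ends up with an NFSv4 mount.
# nfs4_mount_cmd is a hypothetical helper, not part of any OS.
nfs4_mount_cmd() {
    os=$1 remote=$2 mountpoint=$3
    case "$os" in
        FreeBSD)
            # mount_nfs(8) tries v3 then v2 by default, so ask for v4
            echo "mount -t nfs -o vers=4 $remote $mountpoint" ;;
        Linux)
            # the Linux clients above got v4 with no version option
            echo "mount -t nfs $remote $mountpoint" ;;
        *)
            echo "unsupported OS: $os" >&2
            return 1 ;;
    esac
}

# Usage: show the command for the machine you are on
# nfs4_mount_cmd "$(uname -s)" server:/astro/home /mnt
```

After mounting, `nfsstat -m` on the client should show which version
was actually negotiated.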

Aha...

Failing: centos78@b and Ubuntu14@b... aha.  I have FINALLY found some
consistency to the machines which fail: they're all in a different DNS
(sub)domain to the server, though in the same netblocks as the machines
which succeed.

Specifically, they fail during the ls: the client sends a PUTFH and a
READDIR opcode, and the PUTFH succeeds but the READDIR fails, with an
NFS4ERR_NOFILEHANDLE, which seems to suggest that the FH that the PUTFH
sent wasn't saved as the current filehandle (if I'm understanding RFC
3530 for NFSv4.0 and RFC 5661 for v4.1 correctly).  In the case of the
machines which succeed, in subdomain @a, the corresponding NFS request
looks pretty much identical, but the response is successful.  (An odd
thing is that in the _successful_ cases, Wireshark shows a lot of 'TCP
ACKed unseen segment' warnings, but I can't see how this might be
relevant.)

The DNS subdomain is the only consistent difference I can see, but I
can't see how it is relevant to anything:

   * NFS works at the TCP layer, after resolution of DNS names
   * There are no domain names that I can see in the tcpdump traffic
   * The only mentions of domains in RFC 5661 are irrelevant to this.
     The domain names in the context of owners and groups are not, I
     think, relevant.
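
In case it helps anyone look at the same thing, this is roughly how
the traffic can be captured and the status codes read.  It is a sketch
only: nfs4_status_name and capture_nfs are made-up helper names, the
table holds just a few of the codes defined in RFC 3530, and the
capture assumes the server is on the standard NFS port 2049.

```shell
#!/bin/sh
# Translate a few numeric NFSv4 status codes (values from RFC 3530)
# into their names.  Hypothetical helper; the table is far from complete.
nfs4_status_name() {
    case "$1" in
        0)     echo NFS4_OK ;;
        10001) echo NFS4ERR_BADHANDLE ;;
        10014) echo NFS4ERR_FHEXPIRED ;;
        10020) echo NFS4ERR_NOFILEHANDLE ;;
        *)     echo "unknown($1)" ;;
    esac
}

# Write full packets between this host and the server to a file that
# Wireshark can open.  Needs root; assumes NFS on port 2049.
capture_nfs() {
    server=$1 outfile=$2
    tcpdump -s 0 -w "$outfile" host "$server" and port 2049
}

# Usage (as root), then repeat the failing mount and ls elsewhere:
# capture_nfs server /tmp/nfs-fail.pcap
```

Taking one capture on a client in each subdomain and comparing the two
COMPOUND exchanges side by side is how the difference above showed up.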

>> Normally some combination of netstat and tcpdump would make some
>> headway, but SunRPC is blacker magic than that.
>
> NFSv4 is a big change, most implementations I've seen operate over TCP
> instead of UDP whereas TCP was optional in v2 and v3.

As I read RFC 3530, NFSv4 is TCP-only (well, TCP and SCTP), and doesn't 
use UDP at all.

> NFSv4 doesn't need rpc portmapper, nor
> other helper daemons. The IDmapper is a big change as well, no more
> UID passed through, but all UIDs have to be mapped back and forth on
> both sides.

Also from Remy:

> Is nfsuserd running? According to the man page, it is needed for NFSv4 
> to work properly.

I think the string-based user and group information is strictly
optional, though usual/recommended.  nfsv4(4) indicates that the wire
protocol can carry either name strings or strings containing numbers
(cf. RFC 5661, Sect. 5.9).

At any rate, I see the _same_ behaviour both with and without nfsuserd 
running on the server.  When it's running, I see domain names in the 
responses to GETATTR requests, but nowhere in the requests that are 
failing.

This is, however, the best clue so far, since I can see that `nfsidmap -d`
produces different results on the two sets of client machines.  I wonder
if this previously worked by accident, due to a default configuration,
without nfsuserd.  It appears that I'll need to learn more about what
role nfsuserd has in the protocol.
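
As I understand idmapd.conf(5), a Linux client with no Domain setting
falls back to the DNS domain of its own hostname, which would explain
why the two sets of clients report different things.  A small sketch of
that comparison follows; dns_domain_of and same_nfs4_domain are made-up
helper names, and the example.org hostnames in the usage lines are
placeholders rather than the real machines.

```shell
#!/bin/sh
# Strip the host label from a fully qualified name:
#   host.a.example.org -> a.example.org
# Hypothetical helper; prints an empty line for an unqualified name.
dns_domain_of() {
    case "$1" in
        *.*) echo "${1#*.}" ;;
        *)   echo "" ;;
    esac
}

# Succeed if two hosts would share the same default NFSv4 domain,
# assuming both fall back to their DNS domain.
same_nfs4_domain() {
    [ "$(dns_domain_of "$1")" = "$(dns_domain_of "$2")" ]
}

# Usage, with placeholder names:
# same_nfs4_domain client.b.example.org server.a.example.org || echo "domains differ"
# On a Linux client, compare against what the idmapper really uses:
# nfsidmap -d
```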

I wonder if the relevant id is somehow encoded into the (I thought 
opaque) filehandles that are passed back and forth.

> Make sure you use V4 definitions in /etc/exports.  From what I
> remember even connecting as a client needed 'V4: /' in there to
> connect right to a linux NFSv4 server, but I could be misremembering.

That's right.  The presence of the 'V4' in the /etc/exports appears to 
be what enables NFSv4 service (exports(5) seems to vaguely suggest this 
without being explicit).
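
A quick way to check a server for this, as a sketch only: has_v4_line
is a made-up helper name, and it looks for nothing more than a line
beginning with 'V4:', so it says nothing about whether the rest of the
line is right.

```shell
#!/bin/sh
# Succeed if the given exports file has a line starting with 'V4:'.
# Hypothetical helper; the file defaults to /etc/exports.
has_v4_line() {
    grep -q '^V4:' "${1:-/etc/exports}"
}

# Usage:
# has_v4_line || echo "no V4: line, so NFSv4 is probably not being served"
```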

Thanks for your thoughts; any others most welcome.

Best wishes,

Norman

-- 
Norman Gray  :  http://www.astro.gla.ac.uk/users/norman/it/
Research IT Coordinator  :  School of Physics and Astronomy
// My current template week for IT tasks is: Monday, Tuesday, and Friday