Re: support for pNFS with Linux as Data Servers
- In reply to: David Chen : "Re: support for pNFS with Linux as Data Servers"
Date: Thu, 22 May 2025 13:41:21 UTC
On Wed, May 21, 2025 at 9:53 PM David Chen <david.chen@peakaio.com> wrote:
>
> > I don't think it will be a lot of work. You'll notice that the RPC functions
> > in nfs_clrpcops.c mostly handle NFSv2, NFSv3 and NFSv4 already, so changing
> > the DS RPCs to use NFSv3 shouldn't be a big deal. (Admittedly a lot easier
> > for me to do, since I know how the code needs to be written.)
> > The one that will look quite different is NFSv3 Create instead of NFSv4
> > Open/Create.
>
> OK, thanks, I'll look into this!
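As an aside, the rough shape of the two calls is below; the field names
are paraphrased from RFC 1813 and RFC 7530 and simplified, not taken
from the FreeBSD sources:

#include <stdint.h>

/* Opaque file handle, as carried on the wire. */
typedef struct {
    uint32_t len;
    uint8_t  data[128];
} opaque_fh;

/* NFSv3 CREATE (RFC 1813): stateless, no stateid in the reply. */
struct v3_create_args {
    opaque_fh   dir;          /* directory file handle */
    const char *name;         /* name to create */
    int         createmode;   /* UNCHECKED, GUARDED or EXCLUSIVE */
    uint32_t    mode;         /* initial mode bits from sattr3 */
};

/* NFSv4 OPEN (RFC 7530): stateful, the reply carries a stateid. */
struct v4_open_args {
    uint32_t    seqid;        /* open-owner sequencing */
    uint32_t    share_access; /* READ, WRITE or BOTH */
    uint32_t    share_deny;
    uint64_t    clientid;     /* plus an opaque owner string */
    int         opentype;     /* OPEN4_NOCREATE or OPEN4_CREATE */
    int         claim;        /* e.g. CLAIM_NULL: dir fh + name */
    opaque_fh   dir;
    const char *name;
};

NFSv3 CREATE is stateless, while NFSv4 OPEN drags in the open-owner and
stateid machinery; that is why that RPC function will end up looking
quite different.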
>
> > I do see trying to do loosely coupled to Linux DS servers as a lot of
> > work since, as I noted before, the MDS needs to manage Open stateids
> > for all the DS files.
>
> Sorry for being dense, but to lay out my understanding so far: today
> with tightly coupled, we use NFSv4 RPCs to talk to the DS(s) from the
> MDS, and since it's tightly coupled, we can use a 0x5555 stateid and
> avoid managing stateids. With loosely coupled, we can't use a 0x5555
> stateid and would need to manage stateids if we continue to use NFSv4
> for this communication. So as you said we should just use NFSv3
> instead, which has no stateids to manage.
>
> Here I think you're talking about the case where an NFS client talks to a
> DS, e.g. to write to a file: if that communication is NFSv4, then the MDS
> must first have told the client what stateid to use (sent as part of the
> layout), and the management of that stateid (which originally came from
> the DS) is complicated. If that's what you're saying, then that makes
> sense to me too. If we avoid NFSv4 when the MDS sends RPCs to the DS(s)
> by using NFSv3 instead, and if we specify only NFSv3 and not NFSv4 in the
> GETDEVICEINFO reply, then would we avoid managing any stateids?
Yes.
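For reference, with the Flexible File layout (RFC 8435) that the FreeBSD
pNFS service uses, this is exactly what the GETDEVICEINFO reply controls.
A sketch, paraphrased from RFC 8435 (the sizes are just example values):

#include <stdbool.h>
#include <stdint.h>

/* One array entry per protocol version the client may use against the
 * DS, returned as ffda_versions in the GETDEVICEINFO reply. */
struct ff_device_versions4 {
    uint32_t ffdv_version;       /* NFS protocol version */
    uint32_t ffdv_minorversion;  /* 0 for NFSv3 */
    uint32_t ffdv_rsize;         /* max read size against the DS */
    uint32_t ffdv_wsize;         /* max write size against the DS */
    bool     ffdv_tightly_coupled;
};

/* Advertise NFSv3 only. */
static const struct ff_device_versions4 ds_versions[] = {
    { 3, 0, 65536, 65536, false },
};

With only a version 3 entry in ffda_versions, the client does all of its
DS I/O over NFSv3 and has no stateids to deal with.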
Note that, for a loosely coupled setup using NFSv4 DSs, the MDS would need
to do Opens on the DSs to get the stateids.
(The 0x5555... cheat only works for FreeBSD NFSv4 DSs and can only be
done for the tightly coupled case, since the permissions are correctly set on
the files on the DSs.)
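To make the difference concrete, here is a rough sketch; the function
name is made up and the exact well-known value is in the FreeBSD
sources, so treat this as the idea only:

#include <stdint.h>
#include <string.h>

/* NFSv4 stateid (RFC 7530). */
typedef struct {
    uint32_t seqid;
    uint8_t  other[12];
} stateid4;

/* Tightly coupled, FreeBSD DSs only: fill in the well-known pattern
 * that a FreeBSD DS knows to accept. */
static void
ds_stateid_tight(stateid4 *st)
{
    memset(st, 0x55, sizeof(*st));
}

/* Loosely coupled, NFSv4 DSs: no cheat. The MDS must OPEN the file on
 * the DS, save the stateid from the reply, keep its seqid current and
 * eventually CLOSE it, for every DS file. */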
>
> > The problem is when the client has already open'd the file and acquired
> > a rw layout for it (unlike a POSIX file system, NFS servers check permissions
> > on every I/O operation).
>
> Ahh, OK, thanks! I didn't realize with NFS the permission must be
> checked on every I/O, I assumed the POSIX behavior.
>
> > I can't recall if the CB_LAYOUTRECALL exercise is already done
> > for the tightly coupled case?
>
> I don't see that it's been done, but I could easily be missing it.
>
> I tried changing permissions when a client already has a file open,
> and got the following bad(?) behavior, using a completely stock
> FreeBSD pNFS server and a completely stock Linux client, but probably
> I made a mistake somewhere in the pNFS configuration or my testing:
>
> I configured pNFS using the instructions in pnfsserver(4). From a
> Linux client with two users ("userone" and "usertwo") both in the
> group "users", I did:
>
> 1) Create a file "testfile" with mode 664, ownership userone:users.
> 2) Open the file for writing as usertwo.
> 3) Change permissions to 644.
> 4) Write to the opened file.
>
> After step 4, the Linux NFS client gets stuck in a loop of WRITE
> (NFS4ERR_ACCESS), LAYOUTERROR (OK), LAYOUTRETURN (OK), 5 second pause,
> LAYOUTGET (OK), repeat. The client seems to be in a bad state at this
> point, e.g. if I unmount and remount the NFS share then the mount
> hangs.
At some point the Linux folk decided to no longer fall back to I/O against
the MDS. I thought that was not the best idea. Mostly they test against
a server that Hammerspace (a storage startup) runs.
A client is in a tough spot when a WRITE fails with EACCES, especially
if it is a delayed write-back. Btw, I think you'll find the above test
problematic for other non-pNFS NFS mounts as well. (The NFS protocol
cannot, by its design, fully support POSIX semantics.)
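If you want to script the above, something like this untested sketch,
run as root in a directory userone can write to (the uids and the group
inheritance are assumptions), should show the same thing on a plain
NFS mount:

#include <err.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define USERONE 1001    /* assumption: uid of userone */
#define USERTWO 1002    /* assumption: uid of usertwo */

int
main(void)
{
    int fd;

    /* 1) userone creates testfile with mode 664. */
    if (seteuid(USERONE) == -1)
        err(1, "seteuid userone");
    if ((fd = open("testfile", O_CREAT | O_WRONLY, 0664)) == -1)
        err(1, "create");
    close(fd);
    if (seteuid(0) == -1)
        err(1, "seteuid 0");

    /* 2) usertwo opens it for writing via the group write bit. */
    if (seteuid(USERTWO) == -1)
        err(1, "seteuid usertwo");
    if ((fd = open("testfile", O_WRONLY)) == -1)
        err(1, "open");
    if (seteuid(0) == -1)
        err(1, "seteuid 0");

    /* 3) userone revokes group write while the file is still open. */
    if (seteuid(USERONE) == -1)
        err(1, "seteuid userone");
    if (chmod("testfile", 0644) == -1)
        err(1, "chmod");
    if (seteuid(0) == -1)
        err(1, "seteuid 0");

    /* 4) usertwo writes to the already-open file. */
    if (seteuid(USERTWO) == -1)
        err(1, "seteuid usertwo");
    if (write(fd, "x", 1) == -1)
        warn("write");
    if (close(fd) == -1)
        warn("close");
    return (0);
}

On a local POSIX file system the write() just succeeds, since the mode
is only checked at open() time; over NFS the server re-checks the
credential on the WRITE RPC, so the error can surface at write(),
fsync() or close().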
>
> If I do the same steps with a FreeBSD client instead of a Linux one, I
> get the expected behavior: the write() returns success even though the
> data is never written, the subsequent close() returns an error, and the
> NFS client stays in a good state.
>
> Probably the Linux client case is supposed to behave the same as the
> FreeBSD client case, instead of getting stuck in a loop, and I've done
> something wrong?
Try a non-pNFS setup and see what the behaviour is then.
>
> In general, I'm confused that, assuming a client is allowed to use the
> same layout for both userone and usertwo in the example above, even if
> the layout is recalled and presumably a new layout issued, I don't see
> how a single layout can result in allowing write access for userone
> but denying access for usertwo. I can see that if all writes are
> directed through the MDS, then the MDS can enforce the access on each
> write, but I assume that would be a transient situation. Basically,
> fencing makes sense to me at the granularity of clients, but I don't
> see how fencing works when the issue at hand is controlling access at
> the granularity of users. I'm probably making more bad assumptions,
> just wish I knew what they are. Thanks!!
Once you are down in the buffer cache, there is no specific user. There
are simply blocks for a file. This usually works, because the above case
is not common. For the tightly coupled case, I/O to the DSs checks
permissions using the owner/mode/ACL exactly like the MDS does, so it
works the same.
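In other words, a tightly coupled DS does something like the following
on each WRITE RPC. This is not the FreeBSD code (the function name is
made up), just the shape of the check, with supplementary groups and
ACLs omitted:

#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Evaluate the requester's credential against the owner/group/mode
 * stored with the DS file, on every WRITE RPC. */
static int
ds_write_access(uid_t cr_uid, gid_t cr_gid, const struct stat *sb)
{
    if (cr_uid == sb->st_uid)
        return ((sb->st_mode & S_IWUSR) ? 0 : EACCES);
    if (cr_gid == sb->st_gid)
        return ((sb->st_mode & S_IWGRP) ? 0 : EACCES);
    return ((sb->st_mode & S_IWOTH) ? 0 : EACCES);
}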
For the loosely coupled case the situation is different, and that is why
loose coupling, with the CB_LAYOUTRECALLs and fencing it requires, will
be challenging to get correct. (I didn't go that way for a reason.)
rick