Re: support for pNFS with Linux as Data Servers

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Tue, 13 May 2025 21:53:46 UTC
On Tue, May 13, 2025 at 4:22 AM David Chen <david.chen@peakaio.com> wrote:
>
> Thanks for the reply!
>
> > I think the hard part may be implementation of the "fencing" that is
> > required for the loosely coupled model. (For what is there now,
> > permission handling is done by the DS(s), since the files on the DS(s)
> > have exactly the same owner/group/mode/ACL as the MDS.)
>
> That makes sense to me! Also I'm not sure what's required but handling
> restarts and recovery looks potentially hairy.
>
> > It has been a while since I read the RFC, so I cannot recall how the
> > loosely coupled model supports permissions (fencing off files that the
> > client does not have permission to access).
>
> If I understand RFC 8435 correctly, the MDS uses the synthetic uid and
> gid to allow access, and unilaterally prevents access by changing the
> owner uid and gid on the DSs.
See below.

>
> > Although it does not explicitly say so in the RFC, you want to use NFSv3
> > RPCs to talk to the DS(s) from the MDS for the loosely coupled variant.
> > (That avoids any stateid hassles. For NFSv4 DSs, the MDS would have to
> > do Opens and keep open_stateids for the DS files.)
>
> It makes sense to me that the MDS should tell clients to use NFSv3 to
> talk to the DSs, e.g. to avoid stateid hassles. And ideally the MDS
> would talk to the DSs using NFSv3 too, at least for the simplicity of
> the DSs only talking NFSv3. But there seems to be plenty of existing
> code where the MDS is using NFSv4 to proxy operations to the DS,
> and making that work with only NFSv3 instead looks non-trivial? So I
> wonder how bad it would be to leave that existing code as NFSv4.
I don't think it will be a lot of work. You'll notice that the RPC functions
in nfs_clrpcops.c mostly handle NFSv2, NFSv3 and NFSv4. Changing
these to handle NFSv3 shouldn't be a big deal. (Admittedly a lot easier
for me to do, since I know how the code needs to be written.)
The one that will look quite different is NFSv3 Create instead of NFSv4
Open/Create.

I do see trying to do loosely coupled to Linux DS servers as a lot of
work since, as I noted before, the MDS needs to manage Open stateids
for all the DS files.

>
> > > The other issue is clients will use the synthetic
> > > uid/gid given by the MDS (currently 999/999), and this results in
> > > access errors when the clients talk to the DSs.
> > The NFSv3 Create RPC that creates the DS file would set it owned
> > by the uid/gid and mode 0600, I think?
>
> I didn't understand you here, but if you don't mind, let's gloss over
> it for now to talk about fencing first...
I haven't read the RFC in a long time. As you noted above, it changes
the file ownership (uid, gid) to fence it off. My vague recollection was
that it changed the mode.
Either way, it's a Setattr of the file on the DS.
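
As a rough illustration of fencing by ownership change, here is a minimal
sketch. It assumes the DS file is created with mode 0600 and owned by the
synthetic uid/gid (999/999, as mentioned in this thread). The names DSFile,
ds_access_ok and fence, and the replacement uid/gid 1000, are hypothetical
and only model the idea. This is not the FreeBSD code.

```python
from dataclasses import dataclass

SYNTH_UID = 999  # synthetic uid handed to clients in the layout (from this thread)
SYNTH_GID = 999  # synthetic gid handed to clients in the layout (from this thread)


@dataclass
class DSFile:
    """Hypothetical model of the attributes of one file on a DS."""
    uid: int
    gid: int
    mode: int = 0o600  # assumption: DS file is created with mode 0600


def ds_access_ok(f, cred_uid, cred_gid, want_write):
    """Simplified owner/group/other check, done by the DS on every I/O RPC."""
    if cred_uid == f.uid:
        bits = (f.mode >> 6) & 7
    elif cred_gid == f.gid:
        bits = (f.mode >> 3) & 7
    else:
        bits = f.mode & 7
    need = 2 if want_write else 4
    return bool(bits & need)


def fence(f, new_uid, new_gid):
    """Stand-in for the MDS doing a Setattr of owner/group on the DS file."""
    f.uid, f.gid = new_uid, new_gid
```

A client still using the old synthetic credentials passes the check before
fence() is called and fails it afterwards, since with mode 0600 only the
owner has any access.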

It's the layout recall that is the hard/slow part. For delegation recall,
the server usually ends up replying NFS4ERR_DELAY multiple times
while waiting for the client(s) to return the delegations.
I think the same will happen here but, again, I haven't read the RFC
in quite a while.

>
> > As I noted, I think "fencing" is where most of the work is.
> > If I recall it correctly, it goes something like this:
> > - Client does a Setattr of owner/group/mode/ACL on the MDS.
> > --> Server must recall all layouts for the file via CB_LAYOUTRECALL
> >      callbacks and reply NFS4ERR_DELAY to the Setattr.
> > --> Sometime later, the client retries the Setattr. If all layouts have been
> >      returned, it is done. If not, the server must either return NFS4ERR_DELAY
> >      again or change the mode on the file on the DS(s) so that clients cannot
> >      access it. I think the MDS must wait at least one lease duration (2min)
> >      after issuing the CB_LAYOUTRECALLs before doing this.
>
> I see a few places where the RFCs mention that fencing should or must
> happen: when a file's permissions or ACLs are changed, a client lease
> expires, there's an admin revoke of a client, a client doesn't respond
> to CB_LAYOUTRECALL, and probably I missed some others.
>
> It makes sense to me that fencing should (and maybe must) happen in
> all of these situations... except for the permissions changing case,
> that doesn't make sense to me yet, and it'd be great if you could
> enlighten me. If the scenario is like:
>
> 1) Some user creates a file with mode 644, writes, closes it
> 2) Someone changes the permissions to 444 or changes the owner
> 3) The MDS doesn't fence the client even though it's supposed to
> 4) The user tries to open the file for writing. The MDS will return an
> access error for this open, even without fencing. So no problem here(?)
The problem is when the client has already opened the file and acquired
a rw layout for it (unlike a POSIX file system, NFS servers check permissions
on every I/O operation).
In that case, the client won't be able to flush cached dirty data to
the DS once the mode/owner/group/ACL have changed.
Basically a CB_LAYOUTRECALL means the client must:
- Block further writing to the file (make all writes fail with EACCES or
  similar).
- Do write(s) to flush all dirty cached data to the DS. (These were already
  done as successful write(2) syscalls.)
- Do a Commit to ensure the DS has the data committed to stable storage.
- Return the layout.
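
The client-side steps above can be sketched roughly as follows. This is a toy
in-memory model, not real client code: ClientFile and its fields are
hypothetical names, and the DS is represented by two plain lists (data
written but not yet committed, and data committed to stable storage).

```python
import errno


class ClientFile:
    """Hypothetical model of one open file on a pNFS client."""

    def __init__(self):
        self.has_layout = True
        self.writes_blocked = False
        self.dirty = []        # accepted by write(2), not yet sent to the DS
        self.ds_unstable = []  # written to the DS, not yet committed
        self.ds_stable = []    # committed to stable storage on the DS

    def write(self, data):
        """Models a write(2) syscall that only dirties the client cache."""
        if self.writes_blocked:
            raise PermissionError(errno.EACCES, "layout is being recalled")
        self.dirty.append(data)

    def handle_layout_recall(self):
        """Models the client's work after a CB_LAYOUTRECALL."""
        self.writes_blocked = True               # 1. block further writes
        self.ds_unstable.extend(self.dirty)      # 2. Write dirty data to the DS
        self.dirty.clear()
        self.ds_stable.extend(self.ds_unstable)  # 3. Commit on the DS
        self.ds_unstable.clear()
        self.has_layout = False                  # 4. return the layout
```

The ordering is the point: the data from write(2) calls that already
succeeded must reach the DS before the layout is returned, which is why the
MDS has to hold off on the Setattr until then.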

While this is happening, the MDS cannot actually do the Setattr of
mode/owner/group/ACL and can only reply NFS4ERR_DELAY.
Once all client(s) have done their layout return, it can safely do the
Setattr.
--> It can also give up on clients that do not do a layout return within
      a lease duration after getting the NFS4_OK reply from the client
      for the CB_LAYOUTRECALL.
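
A minimal sketch of that MDS-side Setattr handling, under some simplifying
assumptions: the lease is 120 seconds (the 2min figure mentioned earlier in
this thread), all CB_LAYOUTRECALL replies arrive at the time of the first
Setattr attempt, and time is passed in as a plain number. MdsFile and its
methods are hypothetical names, not functions from the FreeBSD server.

```python
LEASE_SECS = 120  # assumed lease duration (2min, as mentioned in this thread)


class MdsFile:
    """Hypothetical per-file state kept by the MDS."""

    def __init__(self, layout_holders):
        self.layout_holders = set(layout_holders)  # clients holding a layout
        self.recall_time = None  # when the CB_LAYOUTRECALLs were acknowledged
        self.fenced = set()      # clients that were given up on and fenced
        self.mode = 0o644

    def layout_return(self, client):
        """Models a LayoutReturn from one client."""
        self.layout_holders.discard(client)

    def setattr_mode(self, new_mode, now):
        """Models one Setattr attempt; returns the NFSv4 status name."""
        if self.layout_holders and self.recall_time is None:
            self.recall_time = now              # issue the layout recalls
            return "NFS4ERR_DELAY"
        if self.layout_holders:
            if now - self.recall_time < LEASE_SECS:
                return "NFS4ERR_DELAY"          # still waiting for returns
            self.fenced |= self.layout_holders  # give up: fence them on the DS
            self.layout_holders.clear()
        self.mode = new_mode                    # now safe to do the Setattr
        self.recall_time = None
        return "NFS4_OK"
```

With two layout holders where only one returns its layout, the client's
retries keep getting NFS4ERR_DELAY until a lease duration has passed, at
which point the remaining holder is fenced and the Setattr goes through.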

(Doing this for delegations is one of the messiest things the server
does.)

I can't recall if the CB_LAYOUTRECALL exercise is already done
for the tightly coupled case?

>
> Or if instead the scenario is like:
>
> 1) Some user creates a file with mode 644, writes, does NOT close
> 2) Someone changes the permissions to 444 or changes the owner
> 3) The MDS doesn't fence the client even though it's supposed to
> 4) The user tries to write more to the file. The additional writes are
> successful. This doesn't sound like a problem to me either, as it's
> normal for writes to be successful in this case(?)
>
> There must be some scenario where there's a bad effect unless the MDS
> fences, but I don't know what it is. Or maybe I'm thinking about it
> all wrong.
>
> Thanks!