pNFS server Plan B

Rick Macklem rmacklem at uoguelph.ca
Tue Jun 14 22:35:47 UTC 2016


Doug Rabson wrote:
> As I mentioned to Rick, I have been working on similar lines to put
> together a pNFS implementation. Comments embedded below.
> 
> On 13 June 2016 at 23:28, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> >
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> >
> > However, it is very slow for metadata operations (everything other than
> > read/write) and I don't think it is very useful.
> >
> > After my informal talk at BSDCan, here are some thoughts I have:
> > - I think the slowness is related to latency w.r.t. all the messages
> >   being passed between the nfsd and GlusterFS via Fuse, and between the
> >   GlusterFS daemons. As such, I don't think faster hardware is likely
> >   to help a lot w.r.t. performance.
> > - I have considered switching to MooseFS, but I would still be using Fuse.
> >   *** MooseFS uses a centralized metadata store, which would imply only
> >       a single Metadata Server (MDS) could be supported, I think?
> >       (More on this later...)
> > - dfr@ suggested that avoiding Fuse and doing everything in userspace
> >   might help.
> > - I thought of porting the nfsd to userland, but that would be quite a
> >   bit of work, since it uses the kernel VFS/VOP interface, etc.
> >
> 
> I ended up writing everything from scratch as userland code rather than
> consider porting the kernel code. It was quite a bit of work :)
> 
> 
> >
> > All of the above has led me to Plan B.
> > It would be limited to a single MDS, but as you'll see
> > I'm not sure that is as large a limitation as I thought it would be.
> > (If you aren't interested in details of this Plan B design, please
> >  skip to "Single Metadata server..." for the issues.)
> >
> > Plan B:
> > - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would
> >   be used for both the MDS and Data Server (DS).)
> > - One FreeBSD server running nfsd would be the MDS. It would
> >   build a file system tree that looks exactly like it would without pNFS,
> >   except that the files would be empty. (size == 0)
> >   --> As such, all the current nfsd code would do metadata operations on
> >       this file system exactly like the nfsd does now.
> > - When a new file is created (an Open operation on NFSv4.1), the file would
> >   be created exactly like it is now for the MDS.
> >   - Then DS(s) would be selected and the MDS would do
> >     a Create of a data storage file on these DS(s).
> >     (This algorithm could become interesting later, but initially it would
> >      probably just pick one DS at random or similar.)
> >     - These file(s) would be in a single directory on the DS(s) and would
> >       have a file name which is simply the File Handle for this file on
> >       the MDS (an FH is 28 bytes -> 48 bytes of hex in ASCII).
> >
> 
> I have something similar but using a directory hierarchy to try to avoid
> any one directory being excessively large.
> 
I thought of that, but since no one will be doing an "ls" of it, I wasn't going to
bother doing multiple dirs initially. However, now that I think of it, the Create
and Remove RPCs will end up doing VOP_LOOKUP()s, so breaking these up into multiple
directories sounds like a good idea. (I may just hash the FH and let the hash choose
a directory.)

Good suggestion, thanks.
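
Just to make the naming idea concrete, here is a rough userland-style sketch of
how a DS data file path could be formed from the MDS FH: the FH bytes become
the hex ASCII file name and a trivial hash of the FH picks one of a fixed set
of subdirectories. The bucket count, the hash and the function name are just
placeholders, not what the nfsd will actually end up doing.

#include <stdint.h>
#include <stdio.h>

#define NDSDIRS 256     /* assumed number of hash buckets per DS */

/*
 * Build "ds<bucket>/<hex of FH>" for the data file on the DS.  The real
 * nfsd would do this in the kernel; this only illustrates the naming.
 */
static void
ds_datafile_path(const uint8_t *fh, size_t fhlen, char *path, size_t pathlen)
{
        uint32_t hash = 0;
        size_t i, off;

        for (i = 0; i < fhlen; i++)     /* trivial hash over the FH bytes */
                hash = hash * 31 + fh[i];
        off = snprintf(path, pathlen, "ds%02x/", hash % NDSDIRS);
        for (i = 0; i < fhlen && off + 2 < pathlen; i++, off += 2)
                snprintf(path + off, pathlen - off, "%02x", fh[i]);
}

With a couple of hundred buckets, even a very large number of data files keeps
each DS directory small enough that those VOP_LOOKUP()s stay cheap.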

> 
> >   - Extended attributes would be added to the Metadata file for:
> >     - The data file's actual size.
> >     - The DS(s) the data file is on.
> >     - The File Handle for these data files on the DS(s).
> >   This would add some overhead to the Open/create, which would be one
> >   Create RPC for each DS the data file is on.
> >
> 
> An alternative here would be to store the extra metadata in the file itself
> rather than use extended attributes.
> 
Yep. I'm not sure whether there is any performance difference between storing this in the file's data and storing it in extended attributes?
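
For example (just a sketch of the idea, not the eventual implementation), the
per-file state could be packed into one small struct and written as a single
system-namespace extended attribute on the empty metadata file. In the kernel
the nfsd would go through VOP_SETEXTATTR(), but the userland
extattr_set_file(2) call shows the shape of it; the struct layout, the "pnfsd"
attribute name and the 48 byte FH bound are all made up for the illustration.

#include <sys/types.h>
#include <sys/extattr.h>
#include <stdint.h>

#define PNFS_MAXFHLEN   48      /* assumed upper bound on a DS file handle */

struct pnfsd_dsattr {
        uint64_t dsa_size;              /* actual size of the data file */
        uint32_t dsa_dsindex;           /* which DS holds the data file */
        uint16_t dsa_fhlen;             /* length of the DS file handle */
        uint8_t  dsa_fh[PNFS_MAXFHLEN]; /* DS file handle bytes */
};

/* Store the attribute on the metadata file; returns 0 on success. */
static int
store_dsattr(const char *mdpath, const struct pnfsd_dsattr *dsa)
{

        return (extattr_set_file(mdpath, EXTATTR_NAMESPACE_SYSTEM,
            "pnfsd", dsa, sizeof(*dsa)) == (ssize_t)sizeof(*dsa) ? 0 : -1);
}

Keeping it all in one attribute would mean the Open/Create path only adds one
extended attribute write plus the Create RPC(s) on the DS(s).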

> 
> > *** Initially there would only be one file on one DS. Mirroring for
> >     redundancy can be added later.
> >
> 
> The scale of filesystem I want to build more or less requires the extra
> redundancy of mirroring so I added this at the start. It does add quite a
> bit of complexity to the MDS to keep track of which DS should have which
> piece of data and to handle DS failures properly, re-silvering data etc.
> 
> 
> >
> > Now, the layout would be generated from these extended attributes for any
> > NFSv4.1 client that asks for it.
> >
> > If I/O operations (read/write/setattr_of_size) are performed on the
> > Metadata server, it would act as a proxy and do them on the DS using the
> > extended attribute information (doing an RPC on the DS for the client).
> >
> > When the file is removed on the Metadata server (link cnt --> 0), the
> > Metadata server would do Remove RPC(s) on the DS(s) for the data file(s).
> > (This requires the file name, which is just the Metadata FH in ASCII.)
> >
> 
> Currently I have a non-NFS control protocol for this, but strictly speaking
> it isn't necessary, as you note.
> 
> 
> >
> > The only addition that the nfsd for the DS(s) would need would be a
> > callback to the MDS done whenever a client (not the MDS) does a write to
> > the file, notifying the Metadata server the file has been modified and is
> > now Size=K, so the Metadata server can keep the attributes up to date for
> > the file. (It can identify the file by the MDS FH.)
> >
> 
> I don't think you need this - the client should perform LAYOUTCOMMIT RPCs
> which will inform the MDS of the last write position and last modify time.
> This can be used to update the file metadata. The Linux client does this
> before the CLOSE RPC, as far as I can tell.
> 
When I developed the NFSv4.1_Files layout client, I had three servers to test
against.
- The Netapp filer just returned EOPNOTSUPP for LayoutCommit.
- The Linux test server (had MDS and DS on the same Linux system) accepted the
  LayoutCommit, but didn't do anything for it, so doing it had no effect.
- The only pNFS server I've ever tested against that needed LayoutCommit was
  Oracle/Solaris and the Oracle folk never explained why their server required
  it or what would break if you didn't do it. (I don't recall attributes being
  messed up when I didn't do it correctly.)
As such, I've never been sure what it is used for.

I need to read the LayoutCommit stuff in the RFC and Flex Files draft again.
It would be nice to avoid a DS->MDS call for every write; doing one only when
the DS receives a Commit RPC wouldn't be too bad.
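
If dfr@ is right about LayoutCommit (and RFC 5661 does describe a last write
offset being passed for exactly this), the MDS side would only need something
like the little sketch below to keep the file's size up to date. The field and
function names are simplified stand-ins, not actual nfsd code.

#include <stdint.h>

/* Simplified version of the LayoutCommit arguments from RFC 5661. */
struct layoutcommit_args {
        int      newoffset;             /* did the client supply an offset? */
        uint64_t last_write_offset;     /* highest byte written via the layout */
};

/*
 * If the client wrote past the current size through its layout, grow the
 * size recorded on the metadata file to last_write_offset + 1.
 */
static void
mds_layoutcommit_size(const struct layoutcommit_args *lca, uint64_t *mdsize)
{

        if (lca->newoffset && lca->last_write_offset + 1 > *mdsize)
                *mdsize = lca->last_write_offset + 1;
}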

> 
> >
> > All of this is a relatively small amount of change to the FreeBSD nfsd,
> > so it shouldn't be that much work (I'm a lazy guy looking for a minimal
> > solution;-).
> >
> > Single Metadata server...
> > The big limitation to all of the above is the "single MDS" limitation.
> > I had thought this would be a serious limitation to the design scaling
> > up to large stores.
> > However, I'm not so sure it is a big limitation??
> > 1 - Since the files on the MDS are all empty, the file system is only
> >     i-nodes, directories and extended attribute blocks.
> >     As such, I hope it can be put on fast storage.
> > *** I don't know anything about current and near term future SSD
> >     technologies.
> >     Hopefully others can suggest how large/fast a store for the MDS
> >     could be built easily?
> >     --> I am hoping that it will be possible to build an MDS that can
> >         handle a lot of DS/storage this way?
> >     (If anyone has access to hardware and something like SpecNFS, they
> >      could test an RPC load with almost no Read/Write RPCs and this
> >      would probably show about what the metadata RPC limits are for
> >      one of these.)
> >
> 
> I think a single MDS can scale up to petabytes of storage easily. It
> remains to be seen how far it can scale for TPS. I will note that Google's
> GFS filesystem (you can find a paper describing it at
> http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)
> uses effectively a single MDS, replicated for redundancy but still serving
> just from one master MDS at a time. That filesystem scaled pretty well for
> both data size and transactions so I think the approach is viable.
> 
> 
> 
> >
> > 2 - Although it isn't quite having multiple MDSs, the directory tree could
> >     be split up with an MDS for each subtree. This would allow some scaling
> >     beyond one MDS.
> >     (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are
> >      basically an NFS server driven "automount" that redirects the NFSv4.1
> >      client to a different server for a subtree. This might be a useful
> >      tool for splitting off subtrees to different MDSs?)
> >
> > If you actually read this far, any comments on this would be welcome.
> > In particular, if you have an opinion w.r.t. this single MDS limitation
> > and/or how big an MDS could be built, that would be appreciated.
> >
> > Thanks for any comments, rick
> >
> 
> My back-of-envelope calculation assumed a 10 PB filesystem containing
> mostly large files which would be striped in 10 MB pieces. Guessing that we
> need 200 bytes of metadata per piece, that gives around 200 GB of metadata,
> which is very reasonable. Even for file sets containing much smaller files,
> a single server should have no trouble storing the metadata.
> 
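Your arithmetic looks right to me: 10 PB striped in 10 MB pieces is 10^9
pieces and, at 200 bytes each, that is 2 * 10^11 bytes = 200 GB of metadata,
which a single MDS should have no trouble storing.
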
Thanks for all the good comments, rick
ps: Good luck with your pNFS server. Maybe someday it will be available for FreeBSD?

