pNFS server Plan B

Doug Rabson dfr at rabson.org
Tue Jun 14 08:47:25 UTC 2016


As I mentioned to Rick, I have been working on similar lines to put
together a pNFS implementation. Comments embedded below.

On 13 June 2016 at 23:28, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> You may have already heard of Plan A, which sort of worked
> and you could test by following the instructions here:
>
> http://people.freebsd.org/~rmacklem/pnfs-setup.txt
>
> However, it is very slow for metadata operations (everything other than
> read/write) and I don't think it is very useful.
>
> After my informal talk at BSDCan, here are some thoughts I have:
> - I think the slowness is related to latency w.r.t. all the messages
>   being passed between the nfsd, GlusterFS via Fuse and between the
>   GlusterFS daemons. As such, I don't think faster hardware is likely
>   to help a lot w.r.t. performance.
> - I have considered switching to MooseFS, but I would still be using Fuse.
>   *** MooseFS uses a centralized metadata store, which would imply only
>       a single Metadata Server (MDS) could be supported, I think?
>       (More on this later...)
> - dfr@ suggested that avoiding Fuse and doing everything in userspace
>   might help.
> - I thought of porting the nfsd to userland, but that would be quite a
>   bit of work, since it uses the kernel VFS/VOP interface, etc.
>

I ended up writing everything from scratch as userland code rather than
considering porting the kernel code. It was quite a bit of work :)


>
> All of the above has led me to Plan B.
> It would be limited to a single MDS, but as you'll see
> I'm not sure that is as large a limitation as I thought it would be.
> (If you aren't interested in details of this Plan B design, please
>  skip to "Single Metadata server..." for the issues.)
>
> Plan B:
> - Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would
>   be used for both the MDS and Data Server (DS).)
> - One FreeBSD server running nfsd would be the MDS. It would
>   build a file system tree that looks exactly like it would without pNFS,
>   except that the files would be empty. (size == 0)
>   --> As such, all the current nfsd code would do metadata operations on
>       this file system exactly like the nfsd does now.
> - When a new file is created (an Open operation on NFSv4.1), the file would
>   be created exactly like it is now for the MDS.
>   - Then DS(s) would be selected and the MDS would do
>     a Create of a data storage file on these DS(s).
>     (This algorithm could become interesting later, but initially it would
>      probably just pick one DS at random or similar.)
>     - These file(s) would be in a single directory on the DS(s) and would
>       have a file name which is simply the File Handle for this file on
>       the MDS (an FH is 28bytes->48bytes of Hex in ASCII).
>

I have something similar but using a directory hierarchy to try to avoid
any one directory being excessively large.
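
To make that concrete, here is a minimal sketch of how a DS could map an
MDS file handle to a data-file path, using the first two bytes of the
handle as two levels of directory fan-out. The names and layout are purely
illustrative, not code from either implementation:

/*
 * Hypothetical: hex-encode the MDS file handle and use its first two
 * bytes as "xx/yy/" prefix directories so no single directory on the
 * DS grows excessively large.
 */
#include <stdio.h>
#include <stddef.h>

#define DS_ROOT "/ds/data"

static int
ds_datafile_path(const unsigned char *fh, size_t fhlen,
    char *path, size_t pathlen)
{
    int n;
    size_t off, i;

    if (fhlen < 2)
        return (-1);
    n = snprintf(path, pathlen, "%s/%02x/%02x/", DS_ROOT,
        (unsigned int)fh[0], (unsigned int)fh[1]);
    if (n < 0 || (size_t)n >= pathlen)
        return (-1);
    off = (size_t)n;
    for (i = 0; i < fhlen; i++) {
        if (off + 2 >= pathlen)
            return (-1);
        snprintf(path + off, 3, "%02x", (unsigned int)fh[i]);
        off += 2;
    }
    return (0);
}

A 28-byte FH would then give a path like "/ds/data/ab/cd/<56 hex digits>",
keeping each leaf directory to a manageable size.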


>   - Extended attributes would be added to the Metadata file for:
>     - The data file's actual size.
>     - The DS(s) the data file is on.
>     - The File Handle for these data files on the DS(s).
>   This would add some overhead to the Open/create, which would be one
>   Create RPC for each DS the data file is on.
>

An alternative here would be to store the extra metadata in the file itself
rather than use extended attributes.
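
For what it's worth, FreeBSD's extattr(2) interface would make the
extended-attribute variant fairly simple. A rough sketch (the attribute
name and record layout below are just assumptions for illustration):

/*
 * Hypothetical record the MDS could keep on each metadata file,
 * describing where the data file lives.
 */
#include <sys/types.h>
#include <sys/extattr.h>
#include <stdint.h>

#define PNFS_EA_NAME    "pnfsd.dsfile"  /* assumed attribute name */
#define NFS_FHMAX       128

struct pnfs_dsfile {
    uint64_t      dsf_size;             /* data file's actual size */
    uint32_t      dsf_dsid;             /* which DS holds the data */
    uint32_t      dsf_fhlen;            /* length of the DS file handle */
    unsigned char dsf_fh[NFS_FHMAX];    /* file handle on the DS */
};

/* Record the data file's location on the metadata file at 'mdpath'. */
static int
pnfs_set_dsfile(const char *mdpath, const struct pnfs_dsfile *dsf)
{
    ssize_t n;

    n = extattr_set_file(mdpath, EXTATTR_NAMESPACE_SYSTEM,
        PNFS_EA_NAME, dsf, sizeof(*dsf));
    return (n == (ssize_t)sizeof(*dsf) ? 0 : -1);
}

/* Read it back when building a layout or proxying I/O. */
static int
pnfs_get_dsfile(const char *mdpath, struct pnfs_dsfile *dsf)
{
    ssize_t n;

    n = extattr_get_file(mdpath, EXTATTR_NAMESPACE_SYSTEM,
        PNFS_EA_NAME, dsf, sizeof(*dsf));
    return (n == (ssize_t)sizeof(*dsf) ? 0 : -1);
}

Storing the same record at the front of the (otherwise empty) metadata
file would work too; the trade-off is mostly about which is cheaper to
read on every layout request.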


> *** Initially there would only be one file on one DS. Mirroring for
>     redundancy can be added later.
>

The scale of filesystem I want to build more or less requires the extra
redundancy of mirroring, so I added this at the start. It does add quite a
bit of complexity to the MDS to keep track of which DS should have which
piece of data and to handle DS failures properly, re-silvering data, etc.
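
Roughly, the per-file bookkeeping ends up looking something like the
following (field names and sizes are illustrative only, not my actual
code):

/*
 * Hypothetical per-file mirror map kept by the MDS once data is
 * replicated across data servers.
 */
#include <stdint.h>

#define PNFS_MAXMIRRORS 4

enum ds_state {
    DS_HEALTHY,             /* replica is current */
    DS_FAILED,              /* DS unreachable, replica suspect */
    DS_RESILVERING          /* replica being rebuilt on a new DS */
};

struct pnfs_mirror {
    uint32_t      m_dsid;   /* which data server */
    enum ds_state m_state;  /* health of this replica */
};

struct pnfs_filemap {
    uint64_t           fm_size;                    /* authoritative size */
    uint32_t           fm_nmirrors;                /* active replicas */
    struct pnfs_mirror fm_mirror[PNFS_MAXMIRRORS];
};

Most of the complexity is in keeping fm_mirror consistent while a DS is
down and while re-silvering is in progress, not in the data structure
itself.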


>
> Now, the layout would be generated from these extended attributes for any
> NFSv4.1 client that asks for it.
>
> If I/O operations (read/write/setattr_of_size) are performed on the
> Metadata server, it would act as a proxy and do them on the DS using the
> extended attribute information (doing an RPC on the DS for the client).
>
> When the file is removed on the Metadata server (link cnt --> 0), the
> Metadata server would do Remove RPC(s) on the DS(s) for the data file(s).
> (This requires the file name, which is just the Metadata FH in ASCII.)
>

Currently I have a non-NFS control protocol for this, but strictly speaking
it isn't necessary, as you note.
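
Either way, the MDS-side removal step is small. A hypothetical sketch
(nfsrpc_remove_on_ds() is an illustrative name, not an existing nfsd
function):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Assumed helper: send an NFS Remove for 'name' to data server 'dsid'. */
int nfsrpc_remove_on_ds(uint32_t dsid, const char *name);

/*
 * When the metadata file's link count reaches zero, derive the data
 * file's name (the MDS FH in hex ASCII) and remove it on the DS.
 */
static void
pnfs_remove_datafile(uint32_t dsid, const unsigned char *mdfh, size_t fhlen)
{
    char name[2 * 128 + 1] = "";    /* room for a 128-byte file handle */
    size_t i;

    if (fhlen > 128)
        return;
    for (i = 0; i < fhlen; i++)
        snprintf(name + 2 * i, 3, "%02x", (unsigned int)mdfh[i]);

    (void)nfsrpc_remove_on_ds(dsid, name);
}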


>
> The only addition that the nfsd for the DS(s) would need would be a
> callback to the MDS done whenever a client (not the MDS) does a write to
> the file, notifying the Metadata server the file has been modified and is
> now Size=K, so the Metadata server can keep the attributes up to date for
> the file. (It can identify the file by the MDS FH.)
>

I don't think you need this - the client should perform LAYOUTCOMMIT RPCs,
which inform the MDS of the last write position and last modify time. This
can be used to update the file metadata. As far as I can tell, the Linux
client does this before the CLOSE RPC.
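
A sketch of what the MDS would do with those LAYOUTCOMMIT arguments (the
argument struct and helpers here are illustrative, not the real nfsd
interfaces):

#include <stdint.h>
#include <stdbool.h>
#include <time.h>

struct layoutcommit_args {
    uint64_t        lc_last_write_offset;  /* highest byte offset written */
    bool            lc_newoffset;          /* offset field is valid */
    bool            lc_newtime;            /* mtime field is valid */
    struct timespec lc_mtime;              /* client-suggested mtime */
};

/* Assumed helpers: read/update the size and mtime recorded for the file. */
uint64_t pnfs_get_recorded_size(const char *mdpath);
void     pnfs_set_recorded_size(const char *mdpath, uint64_t size);
void     pnfs_set_mtime(const char *mdpath, const struct timespec *ts);

/*
 * Grow the recorded size if the client wrote past it, and update the
 * modify time if the client supplied one.
 */
static void
pnfs_layoutcommit(const char *mdpath, const struct layoutcommit_args *lc)
{
    uint64_t newsize;

    if (lc->lc_newoffset) {
        newsize = lc->lc_last_write_offset + 1;
        if (newsize > pnfs_get_recorded_size(mdpath))
            pnfs_set_recorded_size(mdpath, newsize);
    }
    if (lc->lc_newtime)
        pnfs_set_mtime(mdpath, &lc->lc_mtime);
}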


>
> All of this is a relatively small amount of change to the FreeBSD nfsd,
> so it shouldn't be that much work (I'm a lazy guy looking for a minimal
> solution;-).
>
> Single Metadata server...
> The big limitation to all of the above is the "single MDS" limitation.
> I had thought this would be a serious limitation to the design scaling
> up to large stores.
> However, I'm not so sure it is a big limitation??
> 1 - Since the files on the MDS are all empty, the file system is only
>     i-nodes, directories and extended attribute blocks.
>     As such, I hope it can be put on fast storage.
> *** I don't know anything about current and near term future SSD
>     technologies.
>     Hopefully others can suggest how large/fast a store for the MDS could
>     be built easily?
>     --> I am hoping that it will be possible to build an MDS that can
>         handle a lot of DS/storage this way?
>     (If anyone has access to hardware and something like SpecNFS, they
>      could test an RPC load with almost no Read/Write RPCs and this would
>      probably show about what the metadata RPC limits are for one of
>      these.)
>

I think a single MDS can scale up to petabytes of storage easily. It
remains to be seen how far it can scale for TPS. I will note that Google's
GFS filesystem (you can find a paper describing it at
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)
effectively uses a single MDS, replicated for redundancy but still serving
from one master at a time. That filesystem scaled pretty well for both
data size and transactions, so I think the approach is viable.



>
> 2 - Although it isn't quite having multiple MDSs, the directory tree could
>     be split up with an MDS for each subtree. This would allow some scaling
>     beyond one MDS.
>     (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are
>      basically an NFS server driven "automount" that redirects the
>      NFSv4.1 client to a different server for a subtree. This might be a
>      useful tool for splitting off subtrees to different MDSs?)
>
> If you actually read this far, any comments on this would be welcome.
> In particular, if you have an opinion w.r.t. this single MDS limitation
> and/or how big an MDS could be built, that would be appreciated.
>
> Thanks for any comments, rick
>

My back-of-envelope calculation assumed a 10 PB filesystem containing
mostly large files which would be striped in 10 MB pieces. Guessing that
we need 200 bytes of metadata per piece, that gives around 200 GB of
metadata, which is very reasonable. Even for file sets containing much
smaller files, a single server should have no trouble storing the metadata.
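
Spelling the arithmetic out (assumed figures: 10 PB of data, 10 MB pieces,
~200 bytes of metadata per piece):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t total    = 10000ULL * 1000 * 1000 * 1000 * 1000; /* 10 PB */
    uint64_t piece    = 10ULL * 1000 * 1000;                  /* 10 MB */
    uint64_t perpiece = 200;                                  /* bytes */

    uint64_t npieces  = total / piece;      /* 1e9 pieces          */
    uint64_t metadata = npieces * perpiece; /* 2e11 bytes = 200 GB */

    printf("%ju pieces, %ju GB of metadata\n",
        (uintmax_t)npieces, (uintmax_t)(metadata / 1000000000));
    return (0);
}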

