pNFS server Plan B

Mon Jun 13 22:28:54 UTC 2016

You may have already heard of Plan A, which sort of worked
and which you can test by following the instructions here:

http://people.freebsd.org/~rmacklem/pnfs-setup.txt

However, it is very slow for metadata operations (everything other than
read/write) and I don't think it is very useful.

After my informal talk at BSDCan, here are some thoughts I have:
- I think the slowness is related to latency w.r.t. all the messages
  being passed between the nfsd and GlusterFS via Fuse, and between the
  GlusterFS daemons. As such, I don't think faster hardware is likely
  to help a lot w.r.t. performance.
- I have considered switching to MooseFS, but I would still be using Fuse.
  *** MooseFS uses a centralized metadata store, which would imply only
      a single Metadata Server (MDS) could be supported, I think?
      (More on this later...)
- dfr@ suggested that avoiding Fuse and doing everything in userspace
  might help.
- I thought of porting the nfsd to userland, but that would be quite a
  bit of work, since it uses the kernel VFS/VOP interface, etc.

All of the above has led me to Plan B.
It would be limited to a single MDS, but as you'll see
I'm not sure that is as large a limitation as I thought it would be.
(If you aren't interested in details of this Plan B design, please
 skip to "Single Metadata server..." for the issues.)

Plan B:
- Do it all in kernel using a slightly modified nfsd. (FreeBSD nfsd would
  be used for both the MDS and Data Server (DS).)
- One FreeBSD server running nfsd would be the MDS. It would
  build a file system tree that looks exactly like it would without pNFS,
  except that the files would be empty. (size == 0)
  --> As such, all the current nfsd code would do metadata operations on
      this file system exactly like the nfsd does now.
- When a new file is created (an Open operation for NFSv4.1), the file would
  be created on the MDS exactly like it is now.
  - Then DS(s) would be selected and the MDS would do
    a Create of a data storage file on these DS(s).
    (This algorithm could become interesting later, but initially it would
     probably just pick one DS at random or similar.)
    - These file(s) would be in a single directory on the DS(s) and would have
      a file name which is simply the File Handle for this file on the
      MDS (an FH is 28 bytes -> 56 bytes of hex in ASCII).
  - Extended attributes would be added to the Metadata file for:
    - The data file's actual size.
    - The DS(s) the data file is on.
    - The File Handle for these data files on the DS(s).
  This would add some overhead to the Open/create, which would be one
  Create RPC for each DS the data file is on.
*** Initially there would only be one file on one DS. Mirroring for
    redundancy can be added later.
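
The create path above could be sketched roughly as follows. This is only
an illustration in Python, not nfsd code: MetadataFile, create_on_mds(),
the create_rpc callable and the xattr key names are all made up for the
example, and the DS selection is just the "pick one at random" policy.

```python
import random
from dataclasses import dataclass, field

FH_LEN = 28  # MDS file handle length in bytes, per the text above


def ds_file_name(mds_fh: bytes) -> str:
    """Name of the data storage file: the MDS file handle as ASCII hex."""
    if len(mds_fh) != FH_LEN:
        raise ValueError("expected a 28 byte file handle")
    return mds_fh.hex()


@dataclass
class MetadataFile:
    """Stand-in for the empty (size == 0) file on the MDS."""
    fh: bytes
    xattrs: dict = field(default_factory=dict)


def create_on_mds(mds_fh, ds_list, create_rpc):
    """Create the metadata file, then one data file on a randomly picked DS.

    create_rpc(ds, name) is a hypothetical stand-in for the Create RPC;
    it returns the file handle of the new data file on that DS.
    """
    meta = MetadataFile(fh=mds_fh)
    ds = random.choice(ds_list)       # initial policy: pick one at random
    ds_fh = create_rpc(ds, ds_file_name(mds_fh))
    meta.xattrs = {
        "size": 0,                    # the data file's actual size
        "ds": [ds],                   # the DS(s) the data file is on
        "ds_fh": [ds_fh],             # FH of the data file on each DS
    }
    return meta
```

The lists in the xattrs leave room for more than one DS once mirroring
is added.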

Now, the layout would be generated from these extended attributes for any
NFSv4.1 client that asks for it.
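
As a rough sketch of that step, assuming the extended attributes are
available as a dict with the made-up keys "ds" and "ds_fh" (a real
NFSv4.1 layout carries more than this, such as device information):

```python
def build_layout(xattrs):
    """Return one {ds, fh} entry per copy of the data file.

    xattrs is a hypothetical dict view of the metadata file's extended
    attributes; "ds" and "ds_fh" are parallel lists.
    """
    ds_list = xattrs["ds"]
    fh_list = xattrs["ds_fh"]
    if len(ds_list) != len(fh_list):
        raise ValueError("each DS needs exactly one data file handle")
    return [{"ds": ds, "fh": fh} for ds, fh in zip(ds_list, fh_list)]
```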

If I/O operations (read/write/setattr_of_size) are performed on the Metadata
server, it would act as a proxy and do them on the DS using the extended
attribute information (doing an RPC on the DS for the client).
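
A minimal sketch of the proxy case, assuming a single DS per file.
proxy_io() and the ds_rpc callable are hypothetical names standing in
for whatever the MDS would use to issue the RPC on the client's behalf:

```python
IO_OPS = ("read", "write", "setattr_of_size")


def proxy_io(xattrs, op, ds_rpc, **args):
    """Forward an I/O operation received by the MDS to the DS.

    ds_rpc(ds, ds_fh, op, **args) is a hypothetical stand-in for the
    RPC the MDS does on the DS; its reply is passed back unchanged.
    """
    if op not in IO_OPS:
        raise ValueError("not an I/O operation: " + op)
    # Initially there is only one data file on one DS.
    ds, ds_fh = xattrs["ds"][0], xattrs["ds_fh"][0]
    return ds_rpc(ds, ds_fh, op, **args)
```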

When the file is removed on the Metadata server (link cnt --> 0), the
Metadata server would do Remove RPC(s) on the DS(s) for the data file(s).
(This requires the file name, which is just the Metadata FH in ASCII.)
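
In sketch form, with remove_rpc as a hypothetical stand-in for the
Remove RPC and the xattrs again shown as a dict with a made-up "ds" key:

```python
def remove_data_files(mds_fh, link_count, xattrs, remove_rpc):
    """Remove the data file(s) once the last link on the MDS is gone.

    Returns the number of Remove RPCs issued.
    """
    if link_count > 0:
        return 0                  # other links remain, keep the data
    name = mds_fh.hex()           # the Metadata FH in ASCII hex
    for ds in xattrs["ds"]:
        remove_rpc(ds, name)
    return len(xattrs["ds"])
```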

The only addition that the nfsd for the DS(s) would need would be a callback
to the MDS, done whenever a client (not the MDS) does a write to the file,
notifying the Metadata server that the file has been modified and is now
Size=K, so the Metadata server can keep the attributes up to date for the
file. (It can identify the file by the MDS FH.)
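
On the MDS side, handling that callback might look something like this.
The mds_files map and the "modified" flag are assumptions made for the
sketch, not part of the design above:

```python
def ds_write_callback(mds_files, mds_fh, new_size):
    """Handle a DS -> MDS notification that a client wrote to a data file.

    mds_files is a hypothetical map from MDS file handle to that file's
    extended attributes. Returns True if the file was found and updated.
    """
    xattrs = mds_files.get(mds_fh)
    if xattrs is None:
        return False              # no such file on the MDS (e.g. removed)
    xattrs["size"] = new_size     # Size=K as reported by the DS
    xattrs["modified"] = True     # assumed flag: attributes need a refresh
    return True
```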

All of this is a relatively small amount of change to the FreeBSD nfsd,
so it shouldn't be that much work (I'm a lazy guy looking for a minimal
solution;-).

Single Metadata server...
The big limitation to all of the above is the "single MDS" limitation.
I had thought this would be a serious limitation to the design scaling
up to large stores.
However, I'm not so sure it is a big limitation, for two reasons:
1 - Since the files on the MDS are all empty, the file system is only
    i-nodes, directories and extended attribute blocks.
    As such, I hope it can be put on fast storage.
*** I don't know anything about current and near-term SSD technologies.
    Hopefully others can suggest how large/fast a store for the MDS could
    be built easily?
    --> I am hoping that it will be possible to build an MDS that can handle
        a lot of DS/storage this way?
    (If anyone has access to hardware and something like SpecNFS, they could
     test an RPC load with almost no Read/Write RPCs and this would probably
     show about what the metadata RPC limits are for one of these.)

2 - Although it isn't quite having multiple MDSs, the directory tree could
    be split up with an MDS for each subtree. This would allow some scaling
    beyond one MDS.
    (Although not implemented in FreeBSD's NFSv4.1 yet, Referrals are basically
     an NFS server driven "automount" that redirects the NFSv4.1 client to
     a different server for a subtree. This might be a useful tool for
     splitting off subtrees to different MDSs?)
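
To illustrate the subtree split, here is a small sketch that picks the
MDS for a path by longest matching subtree root. The subtree_map, the
default_mds argument and the server names are invented for the example;
how a referral would actually carry this to the client is not shown.

```python
from pathlib import PurePosixPath


def mds_for_path(path, subtree_map, default_mds):
    """Return the MDS responsible for path.

    subtree_map maps a subtree root (e.g. "/home") to an MDS name. The
    deepest root that contains path wins; otherwise default_mds is used.
    """
    p = PurePosixPath(path)
    best = None
    for root, mds in subtree_map.items():
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            if best is None or len(r.parts) > len(best[0].parts):
                best = (r, mds)
    return best[1] if best else default_mds
```

For example, with {"/home": "mds1", "/home/proj": "mds2"}, a lookup of
/home/proj/a would go to mds2 and /home/x to mds1.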

If you actually read this far, any comments on this would be welcome.
In particular, if you have an opinion w.r.t. this single MDS limitation
and/or how big an MDS could be built, that would be appreciated.

Thanks for any comments, rick