RFC: using ceph as a backend for an NFSv4.1 pNFS server

Tue Apr 29 00:38:24 UTC 2014

Ivan Voras wrote:
> On 26/04/2014 21:47, Rick Macklem wrote:
> 
> > Any other comments w.r.t. this would be appreciated, including
> > generic stuff like "we couldn't care less about pNFS" or technical
> > details/opinions.
> > 
> > Thanks in advance for any feedback, rick
> > ps: I'm nowhere near committing to do this at this point and
> >     I do realize that even completing the ceph port to FreeBSD
> >     might be beyond my limited resources.
> 
> What functionality from ceph would pNFS really need? Would pNFS need to
> be implemented with a single back-end storage like ceph or could it be
> modular? (I don't have much experience here but it looks like HDFS is
> becoming popular for some big-data applications).
> 
> 
Well, I doubt I can answer this, but here is a simple summary of what
a pNFS server does:
- The NFSv4.1/pNFS server (sometimes called a metadata server or MDS)
  handles all the normal NFS stuff including reads/writes of the files.
  However, it can also hand out layouts, which tell the client where
  to read/write the file on another data server (DS).
  - There are RFCs to describe 3 ways the client can read/write data
    on a DS.
    1 - File Layout, where the client uses a subset of NFSv4.1 (read/write +
        enough others to use them).
    2 - Block/volume, where the client uses iSCSI to read/write blocks for
        the file's data.
    3 - Object, where the object storage commands are used over iSCSI.
I think you can see that any of these requires a lot of work to be done
"behind the curtains" so that the MDS can know where the file's
data lives (and it can be striped across multiple DSs, etc).
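
To make the idea of a layout a bit more concrete, here is a minimal
sketch in Python of the kind of information an MDS might hand to a
client. It is only an illustration: LayoutType, Layout and
ds_for_offset are hypothetical names, and the fixed-size round-robin
striping is an assumption, not something taken from the RFCs or from
any existing server.

```python
from dataclasses import dataclass
from enum import Enum


class LayoutType(Enum):
    # The three ways a client can read/write data on a DS, as listed above.
    FILE = "file"      # subset of NFSv4.1 (read/write + a few others)
    BLOCK = "block"    # block/volume access over iSCSI
    OBJECT = "object"  # object storage commands over iSCSI


@dataclass(frozen=True)
class Layout:
    """Hypothetical layout an MDS might hand out for one file."""
    layout_type: LayoutType
    stripe_unit: int     # bytes per stripe unit (assumed fixed)
    data_servers: tuple  # DS names, in stripe order (assumed round-robin)

    def ds_for_offset(self, offset: int) -> str:
        """Return the DS that holds the byte at the given file offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        stripe_index = offset // self.stripe_unit
        return self.data_servers[stripe_index % len(self.data_servers)]


# Example: a file striped in 64K units across three data servers.
layout = Layout(LayoutType.FILE, 65536, ("ds1", "ds2", "ds3"))
```

With this layout, a client would send I/O for offset 0 to ds1, for
offset 65536 to ds2, and would wrap around to ds1 again at offset
196608. The hard part is everything the sketch leaves out: how the MDS
learns which DSs actually hold the data.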

To implement this "from the ground up" is way beyond my limited time/resources
(and expertise).

I hope that I can find an open source cluster file system that handles
most of the "behind the curtains" stuff so that all the NFSv4.1 server
needs to do is "ask the cluster file system where the file/object's data
lives" and generate a layout from that. (I'm basically looking for a
path of least work.;-) Exactly what is needed from the cluster fs
isn't obvious to me at this time (and depends on layout type) but
here are some thoughts:
- where the file's data lives and the info needed for the layout
  so the client can read and write the file's data at the DS.
- when the file's data location changes, so it can recall the stale
  layout
- allowing the file to grow without the MDS having to do anything,
  when the client writes to the DS (the MDS needs to have a way to
  find out the current size of the file)
- allow the DSs to be built easily, using FreeBSD and the cluster
  file system tools (ideally using underlying FreeBSD file systems
  like ZFS to avoid "yet another" file system)
There are probably a lot more of these.
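
The first three items above could be pictured as a small interface
between the MDS and the cluster file system. The following is a rough
Python sketch under stated assumptions: ClusterBackend, FakeBackend,
MetadataServer and all of their methods are hypothetical names invented
for illustration. None of this is the ceph API or the FreeBSD NFS
server code, and the in-memory backend only stands in for a real
cluster fs.

```python
from abc import ABC, abstractmethod


class ClusterBackend(ABC):
    """Hypothetical: what the MDS would need to ask a cluster fs."""

    @abstractmethod
    def locate(self, file_id):
        """Return the list of DSs currently holding the file's data."""

    @abstractmethod
    def size(self, file_id):
        """Return the current file size, including growth via DS writes."""

    @abstractmethod
    def on_relocate(self, callback):
        """Register callback(file_id), called when the data location changes."""


class FakeBackend(ClusterBackend):
    """In-memory stand-in for a cluster fs, only to show the call flow."""

    def __init__(self):
        self._locations = {}
        self._sizes = {}
        self._callbacks = []

    def locate(self, file_id):
        return list(self._locations[file_id])

    def size(self, file_id):
        return self._sizes.get(file_id, 0)

    def on_relocate(self, callback):
        self._callbacks.append(callback)

    def place(self, file_id, servers):
        """Place (or move) a file's data; notify listeners only on a move."""
        moved = file_id in self._locations
        self._locations[file_id] = list(servers)
        if moved:
            for callback in self._callbacks:
                callback(file_id)

    def ds_write(self, file_id, offset, length):
        """A client write at a DS grows the file without involving the MDS."""
        end = offset + length
        self._sizes[file_id] = max(self._sizes.get(file_id, 0), end)


class MetadataServer:
    """Hands out layouts built from backend answers; recalls stale ones."""

    def __init__(self, backend):
        self.backend = backend
        self.outstanding = {}  # file_id -> layout handed to a client
        self.recalled = []     # file_ids whose layouts were recalled
        backend.on_relocate(self._recall)

    def get_layout(self, file_id):
        layout = {"file": file_id,
                  "data_servers": self.backend.locate(file_id)}
        self.outstanding[file_id] = layout
        return layout

    def _recall(self, file_id):
        # Only layouts that were actually handed out need recalling.
        if self.outstanding.pop(file_id, None) is not None:
            self.recalled.append(file_id)


backend = FakeBackend()
mds = MetadataServer(backend)
backend.place("fileA", ["ds1", "ds2"])
first_layout = mds.get_layout("fileA")
backend.ds_write("fileA", 4096, 100)  # file grows behind the MDS's back
backend.place("fileA", ["ds3"])       # data moves, so the layout is stale
```

In this sketch the MDS never tracks data placement itself. It asks the
backend when a layout is requested, asks again for the size when it
needs it, and relies on the backend's notification to know when to
recall a layout. Whether a given cluster file system can offer
anything like these three calls is exactly the open question.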

My hunch is that doing this for even one cluster file system will be
at/beyond my time/resource limits. I also suspect these cluster file
systems are different enough that each would be a lot of effort,
even ignoring the fact that none of them are ported to FreeBSD.

I'd also like to avoid porting a file system into FreeBSD. What I
do like about ceph (and gluster is similar, I think?) is that they
are layered on top of a regular file system, so they can use ZFS
for the actual storage handling.

rick