pNFS server Plan B

Rick Macklem rmacklem at uoguelph.ca
Sun Jun 19 23:29:26 UTC 2016


Jordan Hubbard wrote:
> 
> > On Jun 18, 2016, at 6:14 PM, Chris Watson <bsdunix44 at gmail.com> wrote:
> > 
> > Since Jordan brought up clustering, I would be interested to hear Justin
> > Gibbs thoughts here. I know about a year ago he was asked on an "after
> > hours" video chat hosted by Matt Aherns about a feature he would really
> > like to see and he mentioned he would really like, in a universe filled
> > with time and money I'm sure, to work on a native clustering solution for
> > FreeBSD. I don't know if he is subscribed to the list, and I'm certainly
> > not throwing him under the bus by bringing his name up, but I know he has
> > at least been thinking about this for some time and probably has some
> > value to add here.
> 
> I think we should also be careful to define our terms in such a discussion.
> Specifically:
> 
> 1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or
> ${somethingElse}) or otherwise incorporated into ZFS itself at some low
> level?  If you Google for “High-availability ZFS” you will encounter things
> like RSF-1 or the somewhat more mysterious Zetavault
> (http://www.zeta.systems/zetavault/high-availability/) but it’s not entirely
> clear how these technologies work, they simply claim to “scale-out ZFS” or
> “cluster ZFS” (which can be done within ZFS or one level above and still
> probably pass the Marketing Test for what people are willing to put on a web
> page).
> 
> 2. Are we talking about clustering at a slightly higher level, in a
> filesystem-agnostic fashion which still preserves filesystem semantics?
> 
> 3. Are we talking about clustering for data objects, in a fashion which does
> not necessarily provide filesystem semantics (a sharding database which can
> store arbitrary BLOBs would qualify)?
> 
For the pNFS use case I am looking at, I would say #2.

I suspect #1 sits at a low enough level that redirecting I/O via the pNFS layouts
isn't practical, since ZFS is taking care of block allocations, etc.

I see #3 as a separate problem space, since NFS deals with files and not objects.
However, GlusterFS maps objects on top of its POSIX-like FS (what glusterfs.org calls
SwiftonFile, I think?), so I suppose the same could be done at the client end.
It is also possible to go the other way and map POSIX files onto objects, but that
sounds like more work and would need to be done underneath the NFS service.
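
To make "mapping objects onto files" concrete: it basically means deriving a pathname
in the exported tree from the object's name. A rough sketch of the idea in C follows
(the directory layout and hashing are made up for illustration; this is not how
SwiftonFile actually does it):

#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical example only: store object "bucket/key" as a regular file
 * under an exported POSIX tree, hashing the key into two directory levels
 * so that no single directory grows without bound.
 */
static unsigned int
obj_hash(const char *key)
{
        unsigned int h = 2166136261u;   /* FNV-1a, just as an example */

        while (*key != '\0')
                h = (h ^ (unsigned char)*key++) * 16777619u;
        return (h);
}

static void
obj_to_path(const char *bucket, const char *key, char *path, size_t len)
{
        unsigned int h = obj_hash(key);

        /* e.g. key "photo1.jpg" might land at /export/mybucket/3a/7c/photo1.jpg */
        snprintf(path, len, "/export/%s/%02x/%02x/%s",
            bucket, (h >> 8) & 0xff, h & 0xff, key);
}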

> For all of the above:  Are we seeking to be compatible with any other
> mechanisms, or are we talking about a FreeBSD-only solution?
> 
> This is why I brought up glusterfs / ceph / RiakCS in my previous comments -
> when talking to the $users that Rick wants to involve in the discussion,
> they rarely come to the table asking for “some or any sort of clustering,
> don’t care which or how it works” - they ask if I can offer an S3 compatible
> object store with horizontal scaling, or if they can use NFS in some
> clustered fashion where there’s a single namespace offering petabytes of
> storage with configurable redundancy such that no portion of that namespace
> is ever unavailable.
> 
I tend to think of this last case as the target for any pNFS server. The basic
idea is to redirect the I/O operations to wherever the data is actually stored,
so that I/O performance doesn't degrade with scale.
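
To illustrate, here is a much-simplified sketch of what a layout conveys, loosely
patterned on the NFSv4.1 file layout (not the actual nfsd data structures). Once the
MDS has granted the layout, the client can work out on its own which data server gets
a given read/write, so the MDS stays out of the data path:

#include <stdint.h>

/* Much-simplified stand-in for a pNFS file layout. */
struct ex_layout {
        uint32_t        stripe_unit;    /* bytes per stripe */
        uint32_t        ds_count;       /* number of data servers */
        /* plus a filehandle and net address per data server */
};

/* Which data server should an I/O at "offset" be sent to? */
static uint32_t
ds_for_offset(const struct ex_layout *lo, uint64_t offset)
{
        return ((uint32_t)((offset / lo->stripe_unit) % lo->ds_count));
}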

If redundancy is a necessary feature, then maybe Plan A is preferable to Plan B,
since GlusterFS does provide for redundancy and resilvering of lost copies, at
least from my understanding of the docs on gluster.org.

I'd also like to see how GlusterFS performs on a typical Linux setup.
Even without having the nfsd use FUSE, accessing GlusterFS via FUSE already crosses
user (syscall on the mount) --> kernel --> user (glusterfs client daemon) within the client
machine, if I understand how GlusterFS works. Then the glusterfsd daemon on the brick server
does file system syscall(s) to get at the actual file on the underlying FS (xfs or ZFS or ...).
As such, there are already a lot of user<->kernel boundary crossings.
I wonder how much delay the extra nfsd step adds for metadata?
- I can't say much about performance of Plan A yet, but metadata operations are slow
  and latency seems to be the issue. (I actually seem to get better performance by
  disabling SMP, for example.)
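
For the metadata latency question, even something as crude as timing a pile of stat()
calls against each mount (GlusterFS via FUSE directly vs. the same volume re-exported
through the nfsd) should show the difference. You would want attribute caching disabled
on the mounts, or to stat a different file each pass, so the calls actually go to the
server. A rough sketch (the default path below is just an example):

#include <sys/stat.h>
#include <sys/time.h>
#include <stdio.h>

/*
 * Crude metadata latency test: stat() a file repeatedly and report the
 * average time per call.
 */
int
main(int argc, char **argv)
{
        struct timeval start, end;
        struct stat sb;
        const char *path = (argc > 1) ? argv[1] : "/mnt/gluster/testfile";
        double usecs;
        int i, iters = 10000;

        gettimeofday(&start, NULL);
        for (i = 0; i < iters; i++) {
                if (stat(path, &sb) == -1) {
                        perror("stat");
                        return (1);
                }
        }
        gettimeofday(&end, NULL);
        usecs = (end.tv_sec - start.tv_sec) * 1000000.0 +
            (end.tv_usec - start.tv_usec);
        printf("%d stat() calls, %.1f us/call\n", iters, usecs / iters);
        return (0);
}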

> I’d be interested in what Justin had in mind when he asked Matt about this.
> Being able to “attach ZFS pools to one another” in such a fashion that all
> clients just see One Big Pool and ZFS’s own redundancy / snapshotting
> characteristics magically apply to the überpool would be Pretty Cool,
> obviously, and would allow one to do round-robin DNS for NFS such that any
> node could serve the same contents, but that also sounds pretty ambitious,
> depending on how it’s implemented.
> 
This would probably work with the extant nfsd and wouldn't need pNFS at all.
I also agree that it sounds pretty ambitious.

rick

> - Jordan
> 
> 

