pNFS server Plan B

Rick Macklem rmacklem at uoguelph.ca
Sat Jun 18 23:05:52 UTC 2016


Jordan Hubbard wrote:
> 
> > On Jun 13, 2016, at 3:28 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> > 
> > You may have already heard of Plan A, which sort of worked
> > and you could test by following the instructions here:
> > 
> > http://people.freebsd.org/~rmacklem/pnfs-setup.txt
> > 
> > However, it is very slow for metadata operations (everything other than
> > read/write) and I don't think it is very useful.
> 
I am going to respond to a few of the comments, but I hope that people who
actually run server farms and might be users of a fairly large/inexpensive
storage cluster will comment.

Put another way, I'd really like to hear a "user" perspective.

> Hi guys,
> 
> I finally got a chance to catch up and bring up Rick’s pNFS setup on a couple
> of test machines.  He’s right, obviously - The “plan A” approach is a bit
> convoluted and not at all surprisingly slow.  With all of those transits
> twixt kernel and userland, not to mention glusterfs itself which has not
> really been tuned for our platform (there are a number of papers on this we
> probably haven’t even all read yet), we’re obviously still in the “first
> make it work” stage.
> 
> That said, I think there are probably more possible plans than just A and B
> here, and we should give the broader topic of “what does FreeBSD want to do
> in the Enterprise / Cloud computing space?” at least some consideration at
> the same time, since there are more than a few goals running in parallel
> here.
> 
> First, let’s talk about our story around clustered filesystems + associated
> command-and-control APIs in FreeBSD.  There is something of an embarrassment
> of riches in the industry at the moment - glusterfs, ceph, Hadoop HDFS,
> RiakCS, moose, etc.  All or most of them offer different pros and cons, and
> all offer more than just the ability to store files and scale “elastically”.
> They also have ReST APIs for configuring and monitoring the health of the
> cluster, some offer object as well as file storage, and Riak offers a
> distributed KVS for storing information *about* file objects in addition to
> the objects themselves (and when your application involves storing and
> managing several million photos, for example, the idea of distributing the
> index as well as the files in a fault-tolerant fashion is also compelling).
> Some, if not most, of them are also far better supported under Linux than
> FreeBSD (I don’t think we even have a working ceph port yet).   I’m not
> saying we need to blindly follow the herds and do all the same things others
> are doing here, either, I’m just saying that it’s a much bigger problem
> space than simply “parallelizing NFS” and if we can kill multiple birds with
> one stone on the way to doing that, we should certainly consider doing so.
> 
> Why?  Because pNFS was first introduced as a draft RFC (RFC5661
> <https://datatracker.ietf.org/doc/rfc5661/>) in 2005.  The linux folks have
> been working on it
> <http://events.linuxfoundation.org/sites/events/files/slides/pnfs.pdf> since
> 2006.  Ten years is a long time in this business, and when I raised the
> topic of pNFS at the recent SNIA DSI conference (where storage developers
> gather to talk about trends and things), the most prevalent reaction I got
> was “people are still using pNFS?!”
Actually, I would have worded this as "will anyone ever use pNFS?".

Although 10 years is a long time in this business, it doesn't seem to be long
at all in the standards world where the NFSv4 protocols are being developed.
- You note that the Linux folk started development in 2006.
  I will note that RFC5661 (the RFC that describes pNFS) is dated 2010.
  I will also note that, as far as I know, the first vendor to ship a server that
  supported pNFS did so sometime after the RFC was published.
  - I could be wrong, but I'd guess that Netapp's clustered Filers were the
    first to ship, about 4 years ago.

To date, very few vendors have actually shipped working pNFS servers,
as far as I am aware. Other than Netapp, the only ones I know of that have shipped
are the large EMC servers (not Isilon).
I am not sure whether Oracle/Solaris has ever shipped a pNFS server to customers.
Same goes for Panasas. I am not aware of a Linux-based pNFS server usable in
a production environment, although Ganesha-NFS might be shipping with pNFS support now.
- If others are aware of other pNFS servers that are shipping to customers,
  please correct me. (I haven't been to a NFSv4.1 testing event for 3 years,
  so my info is definitely dated.)

Note that the "Flex Files" layout I used for the Plan A experiment is only an
Internet draft at this time and hasn't even made it to the RFC stage.

--> As such, I think it is very much an open question whether this protocol
    will become widely used or will end up as yet another forgotten standard.
    I also suspect that some storage vendors that have invested considerable
    resources in NFSv4.1/pNFS development might ask the same question in-house;-)

>   This is clearly one of those
> technologies that may still have some runway left, but it’s been rapidly
> overtaken by other approaches to solving more or less the same problems in
> coherent, distributed filesystem access and if we want to get mindshare for
> this, we should at least have an answer ready for the “why did you guys do
> pNFS that way rather than just shimming it on top of ${someNewerHotness}??”
> argument.   I’m not suggesting pNFS is dead - hell, even AFS
> <https://www.openafs.org/> still appears to be somewhat alive, but there’s a
> difference between appealing to an increasingly narrow niche and trying to
> solve the sorts of problems most DevOps folks working At Scale these days
> are running into.
> 
> That is also why I am not sure I would totally embrace the idea of a central
> MDS being a Real Option.  Sure, the risks can be mitigated (as you say, by
> mirroring it), but even saying the words “central MDS” (or central anything)
> may be such a turn-off to those very same DevOps folks, folks who have been
> burned so many times by SPOFs and scaling bottlenecks in large environments,
> that we'll lose the audience the minute they hear the trigger phrase.  Even
> if it means signing up for Other Problems later, it’s a lot easier to “sell”
> the concept of completely distributed mechanisms where, if there is any
> notion of centralization at all, it’s at least the result of a quorum
> election and the DevOps folks don’t have to do anything manually to cause it
> to happen - the cluster is “resilient" and "self-healing" and they are happy
> with being able to say those buzzwords to the CIO, who nods knowingly and
> tells them they’re doing a fine job!
> 
I'll admit that I'm a bits-and-bytes guy. I have a hunch about how difficult it is
to get "resilient" and "self-healing" to really work. I also know it is way
beyond what I am capable of.

> Let’s get back, however, to the notion of downing multiple avians with the
> same semi-spherical kinetic projectile:  What seems to be The Rage at the
> moment, and I don’t know how well it actually scales since I’ve yet to be at
> the pointy end of such a real-world deployment, is the idea of clustering
> the storage (“somehow”) underneath and then providing NFS and SMB protocol
> access entirely in userland, usually with both of those services cooperating
> with the same lock manager and even the same ACL translation layer.  Our
> buddies at Red Hat do this with glusterfs at the bottom and NFS Ganesha +
> Samba on top - I talked to one of the Samba core team guys at SNIA and he
> indicated that this was increasingly common, with the team having helped
> here and there when approached by different vendors with the same idea.   We
> (iXsystems) also get a lot of requests to be able to make the same file(s)
> available via both NFS and SMB at the same time and they don’t much at all
> like being told “but that’s dangerous - don’t do that!  Your file contents
> and permissions models are not guaranteed to survive such an experience!”
> They really want to do it, because the rest of the world lives in
> Heterogenous environments and that’s just the way it is.
> 
If you want to make SMB and NFS work together on the same underlying file systems,
I suspect it is doable, although messy. To do this with the current FreeBSD nfsd,
it would require someone with Samba/Windows knowledge pointing out what Samba
needs in order to interact with NFSv4, and those hooks could probably then be implemented.
(I know nothing about Samba/Windows, so I'd need someone else doing that side
 of it.)
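
Just to make this concrete, here is a rough sketch of what exporting the same
directory over both protocols on a FreeBSD box might look like (the dataset
name /tank/shared and the network numbers are made up for illustration, and
this says nothing about the lock/ACL coordination that is the hard part):

  # /etc/exports fragment for the kernel nfsd (NFSv4 root plus one export)
  V4: /tank
  /tank/shared -network 192.168.1.0 -mask 255.255.255.0

  # smb.conf fragment exporting the same directory via Samba
  [shared]
      path = /tank/shared
      read only = no

Getting the two daemons to agree on byte-range locks, share reservations and
ACL semantics is the part that would need the Samba-side expertise mentioned
above.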

I actually mentioned Ganesha-NFS at the little talk/discussion I gave.
At this time, they have ripped the FreeBSD port out of their sources and they
use Linux-specific thread primitives.
--> It would probably be significant work to get Ganesha-NFS up to speed on
    FreeBSD. Maybe a good project, but it needs some person/group dedicating
    resources to get it to happen.
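
For anyone curious what the Ganesha side looks like, a GlusterFS volume is
exported with a config block roughly like the one below (this is from memory,
and the volume name "gv0" is just an example, so treat it as a sketch rather
than a working configuration):

  EXPORT
  {
      Export_Id = 1;
      Path = "/gv0";
      Pseudo = "/gv0";
      Access_Type = RW;
      FSAL {
          Name = GLUSTER;
          Hostname = "localhost";
          Volume = "gv0";
      }
  }

The config itself is backend-agnostic; the porting work would be in the daemon
(thread primitives and the like), not in anything at this level.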

> Even the object storage folks, like Openstack’s Swift project, are spending
> significant amounts of mental energy on the topic of how to re-export their
> object stores as shared filesystems over NFS and SMB, the single consistent
> and distributed object store being, of course, Their Thing.  They wish, of
> course, that the rest of the world would just fall into line and use their
> object system for everything, but they also get that the "legacy stuff” just
> won’t go away and needs some sort of attention if they’re to remain players
> at the standards table.
> 
> So anyway, that’s the view I have from the perspective of someone who
> actually sells storage solutions for a living, and while I could certainly
> “sell some pNFS” to various customers who just want to add a dash of
> steroids to their current NFS infrastructure, or need to use NFS but also
> need to store far more data into a single namespace than any one box will
> accommodate, I also know that offering even more elastic solutions will be a
> necessary part of offering solutions to the growing contingent of folks who
> are not tied to any existing storage infrastructure and have various
> non-greybearded folks shouting in their ears about object this and cloud
> that.  Might there not be some compromise solution which allows us to put
> more of this in userland with less context switches in and out of the
> kernel, also giving us the option of presenting a more united front to
> multiple protocols that require more ACL and lock impedance-matching than
> we’d ever want to put in the kernel anyway?
> 
For SMB + NFS in userland, the combination of Samba and Ganesha is probably
your main open source choice, as far as I am aware.

I am one guy who does this as a spare-time retirement hobby. As such, doing
something like a Ganesha port, etc. is probably beyond what I am interested in.
When saying this, I don't want to imply that it isn't a good approach.

You sent me the URL for an abstract for a paper discussing how Facebook is
using GlusterFS. It would be nice to get more details w.r.t. how they use it,
such as:
- How do their client servers access it? (NFS, FUSE, or ???; see the rough mount sketch below)
- Whether or not they've tried the Ganesha-NFS stuff that GlusterFS is
  transitioning to?
Put another way, they might have some insight into whether NFS in userland
via Ganesha works well or not.
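
For reference, the two access paths in the first question usually look roughly
like this on a Linux client (the volume name "gv0" and host "server1" are made
up; Gluster's built-in NFS service speaks NFSv3, as far as I know):

  # FUSE-based native Gluster client
  mount -t glusterfs server1:/gv0 /mnt/gv0

  # Gluster's built-in (pre-Ganesha) NFS service
  mount -t nfs -o vers=3,nolock server1:/gv0 /mnt/gv0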

Hopefully some "users" of this stuff will respond, rick
ps: Maybe this could be reposted in a place they are likely to read it.

> - Jordan
> 
> 
> 
> 

