Re: RFC: NFS over RDMA

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Sat, 01 Nov 2025 21:03:59 UTC
On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
>
> Added Slava Schwartsman.
>
> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> > Hi,
> >
> > I've had NFS over RDMA on my todo list for a very loonnnggg
> > time. I've avoided it because I haven't had a way to test it,
> > but I'm now going to start working on it. (A bunch of this work
> > is already done for NFS-over-TLS which added code for handling
> > M_EXTPG mbufs.)
> >
> > >From RFC-8166, there appears to be 4 operations the krpc
> > needs to do:
> > send-rdma - Send on the payload stream (sending messages that
> >                     are kept in order).
> > recv-rdma - Receive the above.
> > ddp-write - Do a write of DDP data.
> > ddp-read - Do a read of DDP data.
> >
> > So, here is how I see the krpc doing this.
> > An NFS write RPC for example:
> > - The NFS client code packages the Write RPC XDR as follows:
> >   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> >      that precede the write data.
> >   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> >   - 1 or more M_EXTPG mbugs with page(s) loaded with the data to be
> >     written.
> >   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> >
> > This would be passed to the krpc which would...
> >  - the mbufs up to "start of ddp" in the payload stream.
> >  - Would specify a ddp-read for the pages from the M_EXTPG mbufs
> >    and send that in the payload stream.
> >  - send the remaining mbufs/mbuf_clusters in the payload stream
> >
> > The NFS server end would process the received payload stream,
> > putting the non-ddp stuff in mbufs/mbuf_clusters.
> > It would do the ddp-read of the data into anonymous pages it allocates
> > and would associate these with M_EXTPG mbufs.
> > It would put any remaining payload stream stuff for the RPC message in
> > additional mbufs/mbuf_clusters.
> > --> Call the NFS server with the mbuf list for processing.
> >      - When the NFS server gets to the write data (in M_EXTPG mbufs)
> >        it would set up a uio/iovec for the pages and call VOP_WRITE().
> >
> > Now, the above is straightforward for me, since I know the NFS and
> > krpc code fairly well.
> > But that is where my expertise ends.
> >
> > So, what kind of calls do the drivers provide to send and receive
> > what RFC-8166 calls the payload stream?
> >
> > And what kind of calls do the drivers provide to write and read DDP
> > chunks?
> >
> > Also, if the above sounds way off the mark, please let me know.
>
> What you need is, most likely, the infiniband API or KPI to handle
> RDMA.  It is driver-independent, same as for ip NFS you use system IP
> stack and not call to ethernet drivers.  In fact, most likely the
> transport used would be not native IB, but IB over UDP (RoCE v2).
>
> IB verbs, which is the official interface for both kernel and user mode,
> are not well documented.  An overview is provided by the document
> titled "RDMA Aware Networks Programming User Manual", which should
> be google-able.  Otherwise, the Infiniband specication is the reference.
Thanks. I'll look at that. (I notice that the Intel code references something
they call Linux-OpenIB. Hopefully that looks about the same and the
glue needed to support non-Mellanox drivers isn't too difficult?)

Btw, if anyone is interested in taking a more active involvement in this,
they are more than welcome to do so. (I'm going to be starting where I
understand things in the krpc/nfs. I'm not looking forward to porting rxe,
but will probably end up there. I have already had one offer w.r.t. access
to a lab that includes Mellanox hardware, but I don't know if remote
debugging will be practical yet.)

rick

>
> The IB implementation for us is still called OFED for historical reasons,
> and it is located in sys/ofed.
>
> >
> > As for testing, I am planning on hacking away at one of the RDMA
> > in software drivers in Linux to get it working well enough to use for
> > testing. Whatever seems to be easiest to get kinda working.
> Yes rxe driver is the sw RoCE v2 implementation.  We looked at the
> amount of work to port it.  Its size is ~12 kLoC, which is compatible
> with libibverbs (userspace core infiniband interface).
>
> >
> > Anyhow, any comments would be appreciated, rick
> > ps: I did a bunch of cc's trying to get to the people that might know
> >       how the RDMA drivers work and what calls would do the above for
> >       them.
>