Re: RFC: NFS over RDMA

From: Konstantin Belousov <kib_at_freebsd.org>
Date: Sat, 01 Nov 2025 21:09:51 UTC
On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
> >
> > Added Slava Schwartsman.
> >
> > On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> > > Hi,
> > >
> > > I've had NFS over RDMA on my todo list for a very loonnnggg
> > > time. I've avoided it because I haven't had a way to test it,
> > > but I'm now going to start working on it. (A bunch of this work
> > > is already done for NFS-over-TLS which added code for handling
> > > M_EXTPG mbufs.)
> > >
> > > >From RFC-8166, there appears to be 4 operations the krpc
> > > needs to do:
> > > send-rdma - Send on the payload stream (sending messages that
> > >                     are kept in order).
> > > recv-rdma - Receive the above.
> > > ddp-write - Do a write of DDP data.
> > > ddp-read - Do a read of DDP data.
> > >
> > > So, here is how I see the krpc doing this.
> > > An NFS write RPC for example:
> > > - The NFS client code packages the Write RPC XDR as follows:
> > >   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> > >      that precede the write data.
> > >   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> > >   - 1 or more M_EXTPG mbugs with page(s) loaded with the data to be
> > >     written.
> > >   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> > >
> > > This would be passed to the krpc which would...
> > >  - the mbufs up to "start of ddp" in the payload stream.
> > >  - Would specify a ddp-read for the pages from the M_EXTPG mbufs
> > >    and send that in the payload stream.
> > >  - send the remaining mbufs/mbuf_clusters in the payload stream
> > >
> > > The NFS server end would process the received payload stream,
> > > putting the non-ddp stuff in mbufs/mbuf_clusters.
> > > It would do the ddp-read of the data into anonymous pages it allocates
> > > and would associate these with M_EXTPG mbufs.
> > > It would put any remaining payload stream stuff for the RPC message in
> > > additional mbufs/mbuf_clusters.
> > > --> Call the NFS server with the mbuf list for processing.
> > >      - When the NFS server gets to the write data (in M_EXTPG mbufs)
> > >        it would set up a uio/iovec for the pages and call VOP_WRITE().
> > >
> > > Now, the above is straightforward for me, since I know the NFS and
> > > krpc code fairly well.
> > > But that is where my expertise ends.
> > >
> > > So, what kind of calls do the drivers provide to send and receive
> > > what RFC-8166 calls the payload stream?
> > >
> > > And what kind of calls do the drivers provide to write and read DDP
> > > chunks?
> > >
> > > Also, if the above sounds way off the mark, please let me know.
> >
> > What you need is, most likely, the infiniband API or KPI to handle
> > RDMA.  It is driver-independent, same as for ip NFS you use system IP
> > stack and not call to ethernet drivers.  In fact, most likely the
> > transport used would be not native IB, but IB over UDP (RoCE v2).
> >
> > IB verbs, which is the official interface for both kernel and user mode,
> > are not well documented.  An overview is provided by the document
> > titled "RDMA Aware Networks Programming User Manual", which should
> > be google-able.  Otherwise, the Infiniband specication is the reference.
> Thanks. I'll look at that. (I notice that the Intel code references something
> they call Linux-OpenIB. Hopefully that looks about the same and the
> glue needed to support non-Mellanox drivers isn't too difficult?)
OpenIB is perhaps the reference to the IB code in Linux kernel proper
plus userspace libraries from rdma-core.  This is what was forked/grown
from OFED.

Intel put efforts into the iWARP, which is sort of alternative for RoCEv2.
It has RFCs and works over TCP AFAIR, which causes problems for it.

> 
> Btw, if anyone is interested in taking a more active involvement in this,
> they are more than welcome to do so. (I'm going to be starting where I
> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
> but will probably end up there. I have already had one offer w.r.t. access
> to a lab that includes Mellanox hardware, but I don't know if remote
> debugging will be practical yet.)
> 
> rick
> 
> >
> > The IB implementation for us is still called OFED for historical reasons,
> > and it is located in sys/ofed.
> >
> > >
> > > As for testing, I am planning on hacking away at one of the RDMA
> > > in software drivers in Linux to get it working well enough to use for
> > > testing. Whatever seems to be easiest to get kinda working.
> > Yes rxe driver is the sw RoCE v2 implementation.  We looked at the
> > amount of work to port it.  Its size is ~12 kLoC, which is compatible
> > with libibverbs (userspace core infiniband interface).
> >
> > >
> > > Anyhow, any comments would be appreciated, rick
> > > ps: I did a bunch of cc's trying to get to the people that might know
> > >       how the RDMA drivers work and what calls would do the above for
> > >       them.
> >