Re: RFC: NFS over RDMA

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Tue, 04 Nov 2025 06:10:46 UTC
On Mon, Nov 3, 2025 at 6:35 AM John Baldwin <jhb@freebsd.org> wrote:
>
> On 11/1/25 17:26, Rick Macklem wrote:
> > On Sat, Nov 1, 2025 at 2:10 PM Konstantin Belousov <kib@freebsd.org> wrote:
> >>
> >> On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> >>> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
> >>>>
> >>>> Added Slava Schwartsman.
> >>>>
> >>>> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I've had NFS over RDMA on my todo list for a very loonnnggg
> >>>>> time. I've avoided it because I haven't had a way to test it,
> >>>>> but I'm now going to start working on it. (A bunch of this work
> >>>>> is already done for NFS-over-TLS which added code for handling
> >>>>> M_EXTPG mbufs.)
> >>>>>
> >>>>> From RFC-8166, there appear to be 4 operations the krpc
> >>>>> needs to do:
> >>>>> send-rdma - Send on the payload stream (sending messages that
> >>>>>                      are kept in order).
> >>>>> recv-rdma - Receive the above.
> >>>>> ddp-write - Do a write of DDP data.
> >>>>> ddp-read - Do a read of DDP data.
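
After kib's pointer to sys/ofed, here is my first rough guess at how those
four operations might map onto the in-kernel ib verbs KPI. This is a sketch
only: memory registration, completion handling and error paths are omitted,
I haven't verified the exact signatures against the tree (the const-ness of
the ib_post_send() arguments seems to differ between OFED versions), and the
rpcrdma_* names are just placeholders.

#include <sys/param.h>
#include <sys/systm.h>
#include <rdma/ib_verbs.h>

/* send-rdma: post a SEND carrying the payload stream. */
static int
rpcrdma_send(struct ib_qp *qp, struct ib_sge *sgl, int nsge)
{
        struct ib_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.opcode = IB_WR_SEND;
        wr.sg_list = sgl;
        wr.num_sge = nsge;
        wr.send_flags = IB_SEND_SIGNALED;
        return (ib_post_send(qp, &wr, &bad_wr));
}

/* recv-rdma: post a buffer to receive the peer's payload stream. */
static int
rpcrdma_recv(struct ib_qp *qp, struct ib_sge *sgl, int nsge)
{
        struct ib_recv_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.sg_list = sgl;
        wr.num_sge = nsge;
        return (ib_post_recv(qp, &wr, &bad_wr));
}

/*
 * ddp-read/ddp-write: RDMA READ or WRITE against a chunk the peer
 * advertised (remote address + rkey from the RPC-over-RDMA header).
 */
static int
rpcrdma_ddp(struct ib_qp *qp, struct ib_sge *sgl, int nsge,
    uint64_t raddr, uint32_t rkey, bool ddp_write)
{
        struct ib_rdma_wr wr;
        struct ib_send_wr *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.wr.opcode = ddp_write ? IB_WR_RDMA_WRITE : IB_WR_RDMA_READ;
        wr.wr.sg_list = sgl;
        wr.wr.num_sge = nsge;
        wr.wr.send_flags = IB_SEND_SIGNALED;
        wr.remote_addr = raddr;
        wr.rkey = rkey;
        return (ib_post_send(qp, &wr.wr, &bad_wr));
}
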
> >>>>>
> >>>>> So, here is how I see the krpc doing this.
> >>>>> An NFS write RPC for example:
> >>>>> - The NFS client code packages the Write RPC XDR as follows:
> >>>>>    - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> >>>>>       that precede the write data.
> >>>>>    - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> >>>>>    - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> >>>>>      written.
> >>>>>    - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> >>>>>
> >>>>> This would be passed to the krpc, which would:
> >>>>>   - send the mbufs up to "start of ddp-read" in the payload stream.
> >>>>>   - specify a ddp-read for the pages from the M_EXTPG mbufs
> >>>>>     and send that in the payload stream.
> >>>>>   - send the remaining mbufs/mbuf_clusters in the payload stream.
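
To make the "start of ddp-read" marker above concrete, something like the
following is what I have in mind: a zero-length mbuf flagged M_PROTO1 in
front of the M_EXTPG data, with the krpc splitting the chain at it. The
M_PROTO1 convention and the function names are just placeholders, and
splitting off any trailing XDR after the M_EXTPG run is left out.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>

/* NFS client side: marker mbuf placed just before the M_EXTPG data. */
static struct mbuf *
nfsm_ddp_marker(void)
{
        struct mbuf *m;

        m = m_get(M_WAITOK, MT_DATA);
        m->m_len = 0;
        m->m_flags |= M_PROTO1;         /* "start of ddp-read" */
        return (m);
}

/*
 * krpc side: split the request chain at the marker.  On return, *mp
 * holds the XDR that precedes the marker and the returned chain is
 * the M_EXTPG data to be moved via a ddp-read chunk.
 */
static struct mbuf *
krpc_split_at_ddp(struct mbuf **mp)
{
        struct mbuf *m, *prev, *ddp;

        prev = NULL;
        for (m = *mp; m != NULL; m = m->m_next) {
                if ((m->m_flags & M_PROTO1) != 0)
                        break;
                prev = m;
        }
        if (m == NULL)
                return (NULL);          /* no DDP-eligible data */
        ddp = m->m_next;                /* M_EXTPG mbufs follow the marker */
        if (prev != NULL)
                prev->m_next = NULL;
        else
                *mp = NULL;
        m_free(m);                      /* done with the marker itself */
        return (ddp);
}
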
> >>>>>
> >>>>> The NFS server end would process the received payload stream,
> >>>>> putting the non-ddp stuff in mbufs/mbuf_clusters.
> >>>>> It would do the ddp-read of the data into anonymous pages it allocates
> >>>>> and would associate these with M_EXTPG mbufs.
> >>>>> It would put any remaining payload stream stuff for the RPC message in
> >>>>> additional mbufs/mbuf_clusters.
> >>>>> --> Call the NFS server with the mbuf list for processing.
> >>>>>       - When the NFS server gets to the write data (in M_EXTPG mbufs)
> >>>>>         it would set up a uio/iovec for the pages and call VOP_WRITE().
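
And for the M_EXTPG to VOP_WRITE() step on the server, roughly this. It
assumes a direct-map architecture for PHYS_TO_DMAP(), a single M_EXTPG mbuf
just to show the idea, and that the caller has the vnode locked; the
function name is a placeholder.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <sys/uio.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/pmap.h>

static int
nfsrv_write_extpg(struct vnode *vp, struct mbuf *m, off_t off,
    struct ucred *cred, struct thread *td)
{
        struct iovec iv[MBUF_PEXT_MAX_PGS];
        struct uio uio;
        ssize_t resid;
        int i, len, pgoff;

        KASSERT((m->m_flags & M_EXTPG) != 0, ("not an M_EXTPG mbuf"));
        resid = 0;
        for (i = 0; i < m->m_epg_npgs; i++) {
                /* First page may start at an offset, last may be short. */
                pgoff = (i == 0) ? m->m_epg_1st_off : 0;
                len = (i == m->m_epg_npgs - 1) ? m->m_epg_last_len :
                    PAGE_SIZE - pgoff;
                iv[i].iov_base = (char *)PHYS_TO_DMAP(m->m_epg_pa[i]) + pgoff;
                iv[i].iov_len = len;
                resid += len;
        }
        uio.uio_iov = iv;
        uio.uio_iovcnt = m->m_epg_npgs;
        uio.uio_offset = off;
        uio.uio_resid = resid;
        uio.uio_segflg = UIO_SYSSPACE;
        uio.uio_rw = UIO_WRITE;
        uio.uio_td = td;
        /* Vnode is assumed locked by the caller. */
        return (VOP_WRITE(vp, &uio, IO_NODELOCKED, cred));
}
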
> >>>>>
> >>>>> Now, the above is straightforward for me, since I know the NFS and
> >>>>> krpc code fairly well.
> >>>>> But that is where my expertise ends.
> >>>>>
> >>>>> So, what kind of calls do the drivers provide to send and receive
> >>>>> what RFC-8166 calls the payload stream?
> >>>>>
> >>>>> And what kind of calls do the drivers provide to write and read DDP
> >>>>> chunks?
> >>>>>
> >>>>> Also, if the above sounds way off the mark, please let me know.
> >>>>
> >>>> What you need is, most likely, the InfiniBand API or KPI to handle
> >>>> RDMA.  It is driver-independent, in the same way that NFS over IP uses
> >>>> the system IP stack rather than calling the ethernet drivers directly.
> >>>> In fact, the transport used would most likely not be native IB, but
> >>>> IB over UDP (RoCE v2).
> >>>>
> >>>> IB verbs, which is the official interface for both kernel and user mode,
> >>>> are not well documented.  An overview is provided by the document
> >>>> titled "RDMA Aware Networks Programming User Manual", which should
> >>>> be google-able.  Otherwise, the InfiniBand specification is the reference.
> >>> Thanks. I'll look at that. (I notice that the Intel code references something
> >>> they call Linux-OpenIB. Hopefully that looks about the same and the
> >>> glue needed to support non-Mellanox drivers isn't too difficult?)
> >> OpenIB most likely refers to the IB code in the Linux kernel proper
> >> plus the userspace libraries from rdma-core.  This is what was forked/grown
> >> from OFED.
> >>
> >> Intel put its efforts into iWARP, which is a sort of alternative to RoCEv2.
> >> It has RFCs and works over TCP AFAIR, which causes problems for it.
> > Heh, heh. I'm trying to avoid the iWARP vs RoCE wars.;-)
> > (I did see a Mellanox white paper with graphs showing how RoCE outperforms
> > iWARP.)
> > Intel currently claims to support RoCE on its 810 and 820 NICs.
> > Broadcom also claims to support RoCE, but doesn't mention FreeBSD
> > drivers and Chelsio does iWARP, afaik.
> >
> > For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
> > iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
> > might just be more convenient to set up the siw driver in Linux vs the
> > rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA guy.)
> >
> > But it does look like a fun project for the next year. (I recall jhb@ mentioning
> > that NFS-over-TLS wouldn't be easy and it turned out to be a fun
> > little project.)
>
> Konstantin is right though that sys/ofed is Linux OpenIB and has an interface
> that should let you do RDMA (both RoCEv2 and iWARP).  I'm hoping to use the APIs
> in sys/ofed to support NVMe over RDMA (both RoCEv2 and iWARP) at some point as
> well.
> > rick
> >
> >>
> >>>
> >>> Btw, if anyone is interested in taking a more active role in this,
> >>> they are more than welcome to do so. (I'm going to be starting where I
> >>> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
> >>> but will probably end up there. I have already had one offer w.r.t. access
> >>> to a lab that includes Mellanox hardware, but I don't know if remote
> >>> debugging will be practical yet.)
> >>>
> >>> rick
> >>>
> >>>>
> >>>> The IB implementation for us is still called OFED for historical reasons,
> >>>> and it is located in sys/ofed.
> >>>>
> >>>>>
> >>>>> As for testing, I am planning on hacking away at one of the RDMA
> >>>>> in software drivers in Linux to get it working well enough to use for
> >>>>> testing. Whatever seems to be easiest to get kinda working.
> >>>> Yes, the rxe driver is the sw RoCE v2 implementation.  We looked at the
> >>>> amount of work to port it.  Its size is ~12 kLoC, and it is compatible
> >>>> with libibverbs (the userspace core InfiniBand interface).
>
> Interesting.  I'm currently working on merging back several OFED commits from
> Linux to sys/ofed (currently I have about 30 commits merged, some older than
> Hans' last merge and some newer; some of the newer ones should permit removing
> compat stubs for some of the newer APIs that are duplicated in bnxt, irdma, and
> mlx*).  When I get a bit further along I'll post the branch I have for more
> testing (it is a bunch of individual cherry-picks rather than a giant merge).
>
> Porting over rxe could be useful for me as well for some work I am doing.
I have https://github.com/rmacklem/freebsd-rdma. For now, I'll only be doing
commits to it for the NFS and krpc files.  It will be a while before anything in
it is useful for others.

I'll email when I get into the rxe port. (If you hurry, you can beat me to it;-)

Others are welcome to push/pull on the above. (Email if you need permissions
changes. I know diddly about github.)

rick

>
> --
> John Baldwin
>