Re: RFC: NFS over RDMA

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Mon, 03 Nov 2025 14:35:56 UTC
On 11/1/25 17:26, Rick Macklem wrote:
> On Sat, Nov 1, 2025 at 2:10 PM Konstantin Belousov <kib@freebsd.org> wrote:
>>
>> On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
>>> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
>>>>
>>>> Added Slava Schwartsman.
>>>>
>>>> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
>>>>> Hi,
>>>>>
>>>>> I've had NFS over RDMA on my todo list for a very loonnnggg
>>>>> time. I've avoided it because I haven't had a way to test it,
>>>>> but I'm now going to start working on it. (A bunch of this work
>>>>> is already done for NFS-over-TLS, which added code for handling
>>>>> M_EXTPG mbufs.)
>>>>>
>>>>> From RFC-8166, there appear to be four operations the krpc
>>>>> needs to do:
>>>>> send-rdma - Send on the payload stream (sending messages that
>>>>>             are kept in order).
>>>>> recv-rdma - Receive the above.
>>>>> ddp-write - Do a write of DDP data.
>>>>> ddp-read  - Do a read of DDP data.
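Those four map pretty directly onto work request opcodes in the kernel verbs
KPI.  The opcodes below are real verbs names, but take the mapping itself as
my guess, not a reviewed design:

/*
 * Rough mapping of the four RFC-8166 operations onto the kernel
 * verbs KPI in sys/ofed (sketch only):
 *
 *   send-rdma -> ib_post_send() with opcode IB_WR_SEND
 *   recv-rdma -> ib_post_recv() into a pre-posted receive buffer
 *   ddp-write -> ib_post_send() with opcode IB_WR_RDMA_WRITE
 *   ddp-read  -> ib_post_send() with opcode IB_WR_RDMA_READ
 */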
>>>>>
>>>>> So, here is how I see the krpc doing this.
>>>>> An NFS write RPC for example:
>>>>> - The NFS client code packages the Write RPC XDR as follows:
>>>>>    - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
>>>>>       that precede the write data.
>>>>>    - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
>>>>>    - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
>>>>>      written.
>>>>>    - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
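For concreteness, a minimal sketch of that packaging, assuming M_PROTO1 is
used as the marker (nfsm_build_write_chain() is a hypothetical name; m_get()
and m_last() are the existing mbuf KPI):

/*
 * Hypothetical sketch of building the Write RPC chain described
 * above.  The caller supplies the leading XDR, the M_EXTPG data
 * mbufs, and any trailing XDR; the M_PROTO1 marker is the only new
 * piece.
 */
static struct mbuf *
nfsm_build_write_chain(struct mbuf *args, struct mbuf *data,
    struct mbuf *tail)
{
	struct mbuf *marker;

	/* Empty mbuf whose only job is to carry the proposed flag. */
	marker = m_get(M_WAITOK, MT_DATA);
	marker->m_len = 0;
	marker->m_flags |= M_PROTO1;	/* "start of ddp-read" */

	/* args -> marker -> M_EXTPG data -> trailing XDR (optional) */
	m_last(args)->m_next = marker;
	marker->m_next = data;
	if (tail != NULL)
		m_last(data)->m_next = tail;
	return (args);
}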
>>>>>
>>>>> This would be passed to the krpc which would...
>>>>>   - send the mbufs up to "start of ddp" in the payload stream.
>>>>>   - specify a ddp-read for the pages from the M_EXTPG mbufs
>>>>>     and send that in the payload stream.
>>>>>   - send the remaining mbufs/mbuf_clusters in the payload stream.
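A sketch of that split on the krpc side (struct rdma_xprt,
krpc_rdma_send_inline(), and krpc_rdma_post_chunk() are hypothetical names,
and in RFC-8166 terms the chunk is not literally sent inline: the pages
would be registered and advertised as a read chunk in the transport header):

/*
 * Hypothetical send path: everything before the M_PROTO1 marker goes
 * into the payload stream; the M_EXTPG mbufs after it become the DDP
 * (read) chunk.  Trailing XDR after the chunk is ignored for brevity.
 */
static int
krpc_rdma_send(struct rdma_xprt *xprt, struct mbuf *m)
{
	struct mbuf *head = m, *prev = NULL;
	int error;

	/* Find the "start of ddp" marker, if there is one. */
	while (m != NULL && (m->m_flags & M_PROTO1) == 0) {
		prev = m;
		m = m->m_next;
	}
	if (m == NULL)
		return (krpc_rdma_send_inline(xprt, head));

	/* Detach and send the XDR that precedes the marker. */
	if (prev != NULL) {
		prev->m_next = NULL;
		error = krpc_rdma_send_inline(xprt, head);
		if (error != 0)
			return (error);
	}

	/* m_free() drops the marker and returns the M_EXTPG chunk. */
	return (krpc_rdma_post_chunk(xprt, m_free(m)));
}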
>>>>>
>>>>> The NFS server end would process the received payload stream,
>>>>> putting the non-ddp stuff in mbufs/mbuf_clusters.
>>>>> It would do the ddp-read of the data into anonymous pages it allocates
>>>>> and would associate these with M_EXTPG mbufs.
>>>>> It would put any remaining payload stream stuff for the RPC message in
>>>>> additional mbufs/mbuf_clusters.
>>>>> --> Call the NFS server with the mbuf list for processing.
>>>>>       - When the NFS server gets to the write data (in M_EXTPG mbufs)
>>>>>         it would set up a uio/iovec for the pages and call VOP_WRITE().
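Building the uio for that VOP_WRITE() from the M_EXTPG pages could look
roughly like this (assumes a direct map, i.e. PHYS_TO_DMAP on amd64; the
m_epg_* fields and m_epg_pagelen() are the existing M_EXTPG accessors, while
nfsrv_extpg_to_uio() is a made-up name):

/*
 * Sketch: build an iovec array over the pages of a run of M_EXTPG
 * mbufs so the data can be handed to VOP_WRITE().  uio_offset and
 * uio_td are left for the caller; TLS-style header/trailer bytes
 * are ignored here.
 */
static int
nfsrv_extpg_to_uio(struct mbuf *m, struct iovec *iov, int maxiov,
    struct uio *uio)
{
	ssize_t resid = 0;
	int i, len, niov = 0, pgoff;

	for (; m != NULL && (m->m_flags & M_EXTPG) != 0; m = m->m_next) {
		for (i = 0; i < m->m_epg_npgs; i++) {
			pgoff = (i == 0) ? m->m_epg_1st_off : 0;
			len = m_epg_pagelen(m, i, pgoff);
			if (niov == maxiov)
				return (EFBIG);
			iov[niov].iov_base =
			    (char *)PHYS_TO_DMAP(m->m_epg_pa[i]) + pgoff;
			iov[niov].iov_len = len;
			resid += len;
			niov++;
		}
	}
	uio->uio_iov = iov;
	uio->uio_iovcnt = niov;
	uio->uio_resid = resid;
	uio->uio_segflg = UIO_SYSSPACE;
	uio->uio_rw = UIO_WRITE;
	return (0);
}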
>>>>>
>>>>> Now, the above is straightforward for me, since I know the NFS and
>>>>> krpc code fairly well.
>>>>> But that is where my expertise ends.
>>>>>
>>>>> So, what kind of calls do the drivers provide to send and receive
>>>>> what RFC-8166 calls the payload stream?
>>>>>
>>>>> And what kind of calls do the drivers provide to write and read DDP
>>>>> chunks?
>>>>>
>>>>> Also, if the above sounds way off the mark, please let me know.
>>>>
>>>> What you need is, most likely, the infiniband API or KPI to handle
>>>> RDMA.  It is driver-independent, just as NFS over IP uses the system
>>>> IP stack rather than calling ethernet drivers directly.  In fact, the
>>>> transport used would most likely not be native IB, but IB over UDP
>>>> (RoCE v2).
>>>>
>>>> IB verbs, which is the official interface for both kernel and user mode,
>>>> are not well documented.  An overview is provided by the document
>>>> titled "RDMA Aware Networks Programming User Manual", which should
>>>> be google-able.  Otherwise, the Infiniband specification is the reference.
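To make the verbs KPI a little more concrete, posting the server side of a
ddp-read with the KPI in sys/ofed would look roughly like this (the qp, the
local sg list, and the remote rkey/offset all come from connection setup and
the read chunk in the RPC-over-RDMA header; the exact ib_post_send()
signature has shifted between OFED versions, so take this as a sketch):

#include <rdma/ib_verbs.h>

/* Sketch: post an RDMA Read to pull the advertised chunk. */
static int
krpc_post_rdma_read(struct ib_qp *qp, struct ib_sge *sgl, int nsge,
    uint64_t remote_addr, uint32_t rkey, uint64_t wr_id)
{
	struct ib_rdma_wr wr;
	struct ib_send_wr *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.wr.wr_id = wr_id;		/* matched up at CQ completion */
	wr.wr.opcode = IB_WR_RDMA_READ;
	wr.wr.send_flags = IB_SEND_SIGNALED;
	wr.wr.sg_list = sgl;
	wr.wr.num_sge = nsge;
	wr.remote_addr = remote_addr;	/* chunk offset from the client */
	wr.rkey = rkey;			/* chunk handle from the client */

	return (ib_post_send(qp, &wr.wr, &bad_wr));
}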
>>> Thanks. I'll look at that. (I notice that the Intel code references something
>>> they call Linux-OpenIB. Hopefully that looks about the same and the
>>> glue needed to support non-Mellanox drivers isn't too difficult?)
>> OpenIB perhaps refers to the IB code in the Linux kernel proper plus
>> the userspace libraries from rdma-core.  This is what was forked/grown
>> from OFED.
>>
>> Intel put its efforts into iWARP, which is a sort of alternative to
>> RoCEv2.  It has RFCs and works over TCP AFAIR, which causes problems
>> for it.
> Heh, heh. I'm trying to avoid the iWARP vs RoCE wars. ;-)
> (I did see a Mellanox white paper with graphs showing how RoCE outperforms
> iWARP.)
> Intel currently claims to support RoCE on its 810 and 820 NICs.
> Broadcom also claims to support RoCE, but doesn't mention FreeBSD
> drivers and Chelsio does iWARP, afaik.
> 
> For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
> iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
> might just be more convenient to set up the siw driver in Linux vs the
> rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA guy.)
> 
> But it does look like a fun project for the next year. (I recall jhb@ mentioning
> that NFS-over-TLS wouldn't be easy and it turned out to be a fun
> little project.)

Konstantin is right though that sys/ofed is Linux OpenIB and has an interface
that should let you do RDMA (both RoCEv2 and iWARP).  I'm hoping to use the APIs
in sys/ofed to support NVMe over RDMA (both RoCEv2 and iWARP) at some point as
well.
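
As a taste of that interface, building the scatter/gather list for a verbs
send from an mbuf chain might look like this (regular mapped mbufs only;
M_EXTPG mbufs have no valid m_data and would need to be mapped page by page;
the lkey would typically be pd->local_dma_lkey, and ib_dma_mapping_error()
checks are omitted):

/* Sketch: map a chain of regular mbufs into a verbs sg list. */
static int
krpc_mbuf_to_sgl(struct ib_device *dev, struct mbuf *m,
    struct ib_sge *sgl, int maxsge, uint32_t lkey)
{
	int n;

	for (n = 0; m != NULL; m = m->m_next) {
		if (m->m_len == 0)
			continue;
		if (n == maxsge)
			return (-1);
		sgl[n].addr = ib_dma_map_single(dev, m->m_data,
		    m->m_len, DMA_TO_DEVICE);
		sgl[n].length = m->m_len;
		sgl[n].lkey = lkey;
		n++;
	}
	return (n);
}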
> rick
> 
>>
>>>
>>> Btw, if anyone is interested in taking a more active involvement in this,
>>> they are more than welcome to do so. (I'm going to be starting where I
>>> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
>>> but will probably end up there. I have already had one offer w.r.t. access
>>> to a lab that includes Mellanox hardware, but I don't know if remote
>>> debugging will be practical yet.)
>>>
>>> rick
>>>
>>>>
>>>> The IB implementation for us is still called OFED for historical reasons,
>>>> and it is located in sys/ofed.
>>>>
>>>>>
>>>>> As for testing, I am planning on hacking away at one of the
>>>>> software RDMA drivers in Linux to get it working well enough to use
>>>>> for testing.  Whatever seems to be easiest to get kinda working.
>>>> Yes, the rxe driver is the sw RoCE v2 implementation.  We looked at
>>>> the amount of work to port it.  Its size is ~12 kLoC, and it is
>>>> compatible with libibverbs (the userspace core infiniband interface).

Interesting.  I'm currently working on merging back several OFED commits from
Linux to sys/ofed (currently I have about 30 commits merged, some older than
Hans' last merge and some newer; some of the newer ones should permit removing
compat stubs for APIs that are duplicated in bnxt, irdma, and mlx*).  When I
get a bit further along I'll post the branch I have for more testing (it is a
bunch of individual cherry-picks rather than a giant merge).

Porting over rxe could be useful for me as well for some work I am doing.

-- 
John Baldwin