Re: RFC: NFS over RDMA

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Wed, 05 Nov 2025 15:52:37 UTC
On Wed, Nov 5, 2025 at 7:47 AM Rick Macklem <rick.macklem@gmail.com> wrote:
>
> On Mon, Nov 3, 2025 at 10:10 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Mon, Nov 3, 2025 at 6:35 AM John Baldwin <jhb@freebsd.org> wrote:
> > >
> > > On 11/1/25 17:26, Rick Macklem wrote:
> > > > On Sat, Nov 1, 2025 at 2:10 PM Konstantin Belousov <kib@freebsd.org> wrote:
> > > >>
> > > >> On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> > > >>> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
> > > >>>>
> > > >>>> Added Slava Schwartsman.
> > > >>>>
> > > >>>> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> I've had NFS over RDMA on my todo list for a very loonnnggg
> > > >>>>> time. I've avoided it because I haven't had a way to test it,
> > > >>>>> but I'm now going to start working on it. (A bunch of this work
> > > >>>>> is already done for NFS-over-TLS which added code for handling
> > > >>>>> M_EXTPG mbufs.)
> > > >>>>>
> > > > >>>>> From RFC-8166, there appear to be 4 operations the krpc
> > > >>>>> needs to do:
> > > >>>>> send-rdma - Send on the payload stream (sending messages that
> > > >>>>>                      are kept in order).
> > > >>>>> recv-rdma - Receive the above.
> > > >>>>> ddp-write - Do a write of DDP data.
> > > >>>>> ddp-read - Do a read of DDP data.
> > > >>>>>
> > > >>>>> So, here is how I see the krpc doing this.
> > > >>>>> An NFS write RPC for example:
> > > >>>>> - The NFS client code packages the Write RPC XDR as follows:
> > > >>>>>    - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> > > >>>>>       that precede the write data.
> > > >>>>>    - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> > > > >>>>>    - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> > > >>>>>      written.
> > > >>>>>    - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> > > >>>>>
> > > >>>>> This would be passed to the krpc which would...
> > > > >>>>>   - send the mbufs up to "start of ddp" in the payload stream.
> > > > >>>>>   - specify a ddp-read for the pages from the M_EXTPG mbufs
> > > >>>>>     and send that in the payload stream.
> > > >>>>>   - send the remaining mbufs/mbuf_clusters in the payload stream
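
A minimal userspace sketch of the split described above, assuming a
simplified chain element in place of the real struct mbuf. The names
seg, seg_kind, rpc_plan and plan_rpc() are hypothetical and only model
the bookkeeping: how many bytes go inline in the payload stream before
and after the marker, and how many bytes sit in M_EXTPG pages that
would be offered for a ddp-read. It does no RDMA and is not krpc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the mbuf kinds in the proposal. */
enum seg_kind { SEG_XDR, SEG_DDP_MARK, SEG_EXTPG };

struct seg {
	enum seg_kind kind;
	size_t len;		/* bytes carried by this element */
	struct seg *next;
};

struct rpc_plan {
	size_t head_len;	/* inline bytes before the DDP data */
	size_t ddp_len;		/* bytes to offer for a ddp-read */
	size_t tail_len;	/* inline bytes after the DDP data */
	int ddp_mbufs;		/* number of M_EXTPG-like elements */
};

/*
 * Walk the chain once and classify each element.
 * Returns 0 on success, -1 if the chain does not have the shape
 * "XDR..., marker, EXTPG..., XDR...".
 */
int
plan_rpc(const struct seg *chain, struct rpc_plan *plan)
{
	enum { HEAD, DDP, TAIL } state = HEAD;
	const struct seg *s;

	assert(plan != NULL);
	plan->head_len = plan->ddp_len = plan->tail_len = 0;
	plan->ddp_mbufs = 0;

	for (s = chain; s != NULL; s = s->next) {
		switch (s->kind) {
		case SEG_XDR:
			if (state == HEAD) {
				plan->head_len += s->len;
				break;
			}
			/* A marker must be followed by page data. */
			if (state == DDP && plan->ddp_mbufs == 0)
				return (-1);
			state = TAIL;
			plan->tail_len += s->len;
			break;
		case SEG_DDP_MARK:
			/* Only one marker, and only before any tail. */
			if (state != HEAD)
				return (-1);
			state = DDP;
			break;
		case SEG_EXTPG:
			/* Page data is only valid right after the marker. */
			if (state != DDP)
				return (-1);
			plan->ddp_len += s->len;
			plan->ddp_mbufs++;
			break;
		}
	}
	if (state == DDP && plan->ddp_mbufs == 0)
		return (-1);
	return (0);
}
```

A chain with no marker at all is accepted and ends up entirely in
head_len, which corresponds to an RPC with no DDP-eligible data.
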
> > > >>>>>
> > > >>>>> The NFS server end would process the received payload stream,
> > > >>>>> putting the non-ddp stuff in mbufs/mbuf_clusters.
> > > >>>>> It would do the ddp-read of the data into anonymous pages it allocates
> > > >>>>> and would associate these with M_EXTPG mbufs.
> > > >>>>> It would put any remaining payload stream stuff for the RPC message in
> > > >>>>> additional mbufs/mbuf_clusters.
> > > >>>>> --> Call the NFS server with the mbuf list for processing.
> > > >>>>>       - When the NFS server gets to the write data (in M_EXTPG mbufs)
> > > >>>>>         it would set up a uio/iovec for the pages and call VOP_WRITE().
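
As a rough userspace analogue of that last step, here is a sketch that
builds an iovec array over a set of page-sized buffers. The helper
pages_to_iov() and the constant SKETCH_PAGE_SIZE are hypothetical;
only struct iovec from <sys/uio.h> is a real interface. In the kernel
the iovec would be wrapped in a struct uio before the VOP_WRITE()
call, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for the sketch */

/*
 * Describe "len" bytes of data, starting "first_off" bytes into the
 * first page, as one iovec entry per page touched.
 * Returns the number of entries filled in, or -1 if the data does not
 * fit in the pages or in the iov array.
 */
int
pages_to_iov(void *const pages[], int npages, size_t first_off, size_t len,
    struct iovec *iov, int maxiov)
{
	size_t off = first_off;
	int n = 0;

	assert(iov != NULL);
	if (first_off >= SKETCH_PAGE_SIZE)
		return (-1);
	for (int i = 0; i < npages && len > 0; i++) {
		size_t avail = SKETCH_PAGE_SIZE - off;
		size_t take = len < avail ? len : avail;

		if (n == maxiov)
			return (-1);
		iov[n].iov_base = (char *)pages[i] + off;
		iov[n].iov_len = take;
		n++;
		len -= take;
		off = 0;	/* only the first page has an offset */
	}
	return (len == 0 ? n : -1);
}
```

In userspace the resulting array could be handed to writev(2), which
is the closest equivalent of giving the pages to the file system
without first copying them into one contiguous buffer.
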
> > > >>>>>
> > > >>>>> Now, the above is straightforward for me, since I know the NFS and
> > > >>>>> krpc code fairly well.
> > > >>>>> But that is where my expertise ends.
> > > >>>>>
> > > >>>>> So, what kind of calls do the drivers provide to send and receive
> > > >>>>> what RFC-8166 calls the payload stream?
> > > >>>>>
> > > >>>>> And what kind of calls do the drivers provide to write and read DDP
> > > >>>>> chunks?
> > > >>>>>
> > > >>>>> Also, if the above sounds way off the mark, please let me know.
> > > >>>>
> > > >>>> What you need is, most likely, the infiniband API or KPI to handle
> > > > >>>> RDMA.  It is driver-independent, just as NFS over IP uses the system IP
> > > > >>>> stack rather than calling the Ethernet drivers.  In fact, the transport
> > > > >>>> used would most likely be not native IB, but IB over UDP (RoCE v2).
> > > >>>>
> > > >>>> IB verbs, which is the official interface for both kernel and user mode,
> > > >>>> are not well documented.  An overview is provided by the document
> > > >>>> titled "RDMA Aware Networks Programming User Manual", which should
> > > > >>>> be google-able.  Otherwise, the Infiniband specification is the reference.
> This manual is good at explaining how things work, but the detailed example
> isn't very useful (the verbs it uses aren't in the kernel, etc). It might be
> more useful for userspace library use?
Just fyi, the functions named rdma_XXX() seem to be the ones used
to get things set up and then the ones named ib_XXX() are used for
the actual I/O. (The manual has ones named ibv_XXX(), which don't
exist in the kernel code, afaik.)

rick

>
> The good news is I found a file in the Linux kernel sources which I
> find quite readable (it does rdma for their krpc).
> The really good news is that it is dual licensed, so I think it can
> be pulled into FreeBSD without problems.
> I haven't yet decided if I want to try and keep it mostly intact (so that
> bugfixes can be pulled from Linux for it) or just hack it up to get
> what I want from it. (The Linux krpc, etc. is quite different, so it
> would need a lot of #ifdef FreeBSD in it.)
>
> Anyhow, here is the copyright, to double check this is ok in FreeBSD?
>
> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> /*
>  * Copyright (c) 2014-2017 Oracle.  All rights reserved.
>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>  *
>  * This software is available to you under a choice of one of two
>  * licenses.  You may choose to be licensed under the terms of the GNU
>  * General Public License (GPL) Version 2, available from the file
>  * COPYING in the main directory of this source tree, or the BSD-type
>  * license below:
>  *
>  * Redistribution and use in source and binary forms, with or without
>  * modification, are permitted provided that the following conditions
>  * are met:
>  *
>  *      Redistributions of source code must retain the above copyright
>  *      notice, this list of conditions and the following disclaimer.
>  *
>  *      Redistributions in binary form must reproduce the above
>  *      copyright notice, this list of conditions and the following
>  *      disclaimer in the documentation and/or other materials provided
>  *      with the distribution.
>  *
>  *      Neither the name of the Network Appliance, Inc. nor the names of
>  *      its contributors may be used to endorse or promote products
>  *      derived from this software without specific prior written
>  *      permission.
>  *
>  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>  * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>  * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>  * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>  * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>  * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>  */
>
> rick
>
> > > >>> Thanks. I'll look at that. (I notice that the Intel code references something
> > > >>> they call Linux-OpenIB. Hopefully that looks about the same and the
> > > >>> glue needed to support non-Mellanox drivers isn't too difficult?)
> > > > >> OpenIB is perhaps a reference to the IB code in the Linux kernel proper
> > > >> plus userspace libraries from rdma-core.  This is what was forked/grown
> > > >> from OFED.
> > > >>
> > > > >> Intel put effort into iWARP, which is a sort of alternative to RoCEv2.
> > > >> It has RFCs and works over TCP AFAIR, which causes problems for it.
> > > > Heh, heh. I'm trying to avoid the iWARP vs RoCE wars.;-)
> > > > (I did see a Mellanox white paper with graphs showing how RoCE outperforms
> > > > iWARP.)
> > > > Intel currently claims to support RoCE on its 810 and 820 NICs.
> > > > Broadcom also claims to support RoCE, but doesn't mention FreeBSD
> > > > drivers and Chelsio does iWARP, afaik.
> > > >
> > > > For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
> > > > iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
> > > > might just be more convenient to set up the siw driver in Linux vs the
> > > > rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA guy.)
> > > >
> > > > But it does look like a fun project for the next year. (I recall jhb@ mentioning
> > > > that NFS-over-TLS wouldn't be easy and it turned out to be a fun
> > > > little project.)
> > >
> > > Konstantin is right though that sys/ofed is Linux OpenIB and has an interface
> > > that should let you do RDMA (both RoCEv2 and iWARP).  I'm hoping to use the APIs
> > > in sys/ofed to support NVMe over RDMA (both RoCEv2 and iWARP) at some point as
> > > well.
> > > > rick
> > > >
> > > >>
> > > >>>
> > > >>> Btw, if anyone is interested in taking a more active involvement in this,
> > > >>> they are more than welcome to do so. (I'm going to be starting where I
> > > >>> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
> > > >>> but will probably end up there. I have already had one offer w.r.t. access
> > > >>> to a lab that includes Mellanox hardware, but I don't know if remote
> > > >>> debugging will be practical yet.)
> > > >>>
> > > >>> rick
> > > >>>
> > > >>>>
> > > >>>> The IB implementation for us is still called OFED for historical reasons,
> > > >>>> and it is located in sys/ofed.
> > > >>>>
> > > >>>>>
> > > >>>>> As for testing, I am planning on hacking away at one of the RDMA
> > > >>>>> in software drivers in Linux to get it working well enough to use for
> > > >>>>> testing. Whatever seems to be easiest to get kinda working.
> > > > >>>> Yes, the rxe driver is the sw RoCE v2 implementation.  We looked at the
> > > > >>>> amount of work to port it.  It is ~12 kLoC in size and is compatible
> > > > >>>> with libibverbs (the userspace core infiniband interface).
> > >
> > > Interesting.  I'm currently working on merging back several OFED commits from
> > > Linux to sys/ofed (currently I have about 30 commits merged, some older than
> > > Hans' last merge, and some newer, some of the newer ones should permit removing
> > > compat stubs for some of the newer APIs that are duplicated in bnxt, irdma, and
> > > mlx*).  When I get a bit further along I'll post the branch I have for more
> > > testing (it is a bunch of individual cherry-picks rather than a giant merge).
> > >
> > > Porting over rxe could be useful for me as well for some work I am doing.
> > I have https://github.com/rmacklem/freebsd-rdma. For now, I'll only be doing
> > commits to it for the NFS and krpc files.  It will be a while before anything in
> > it is useful for others.
> >
> > I'll email when I get into the rxe port. (If you hurry, you can beat me to it;-)
> >
> > Others are welcome to push/pull on the above. (Email if you need permissions
> > changes. I know diddly about github.)
> >
> > rick
> >
> > >
> > > --
> > > John Baldwin
> > >