Re: RFC: NFS over RDMA
- In reply to: Rick Macklem : "Re: RFC: NFS over RDMA"
Date: Wed, 05 Nov 2025 15:52:37 UTC
On Wed, Nov 5, 2025 at 7:47 AM Rick Macklem <rick.macklem@gmail.com> wrote:
>
> On Mon, Nov 3, 2025 at 10:10 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Mon, Nov 3, 2025 at 6:35 AM John Baldwin <jhb@freebsd.org> wrote:
> > >
> > > On 11/1/25 17:26, Rick Macklem wrote:
> > > > On Sat, Nov 1, 2025 at 2:10 PM Konstantin Belousov <kib@freebsd.org> wrote:
> > > >>
> > > >> On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> > > >>> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
> > > >>>>
> > > >>>> Added Slava Schwartsman.
> > > >>>>
> > > >>>> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> I've had NFS over RDMA on my todo list for a very loonnnggg
> > > >>>>> time. I've avoided it because I haven't had a way to test it,
> > > >>>>> but I'm now going to start working on it. (A bunch of this work
> > > >>>>> is already done for NFS-over-TLS which added code for handling
> > > >>>>> M_EXTPG mbufs.)
> > > >>>>>
> > > >>>>> From RFC-8166, there appear to be 4 operations the krpc
> > > >>>>> needs to do:
> > > >>>>> send-rdma - Send on the payload stream (sending messages that
> > > >>>>>             are kept in order).
> > > >>>>> recv-rdma - Receive the above.
> > > >>>>> ddp-write - Do a write of DDP data.
> > > >>>>> ddp-read - Do a read of DDP data.
> > > >>>>>
> > > >>>>> So, here is how I see the krpc doing this.
> > > >>>>> An NFS write RPC for example:
> > > >>>>> - The NFS client code packages the Write RPC XDR as follows:
> > > >>>>>   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> > > >>>>>     that precede the write data.
> > > >>>>>   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> > > >>>>>   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> > > >>>>>     written.
> > > >>>>>   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> > > >>>>>
> > > >>>>> This would be passed to the krpc which would...
> > > >>>>> - the mbufs up to "start of ddp" in the payload stream.
> > > >>>>> - Would specify a ddp-read for the pages from the M_EXTPG mbufs
> > > >>>>>   and send that in the payload stream.
> > > >>>>> - send the remaining mbufs/mbuf_clusters in the payload stream
> > > >>>>>
> > > >>>>> The NFS server end would process the received payload stream,
> > > >>>>> putting the non-ddp stuff in mbufs/mbuf_clusters.
> > > >>>>> It would do the ddp-read of the data into anonymous pages it allocates
> > > >>>>> and would associate these with M_EXTPG mbufs.
> > > >>>>> It would put any remaining payload stream stuff for the RPC message in
> > > >>>>> additional mbufs/mbuf_clusters.
> > > >>>>> --> Call the NFS server with the mbuf list for processing.
> > > >>>>> - When the NFS server gets to the write data (in M_EXTPG mbufs)
> > > >>>>>   it would set up a uio/iovec for the pages and call VOP_WRITE().
> > > >>>>>
> > > >>>>> Now, the above is straightforward for me, since I know the NFS and
> > > >>>>> krpc code fairly well.
> > > >>>>> But that is where my expertise ends.
> > > >>>>>
> > > >>>>> So, what kind of calls do the drivers provide to send and receive
> > > >>>>> what RFC-8166 calls the payload stream?
> > > >>>>>
> > > >>>>> And what kind of calls do the drivers provide to write and read DDP
> > > >>>>> chunks?
> > > >>>>>
> > > >>>>> Also, if the above sounds way off the mark, please let me know.
> > > >>>>
> > > >>>> What you need is, most likely, the infiniband API or KPI to handle
> > > >>>> RDMA. It is driver-independent, same as for ip NFS you use system IP
> > > >>>> stack and not call to ethernet drivers. In fact, most likely the
> > > >>>> transport used would be not native IB, but IB over UDP (RoCE v2).
> > > >>>>
> > > >>>> IB verbs, which is the official interface for both kernel and user mode,
> > > >>>> are not well documented.
> > > >>>> An overview is provided by the document
> > > >>>> titled "RDMA Aware Networks Programming User Manual", which should
> > > >>>> be google-able. Otherwise, the Infiniband specification is the reference.
> This manual is good at explaining how things work, but the detailed example
> isn't very useful (the verbs it uses aren't in the kernel, etc). It
> might be more useful for userspace library use?

Just fyi, the functions named rdma_XXX() seem to be the ones used to get
things set up and then the ones named ib_XXX() are used for the actual I/O.
(The manual has ones named ibv_XXX(), which don't exist in the kernel
code, afaik.)

rick

>
> The good news is I found a file in the Linux kernel sources which I
> find quite readable (it does rdma for their krpc).
> The really good news is that it is dual licensed, so I think it can
> be pulled into FreeBSD without problems.
> I haven't yet decided if I want to try and keep it mostly intact (so that
> bugfixes can be pulled from Linux for it) or just hack it up to get
> what I want from it. (The Linux krpc, etc. is quite different, so it
> would need a lot of #ifdef FreeBSD in it.)
>
> Anyhow, here is the copyright, to double check this is ok in FreeBSD?
>
> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> /*
> * Copyright (c) 2014-2017 Oracle. All rights reserved.
> * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> * licenses. You may choose to be licensed under the terms of the GNU
> * General Public License (GPL) Version 2, available from the file
> * COPYING in the main directory of this source tree, or the BSD-type
> * license below:
> *
> * Redistribution and use in source and binary forms, with or without
> * modification, are permitted provided that the following conditions
> * are met:
> *
> * Redistributions of source code must retain the above copyright
> * notice, this list of conditions and the following disclaimer.
> *
> * Redistributions in binary form must reproduce the above
> * copyright notice, this list of conditions and the following
> * disclaimer in the documentation and/or other materials provided
> * with the distribution.
> *
> * Neither the name of the Network Appliance, Inc. nor the names of
> * its contributors may be used to endorse or promote products
> * derived from this software without specific prior written
> * permission.
> *
> * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> */
>
> rick
>
> > > >>> Thanks. I'll look at that. (I notice that the Intel code references something
> > > >>> they call Linux-OpenIB. Hopefully that looks about the same and the
> > > >>> glue needed to support non-Mellanox drivers isn't too difficult?)
> > > >> OpenIB is perhaps the reference to the IB code in Linux kernel proper
> > > >> plus userspace libraries from rdma-core. This is what was forked/grown
> > > >> from OFED.
> > > >>
> > > >> Intel put efforts into the iWARP, which is sort of alternative for RoCEv2.
> > > >> It has RFCs and works over TCP AFAIR, which causes problems for it.
> > > > Heh, heh.
> > > > I'm trying to avoid the iWARP vs RoCE wars.;-)
> > > > (I did see a Mellanox white paper with graphs showing how RoCE outperforms
> > > > iWARP.)
> > > > Intel currently claims to support RoCE on its 810 and 820 NICs.
> > > > Broadcom also claims to support RoCE, but doesn't mention FreeBSD
> > > > drivers and Chelsio does iWARP, afaik.
> > > >
> > > > For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
> > > > iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
> > > > might just be more convenient to set up the siw driver in Linux vs the
> > > > rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA guy.)
> > > >
> > > > But it does look like a fun project for the next year. (I recall jhb@ mentioning
> > > > that NFS-over-TLS wouldn't be easy and it turned out to be a fun
> > > > little project.)
> > >
> > > Konstantin is right though that sys/ofed is Linux OpenIB and has an interface
> > > that should let you do RDMA (both ROCEv2 and iWARP). I'm hoping to use the APIs
> > > in sys/ofed to support NVMe over RDMA (both ROCEv2 and iWARP) at some point as
> > > well.
> > > > rick
> > > >
> > > >>
> > > >>>
> > > >>> Btw, if anyone is interested in taking a more active involvement in this,
> > > >>> they are more than welcome to do so. (I'm going to be starting where I
> > > >>> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
> > > >>> but will probably end up there. I have already had one offer w.r.t. access
> > > >>> to a lab that includes Mellanox hardware, but I don't know if remote
> > > >>> debugging will be practical yet.)
> > > >>>
> > > >>> rick
> > > >>>
> > > >>>>
> > > >>>> The IB implementation for us is still called OFED for historical reasons,
> > > >>>> and it is located in sys/ofed.
> > > >>>>
> > > >>>>>
> > > >>>>> As for testing, I am planning on hacking away at one of the RDMA
> > > >>>>> in software drivers in Linux to get it working well enough to use for
> > > >>>>> testing. Whatever seems to be easiest to get kinda working.
> > > >>>> Yes rxe driver is the sw RoCE v2 implementation. We looked at the
> > > >>>> amount of work to port it. Its size is ~12 kLoC, which is compatible
> > > >>>> with libibverbs (userspace core infiniband interface).
> > >
> > > Interesting. I'm currently working on merging back several OFED commits from
> > > Linux to sys/ofed (currently I have about 30 commits merged, some older than
> > > Hans' last merge, and some newer, some of the newer ones should permit removing
> > > compat stubs for some of the newer APIs that are duplicated in bnxt, irdma, and
> > > mlx*). When I get a bit further along I'll post the branch I have for more
> > > testing (it is a bunch of individual cherry-picks rather than a giant merge).
> > >
> > > Porting over rxe could be useful for me as well for some work I am doing.
> > I have https://github.com/rmacklem/freebsd-rdma. For now, I'll only be doing
> > commits to it for the NFS and krpc files. It will be a while before anything in
> > it is useful for others.
> >
> > I'll email when I get into the rxe port. (If you hurry, you can beat me to it;-)
> >
> > Others are welcome to push/pull on the above. (Email if you need permissions
> > changes. I know diddly about github.)
> >
> > rick
> >
> > >
> > > --
> > > John Baldwin
> > >