From: Rick Macklem
Date: Wed, 5 Nov 2025 07:52:37 -0800
Subject: Re: RFC: NFS over RDMA
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
To: John Baldwin
Cc:
 Konstantin Belousov, FreeBSD CURRENT, Navdeep Parhar,
 "erj@freebsd.org", "aehrenberg@nvidia.com", slavash@nvidia.com,
 "sreekanth.reddy@broadcom.com"

On Wed, Nov 5, 2025 at 7:47 AM Rick Macklem wrote:
>
> On Mon, Nov 3, 2025 at 10:10 PM Rick Macklem wrote:
> >
> > On Mon, Nov 3, 2025 at 6:35 AM John Baldwin wrote:
> > >
> > > On 11/1/25 17:26, Rick Macklem wrote:
> > > > On Sat, Nov 1, 2025 at 2:10 PM Konstantin Belousov wrote:
> > > >>
> > > >> On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> > > >>> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov wrote:
> > > >>>>
> > > >>>> Added Slava Schwartsman.
> > > >>>>
> > > >>>> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> I've had NFS over RDMA on my todo list for a very loonnnggg
> > > >>>>> time. I've avoided it because I haven't had a way to test it,
> > > >>>>> but I'm now going to start working on it. (A bunch of this work
> > > >>>>> is already done for NFS-over-TLS which added code for handling
> > > >>>>> M_EXTPG mbufs.)
> > > >>>>>
> > > >>>>> From RFC-8166, there appear to be 4 operations the krpc
> > > >>>>> needs to do:
> > > >>>>> send-rdma - Send on the payload stream (sending messages that
> > > >>>>>             are kept in order).
> > > >>>>> recv-rdma - Receive the above.
> > > >>>>> ddp-write - Do a write of DDP data.
> > > >>>>> ddp-read  - Do a read of DDP data.
> > > >>>>>
> > > >>>>> So, here is how I see the krpc doing this.
> > > >>>>> An NFS write RPC for example:
> > > >>>>> - The NFS client code packages the Write RPC XDR as follows:
> > > >>>>>   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> > > >>>>>     that precede the write data.
> > > >>>>>   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> > > >>>>>   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> > > >>>>>     written.
> > > >>>>>   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> > > >>>>>
> > > >>>>> This would be passed to the krpc which would...
> > > >>>>>   - send the mbufs up to "start of ddp" in the payload stream.
> > > >>>>>   - Would specify a ddp-read for the pages from the M_EXTPG mbufs
> > > >>>>>     and send that in the payload stream.
> > > >>>>>   - send the remaining mbufs/mbuf_clusters in the payload stream
> > > >>>>>
> > > >>>>> The NFS server end would process the received payload stream,
> > > >>>>> putting the non-ddp stuff in mbufs/mbuf_clusters.
> > > >>>>> It would do the ddp-read of the data into anonymous pages it allocates
> > > >>>>> and would associate these with M_EXTPG mbufs.
> > > >>>>> It would put any remaining payload stream stuff for the RPC message in
> > > >>>>> additional mbufs/mbuf_clusters.
> > > >>>>> --> Call the NFS server with the mbuf list for processing.
> > > >>>>>   - When the NFS server gets to the write data (in M_EXTPG mbufs)
> > > >>>>>     it would set up a uio/iovec for the pages and call VOP_WRITE().
> > > >>>>>
> > > >>>>> Now, the above is straightforward for me, since I know the NFS and
> > > >>>>> krpc code fairly well.
> > > >>>>> But that is where my expertise ends.
> > > >>>>>
> > > >>>>> So, what kind of calls do the drivers provide to send and receive
> > > >>>>> what RFC-8166 calls the payload stream?
> > > >>>>>
> > > >>>>> And what kind of calls do the drivers provide to write and read DDP
> > > >>>>> chunks?
> > > >>>>>
> > > >>>>> Also, if the above sounds way off the mark, please let me know.
> > > >>>>
> > > >>>> What you need is, most likely, the infiniband API or KPI to handle
> > > >>>> RDMA. It is driver-independent, same as for ip NFS you use system IP
> > > >>>> stack and not call to ethernet drivers. In fact, most likely the
> > > >>>> transport used would be not native IB, but IB over UDP (RoCE v2).
> > > >>>>
> > > >>>> IB verbs, which is the official interface for both kernel and user mode,
> > > >>>> are not well documented. An overview is provided by the document
> > > >>>> titled "RDMA Aware Networks Programming User Manual", which should
> > > >>>> be google-able. Otherwise, the Infiniband specification is the reference.
> This manual is good at explaining how things work, but the detailed example
> isn't very useful (the verbs it uses aren't in the kernel, etc). It might be
> more useful for userspace library use?

Just fyi, the functions named rdma_XXX() seem to be the ones used to get
things set up and then the ones named ib_XXX() are used for the actual
I/O. (The manual has ones named ibv_XXX(), which don't exist in the kernel
code, afaik.)

rick

>
> The good news is I found a file in the Linux kernel sources which I
> find quite readable (it does rdma for their krpc).
> The really good news is that it is dual licensed, so I think it can
> be pulled into FreeBSD without problems.
> I haven't yet decided if I want to try and keep it mostly intact (so that
> bugfixes can be pulled from Linux for it) or just hack it up to get
> what I want from it. (The Linux krpc, etc. is quite different, so it
> would need a lot of #ifdef FreeBSD in it.)
>
> Anyhow, here is the copyright, to double check this is ok in FreeBSD?
>
> // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> /*
>  * Copyright (c) 2014-2017 Oracle. All rights reserved.
>  * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
>  *
>  * This software is available to you under a choice of one of two
>  * licenses. You may choose to be licensed under the terms of the GNU
>  * General Public License (GPL) Version 2, available from the file
>  * COPYING in the main directory of this source tree, or the BSD-type
>  * license below:
>  *
>  * Redistribution and use in source and binary forms, with or without
>  * modification, are permitted provided that the following conditions
>  * are met:
>  *
>  *      Redistributions of source code must retain the above copyright
>  *      notice, this list of conditions and the following disclaimer.
>  *
>  *      Redistributions in binary form must reproduce the above
>  *      copyright notice, this list of conditions and the following
>  *      disclaimer in the documentation and/or other materials provided
>  *      with the distribution.
>  *
>  *      Neither the name of the Network Appliance, Inc. nor the names of
>  *      its contributors may be used to endorse or promote products
>  *      derived from this software without specific prior written
>  *      permission.
>  *
>  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>  * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>  * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>  * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>  * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>  * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
>  * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
>  * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>  * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
>  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>  */
>
> rick
>
> > > >>> Thanks. I'll look at that. (I notice that the Intel code references something
> > > >>> they call Linux-OpenIB. Hopefully that looks about the same and the
> > > >>> glue needed to support non-Mellanox drivers isn't too difficult?)
> > > >> OpenIB is perhaps the reference to the IB code in Linux kernel proper
> > > >> plus userspace libraries from rdma-core. This is what was forked/grown
> > > >> from OFED.
> > > >>
> > > >> Intel put efforts into the iWARP, which is sort of alternative for RoCEv2.
> > > >> It has RFCs and works over TCP AFAIR, which causes problems for it.
> > > > Heh, heh. I'm trying to avoid the iWARP vs RoCE wars.;-)
> > > > (I did see a Mellanox white paper with graphs showing how RoCE outperforms
> > > > iWARP.)
> > > > Intel currently claims to support RoCE on its 810 and 820 NICs.
> > > > Broadcom also claims to support RoCE, but doesn't mention FreeBSD
> > > > drivers and Chelsio does iWARP, afaik.
> > > >
> > > > For some reason, at the last NFSv4 Bakeathon, Chuck was testing with
> > > > iWARP and not RoCE? (I haven't asked Chuck why he chose that. It
> > > > might just be more convenient to set up the siw driver in Linux vs the
> > > > rxe one? He is the main author of RFC-8166, so he's the NFS-over-RDMA guy.)
> > > >
> > > > But it does look like a fun project for the next year. (I recall jhb@ mentioning
> > > > that NFS-over-TLS wouldn't be easy and it turned out to be a fun
> > > > little project.)
> > >
> > > Konstantin is right though that sys/ofed is Linux OpenIB and has an interface
> > > that should let you do RDMA (both ROCEv2 and iWARP). I'm hoping to use the APIs
> > > in sys/ofed to support NVMe over RDMA (both ROCEv2 and iWARP) at some point as
> > > well.
> > > > rick
> > > >
> > > >>
> > > >>>
> > > >>> Btw, if anyone is interested in taking a more active involvement in this,
> > > >>> they are more than welcome to do so. (I'm going to be starting where I
> > > >>> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
> > > >>> but will probably end up there. I have already had one offer w.r.t. access
> > > >>> to a lab that includes Mellanox hardware, but I don't know if remote
> > > >>> debugging will be practical yet.)
> > > >>>
> > > >>> rick
> > > >>>
> > > >>>>
> > > >>>> The IB implementation for us is still called OFED for historical reasons,
> > > >>>> and it is located in sys/ofed.
> > > >>>>
> > > >>>>>
> > > >>>>> As for testing, I am planning on hacking away at one of the RDMA
> > > >>>>> in software drivers in Linux to get it working well enough to use for
> > > >>>>> testing. Whatever seems to be easiest to get kinda working.
> > > >>>> Yes rxe driver is the sw RoCE v2 implementation. We looked at the
> > > >>>> amount of work to port it. Its size is ~12 kLoC, which is compatible
> > > >>>> with libibverbs (userspace core infiniband interface).
> > >
> > > Interesting. I'm currently working on merging back several OFED commits from
> > > Linux to sys/ofed (currently I have about 30 commits merged, some older than
> > > Hans' last merge, and some newer, some of the newer ones should permit removing
> > > compat stubs for some of the newer APIs that are duplicated in bnxt, irdma, and
> > > mlx*). When I get a bit further along I'll post the branch I have for more
> > > testing (it is a bunch of individual cherry-picks rather than a giant merge).
> > >
> > > Porting over rxe could be useful for me as well for some work I am doing.
> > I have https://github.com/rmacklem/freebsd-rdma. For now, I'll only be doing
> > commits to it for the NFS and krpc files. It will be a while before anything in
> > it is useful for others.
> >
> > I'll email when I get into the rxe port. (If you hurry, you can beat me to it;-)
> >
> > Others are welcome to push/pull on the above. (Email if you need permissions
> > changes. I know diddly about github.)
> >
> > rick
> >
> > >
> > > --
> > > John Baldwin
> > >
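
[Illustrative sketch, not part of the original thread.] The Write RPC
packaging described in the quoted proposal (XDR mbufs, a "start of
ddp-read" marker mbuf, M_EXTPG data mbufs, then trailing XDR) can be
modelled in userspace roughly as below. This is only a toy model in
Python, not kernel code: the names Kind, Seg and plan_write_rpc are
hypothetical, and rejecting M_EXTPG data that has no preceding marker is
an assumption of this sketch, not something the proposal states.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple


class Kind(Enum):
    XDR = auto()        # ordinary mbuf/mbuf_cluster holding XDR bytes
    DDP_START = auto()  # marker mbuf: "start of ddp-read" (e.g. flagged M_PROTO1)
    EXTPG = auto()      # M_EXTPG mbuf whose pages hold the write data


@dataclass
class Seg:
    """One element of the (modelled) mbuf chain handed to the krpc."""
    kind: Kind
    data: bytes = b""


def plan_write_rpc(chain: List[Seg]) -> List[Tuple[str, bytes]]:
    """Split a chain into ordered transport operations.

    XDR bytes are coalesced into "send-rdma" operations on the payload
    stream; the M_EXTPG data following a marker becomes one "ddp-read".
    """
    ops: List[Tuple[str, bytes]] = []
    pending = bytearray()   # XDR bytes not yet emitted
    in_ddp = False          # True while collecting pages after a marker

    for seg in chain:
        if seg.kind is Kind.DDP_START:
            # Flush the XDR that precedes the write data.
            if pending:
                ops.append(("send-rdma", bytes(pending)))
                pending.clear()
            ops.append(("ddp-read", b""))
            in_ddp = True
        elif seg.kind is Kind.EXTPG:
            if not in_ddp:
                # Assumption of this sketch: data pages need a marker first.
                raise ValueError("M_EXTPG data without a start-of-ddp marker")
            op, data = ops[-1]
            ops[-1] = (op, data + seg.data)
        else:
            # Plain XDR ends any DDP region and rejoins the payload stream.
            in_ddp = False
            pending += seg.data

    if pending:
        ops.append(("send-rdma", bytes(pending)))
    return ops


if __name__ == "__main__":
    example = [
        Seg(Kind.XDR, b"args"),
        Seg(Kind.DDP_START),
        Seg(Kind.EXTPG, b"page0"),
        Seg(Kind.EXTPG, b"page1"),
        Seg(Kind.XDR, b"trailer"),
    ]
    for op in plan_write_rpc(example):
        print(op)
```

In this model the three resulting operations mirror the three krpc steps
listed in the proposal: send the leading XDR, describe the data pages as
a ddp-read, then send the remaining XDR. How that maps onto the rdma_XXX()
and ib_XXX() calls in sys/ofed is left open here, as it is in the thread.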