Re: RFC: NFS over RDMA
- Reply: Rick Macklem : "Re: RFC: NFS over RDMA"
- In reply to: Rick Macklem : "Re: RFC: NFS over RDMA"
Date: Sat, 01 Nov 2025 21:09:51 UTC
On Sat, Nov 01, 2025 at 02:03:59PM -0700, Rick Macklem wrote:
> On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
> >
> > Added Slava Schwartsman.
> >
> > On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> > > Hi,
> > >
> > > I've had NFS over RDMA on my todo list for a very loonnnggg
> > > time. I've avoided it because I haven't had a way to test it,
> > > but I'm now going to start working on it. (A bunch of this work
> > > is already done for NFS-over-TLS, which added code for handling
> > > M_EXTPG mbufs.)
> > >
> > > From RFC-8166, there appear to be 4 operations the krpc
> > > needs to do:
> > > send-rdma - Send on the payload stream (sending messages that
> > > are kept in order).
> > > recv-rdma - Receive the above.
> > > ddp-write - Do a write of DDP data.
> > > ddp-read - Do a read of DDP data.
> > >
> > > So, here is how I see the krpc doing this.
> > > An NFS write RPC, for example:
> > > - The NFS client code packages the Write RPC XDR as follows:
> > >   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> > >     that precede the write data.
> > >   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> > >   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> > >     written.
> > >   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> > >
> > > This would be passed to the krpc, which would:
> > > - send the mbufs up to "start of ddp" in the payload stream.
> > > - specify a ddp-read for the pages from the M_EXTPG mbufs
> > >   and send that in the payload stream.
> > > - send the remaining mbufs/mbuf_clusters in the payload stream.
> > >
> > > The NFS server end would process the received payload stream,
> > > putting the non-ddp stuff in mbufs/mbuf_clusters.
> > > It would do the ddp-read of the data into anonymous pages it allocates
> > > and would associate these with M_EXTPG mbufs.
> > > It would put any remaining payload stream stuff for the RPC message in
> > > additional mbufs/mbuf_clusters.
> > > --> Call the NFS server with the mbuf list for processing.
> > > - When the NFS server gets to the write data (in M_EXTPG mbufs),
> > >   it would set up a uio/iovec for the pages and call VOP_WRITE().
> > >
> > > Now, the above is straightforward for me, since I know the NFS and
> > > krpc code fairly well.
> > > But that is where my expertise ends.
> > >
> > > So, what kind of calls do the drivers provide to send and receive
> > > what RFC-8166 calls the payload stream?
> > >
> > > And what kind of calls do the drivers provide to write and read DDP
> > > chunks?
> > >
> > > Also, if the above sounds way off the mark, please let me know.
> >
> > What you need is, most likely, the InfiniBand API or KPI to handle
> > RDMA. It is driver-independent: just as for NFS over IP, you use the
> > system IP stack and do not call into the ethernet drivers. In fact,
> > most likely the transport used would not be native IB, but IB over
> > UDP (RoCE v2).
> >
> > IB verbs, which are the official interface for both kernel and user
> > mode, are not well documented. An overview is provided by the document
> > titled "RDMA Aware Networks Programming User Manual", which should
> > be google-able. Otherwise, the InfiniBand specification is the reference.
> Thanks. I'll look at that. (I notice that the Intel code references something
> they call Linux-OpenIB. Hopefully that looks about the same and the
> glue needed to support non-Mellanox drivers isn't too difficult?)
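For what the verbs KPI mentioned above looks like in practice, here is a rough,
untested sketch of the ddp-read side: the server pulling the client's read
chunk with an RDMA READ posted via the kernel verbs under sys/ofed (they
mirror the Linux ib_* interface). The helper name ddp_read_chunk() is made up,
the qp, the DMA-mapped local pages (laddr/lkey) and the remote_addr/rkey
decoded from the RPC-over-RDMA header are assumed to already exist, and
details such as the const-ness of ib_post_send()'s bad_wr argument differ
between verbs versions.

/*
 * Rough sketch only: pull a DDP "read chunk" from the client with an
 * RDMA READ on the connection's queue pair.  Memory registration and
 * completion handling are omitted.
 */
#include <rdma/ib_verbs.h>

static int
ddp_read_chunk(struct ib_qp *qp, u64 laddr, u32 lkey, u32 len,
    u64 remote_addr, u32 rkey)
{
	struct ib_rdma_wr rwr;
	struct ib_sge sge;
	const struct ib_send_wr *bad_wr;

	memset(&rwr, 0, sizeof(rwr));
	memset(&sge, 0, sizeof(sge));

	/* Local destination: the anonymous page(s) backing the M_EXTPG mbufs. */
	sge.addr = laddr;
	sge.length = len;
	sge.lkey = lkey;

	rwr.wr.opcode = IB_WR_RDMA_READ;
	rwr.wr.sg_list = &sge;
	rwr.wr.num_sge = 1;
	rwr.wr.send_flags = IB_SEND_SIGNALED;	/* ask for a completion */

	/* Remote source: the chunk the client advertised in the RPC header. */
	rwr.remote_addr = remote_addr;
	rwr.rkey = rkey;

	/* Note: older verbs versions take a non-const bad_wr here. */
	return (ib_post_send(qp, &rwr.wr, &bad_wr));
}

A ddp-write (the server pushing read data back to the client) would be the
same shape with IB_WR_RDMA_WRITE, and the payload stream itself would be
ordinary IB_WR_SEND/receive work requests on the same queue pair.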
OpenIB probably refers to the IB code in the Linux kernel proper plus the
userspace libraries from rdma-core; this is what was forked/grown from OFED.
Intel put its effort into iWARP, which is a sort of alternative to RoCE v2.
It is specified in RFCs and works over TCP AFAIR, which causes problems for it.

> Btw, if anyone is interested in taking a more active involvement in this,
> they are more than welcome to do so. (I'm going to be starting where I
> understand things in the krpc/nfs. I'm not looking forward to porting rxe,
> but will probably end up there. I have already had one offer w.r.t. access
> to a lab that includes Mellanox hardware, but I don't know if remote
> debugging will be practical yet.)
>
> rick
>
> >
> > The IB implementation for us is still called OFED for historical reasons,
> > and it is located in sys/ofed.
> >
> > >
> > > As for testing, I am planning on hacking away at one of the RDMA
> > > in software drivers in Linux to get it working well enough to use for
> > > testing. Whatever seems to be easiest to get kinda working.
> > Yes, the rxe driver is the sw RoCE v2 implementation. We looked at the
> > amount of work to port it. Its size is ~12 kLoC, which is compatible
> > with libibverbs (the userspace core InfiniBand interface).
> >
> > >
> > > Anyhow, any comments would be appreciated, rick
> > > ps: I did a bunch of cc's trying to get to the people that might know
> > > how the RDMA drivers work and what calls would do the above for
> > > them.
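To make the "start of ddp-read" marker idea from the outline at the top of
the thread concrete, here is a rough sketch of the krpc side. The helper
names are made up for illustration; only m_get(), m_free() and the
M_PROTO1/M_EXTPG mbuf flags are existing KPI, and the chain is assumed to
always carry at least one mbuf of RPC header XDR before the marker.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * Build the "start of ddp-read" marker: a zero-length mbuf with M_PROTO1
 * set.  The NFS client links it in front of the M_EXTPG mbufs holding the
 * write data.
 */
static struct mbuf *
krpc_ddp_marker(void)
{
	struct mbuf *m;

	m = m_get(M_WAITOK, MT_DATA);
	m->m_len = 0;			/* carries no bytes, only the flag */
	m->m_flags |= M_PROTO1;		/* "DDP data starts at m->m_next" */
	return (m);
}

/*
 * On the transport side, split the chain at the marker: everything before
 * it stays in "mhead" and goes out inline in the payload stream, while the
 * M_EXTPG mbufs after it are returned in *ddpp for the ddp-read setup.
 */
static void
krpc_split_at_marker(struct mbuf *mhead, struct mbuf **ddpp)
{
	struct mbuf *m;

	*ddpp = NULL;
	for (m = mhead; m->m_next != NULL; m = m->m_next) {
		if ((m->m_next->m_flags & M_PROTO1) != 0) {
			*ddpp = m->m_next->m_next;	/* first M_EXTPG mbuf */
			m_free(m->m_next);		/* drop the marker */
			m->m_next = NULL;		/* end the inline part */
			break;
		}
	}
}

The transport would then hand mhead to the send-rdma path and build the read
chunk list advertised in the RPC-over-RDMA header from the pages behind the
M_EXTPG mbufs returned in *ddpp.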