Re: RFC: NFS over RDMA
- Reply: Konstantin Belousov : "Re: RFC: NFS over RDMA"
- In reply to: Konstantin Belousov : "Re: RFC: NFS over RDMA"
Date: Sat, 01 Nov 2025 21:03:59 UTC
On Sat, Nov 1, 2025 at 1:50 PM Konstantin Belousov <kib@freebsd.org> wrote:
>
> Added Slava Schwartsman.
>
> On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> > Hi,
> >
> > I've had NFS over RDMA on my todo list for a very loonnnggg
> > time. I've avoided it because I haven't had a way to test it,
> > but I'm now going to start working on it. (A bunch of this work
> > is already done for NFS-over-TLS, which added code for handling
> > M_EXTPG mbufs.)
> >
> > From RFC-8166, there appear to be 4 operations the krpc
> > needs to do:
> > send-rdma - Send on the payload stream (sending messages that
> >   are kept in order).
> > recv-rdma - Receive the above.
> > ddp-write - Do a write of DDP data.
> > ddp-read - Do a read of DDP data.
> >
> > So, here is how I see the krpc doing this.
> > An NFS write RPC, for example:
> > - The NFS client code packages the Write RPC XDR as follows:
> >   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
> >     that precede the write data.
> >   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
> >   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
> >     written.
> >   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
> >
> > This would be passed to the krpc, which would...
> > - Send the mbufs up to "start of ddp" in the payload stream.
> > - Specify a ddp-read for the pages from the M_EXTPG mbufs
> >   and send that in the payload stream.
> > - Send the remaining mbufs/mbuf_clusters in the payload stream.
> >
> > The NFS server end would process the received payload stream,
> > putting the non-ddp stuff in mbufs/mbuf_clusters.
> > It would do the ddp-read of the data into anonymous pages it allocates
> > and would associate these with M_EXTPG mbufs.
> > It would put any remaining payload stream stuff for the RPC message in
> > additional mbufs/mbuf_clusters.
> > --> Call the NFS server with the mbuf list for processing.
> > - When the NFS server gets to the write data (in M_EXTPG mbufs),
> >   it would set up a uio/iovec for the pages and call VOP_WRITE().
> >
> > Now, the above is straightforward for me, since I know the NFS and
> > krpc code fairly well.
> > But that is where my expertise ends.
> >
> > So, what kind of calls do the drivers provide to send and receive
> > what RFC-8166 calls the payload stream?
> >
> > And what kind of calls do the drivers provide to write and read DDP
> > chunks?
> >
> > Also, if the above sounds way off the mark, please let me know.
>
> What you need is, most likely, the infiniband API or KPI to handle
> RDMA. It is driver-independent, the same as for IP NFS, where you use
> the system IP stack rather than calling the ethernet drivers. In fact,
> the transport used would most likely not be native IB, but IB over UDP
> (RoCE v2).
>
> IB verbs, which are the official interface for both kernel and user mode,
> are not well documented. An overview is provided by the document
> titled "RDMA Aware Networks Programming User Manual", which should
> be google-able. Otherwise, the Infiniband specification is the reference.

Thanks. I'll look at that. (I notice that the Intel code references
something they call Linux-OpenIB. Hopefully that looks about the same,
and the glue needed to support non-Mellanox drivers isn't too difficult?)

Btw, if anyone is interested in taking a more active involvement in this,
they are more than welcome to do so. (I'm going to be starting where I
understand things, in the krpc/nfs. I'm not looking forward to porting
rxe, but will probably end up there.
I have already had one offer w.r.t. access to a lab that includes
Mellanox hardware, but I don't know if remote debugging will be
practical yet.) rick

> The IB implementation for us is still called OFED for historical reasons,
> and it is located in sys/ofed.
>
> > As for testing, I am planning on hacking away at one of the RDMA
> > in software drivers in Linux to get it working well enough to use for
> > testing. Whatever seems to be easiest to get kinda working.
>
> Yes, the rxe driver is the sw RoCE v2 implementation. We looked at the
> amount of work to port it. Its size is ~12 kLoC, which is comparable
> with libibverbs (userspace core infiniband interface).
>
> > Anyhow, any comments would be appreciated, rick
> > ps: I did a bunch of cc's trying to get to the people that might know
> > how the RDMA drivers work and what calls would do the above for
> > them.
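
To make the krpc side of the above a little more concrete, here is a rough
sketch of how the request chain might be split at the "start of ddp-read"
marker. None of this is existing code; the function name is made up, and it
assumes the marker is an empty mbuf flagged with M_PROTO1, as suggested above:

/*
 * Hypothetical sketch (not existing krpc code): split an RPC request
 * mbuf chain at the "start of ddp-read" marker mbuf, assumed here to
 * be an empty mbuf flagged with M_PROTO1.  The caller gets back three
 * pieces: the XDR that precedes the write data (sent on the payload
 * stream), the M_EXTPG mbufs holding the data pages (to become the DDP
 * read chunk), and any trailing XDR.  Pkthdr handling is ignored.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

static void
krpc_split_ddp(struct mbuf *m, struct mbuf **prefix, struct mbuf **ddp,
    struct mbuf **suffix)
{
        struct mbuf *prev, *cur;

        *prefix = m;
        *ddp = *suffix = NULL;

        /* Find the marker mbuf that indicates "start of ddp-read". */
        prev = NULL;
        for (cur = m; cur != NULL; cur = cur->m_next) {
                if ((cur->m_flags & M_PROTO1) != 0)
                        break;
                prev = cur;
        }
        if (cur == NULL)
                return;         /* No DDP chunk in this request. */

        /* Detach the prefix from the marker and free the marker mbuf. */
        if (prev != NULL)
                prev->m_next = NULL;
        else
                *prefix = NULL;
        *ddp = cur->m_next;
        cur->m_next = NULL;
        m_free(cur);

        /* The DDP chunk is the run of M_EXTPG mbufs; the rest is XDR. */
        prev = NULL;
        for (cur = *ddp; cur != NULL && (cur->m_flags & M_EXTPG) != 0;
            cur = cur->m_next)
                prev = cur;
        if (prev != NULL) {
                *suffix = prev->m_next;
                prev->m_next = NULL;
        }
}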
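
And on the server side, once the data has landed in the anonymous pages
behind an M_EXTPG mbuf, building the uio/iovec for VOP_WRITE() might look
something like this (again just a sketch with a made-up function name; it
assumes a direct map, a locked vnode, and a single M_EXTPG mbuf holding the
write data with no header/trailer bytes):

/*
 * Hypothetical sketch: hand the pages of one M_EXTPG mbuf to VOP_WRITE()
 * via a uio/iovec, addressing the pages through the direct map.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <sys/proc.h>
#include <sys/uio.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/pmap.h>

static int
nfsrv_write_extpg(struct vnode *vp, struct ucred *cred, struct mbuf *m,
    off_t off)
{
        struct iovec iv[MBUF_PEXT_MAX_PGS];
        struct uio uio;
        int i, pgoff;

        KASSERT((m->m_flags & M_EXTPG) != 0,
            ("nfsrv_write_extpg: not an M_EXTPG mbuf"));

        /* One iovec entry per page, addressed through the direct map. */
        pgoff = m->m_epg_1st_off;
        for (i = 0; i < m->m_epg_npgs; i++) {
                iv[i].iov_base =
                    (void *)(PHYS_TO_DMAP(m->m_epg_pa[i]) + pgoff);
                iv[i].iov_len = m_epg_pagelen(m, i, pgoff);
                pgoff = 0;
        }

        uio.uio_iov = iv;
        uio.uio_iovcnt = m->m_epg_npgs;
        uio.uio_offset = off;
        uio.uio_resid = m->m_len;       /* no hdr/trailer assumed */
        uio.uio_segflg = UIO_SYSSPACE;
        uio.uio_rw = UIO_WRITE;
        uio.uio_td = curthread;

        return (VOP_WRITE(vp, &uio, IO_UNIT, cred));
}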
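
As for the driver-independent calls, if the verbs KPI in sys/ofed is the
answer, then pulling a read chunk over would presumably be an RDMA READ work
request posted to the queue pair, roughly along these lines. Names follow the
Linux-derived ib_verbs API and may differ in detail; the qp, the local lkey,
and the remote_addr/rkey taken from the RPC-over-RDMA read segment are all
assumed to be set up elsewhere:

/*
 * Hypothetical sketch: post an RDMA READ for one DDP read segment
 * using the kernel verbs KPI.  Completion would be reported on the
 * QP's send CQ and matched up via wr_id.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <rdma/ib_verbs.h>

static int
krpc_post_ddp_read(struct ib_qp *qp, u64 local_addr, u32 length, u32 lkey,
    u64 remote_addr, u32 rkey, u64 wr_id)
{
        struct ib_rdma_wr wr;
        struct ib_sge sge;
        struct ib_send_wr *bad_wr;

        /* Local buffer (e.g. the anonymous pages backing M_EXTPG mbufs). */
        sge.addr = local_addr;
        sge.length = length;
        sge.lkey = lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr.wr_id = wr_id;
        wr.wr.sg_list = &sge;
        wr.wr.num_sge = 1;
        wr.wr.opcode = IB_WR_RDMA_READ;
        wr.wr.send_flags = IB_SEND_SIGNALED;

        /* Remote chunk described by the client's read segment. */
        wr.remote_addr = remote_addr;
        wr.rkey = rkey;

        return (ib_post_send(qp, &wr.wr, &bad_wr));
}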