Re: RFC: NFS over RDMA
Date: Sat, 01 Nov 2025 20:49:54 UTC
Added Slava Schwartsman.

On Sat, Nov 01, 2025 at 01:11:02PM -0700, Rick Macklem wrote:
> Hi,
>
> I've had NFS over RDMA on my todo list for a very loonnnggg
> time. I've avoided it because I haven't had a way to test it,
> but I'm now going to start working on it. (A bunch of this work
> is already done for NFS-over-TLS which added code for handling
> M_EXTPG mbufs.)
>
> From RFC-8166, there appear to be 4 operations the krpc
> needs to do:
> send-rdma - Send on the payload stream (sending messages that
>             are kept in order).
> recv-rdma - Receive the above.
> ddp-write - Do a write of DDP data.
> ddp-read  - Do a read of DDP data.
>
> So, here is how I see the krpc doing this.
> An NFS write RPC for example:
> - The NFS client code packages the Write RPC XDR as follows:
>   - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
>     that precede the write data.
>   - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
>   - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
>     written.
>   - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.
>
> This would be passed to the krpc which would...
> - send the mbufs up to "start of ddp" in the payload stream.
> - Would specify a ddp-read for the pages from the M_EXTPG mbufs
>   and send that in the payload stream.
> - send the remaining mbufs/mbuf_clusters in the payload stream
>
> The NFS server end would process the received payload stream,
> putting the non-ddp stuff in mbufs/mbuf_clusters.
> It would do the ddp-read of the data into anonymous pages it allocates
> and would associate these with M_EXTPG mbufs.
> It would put any remaining payload stream stuff for the RPC message in
> additional mbufs/mbuf_clusters.
> --> Call the NFS server with the mbuf list for processing.
> - When the NFS server gets to the write data (in M_EXTPG mbufs)
>   it would set up a uio/iovec for the pages and call VOP_WRITE().
>
> Now, the above is straightforward for me, since I know the NFS and
> krpc code fairly well.
> But that is where my expertise ends.
>
> So, what kind of calls do the drivers provide to send and receive
> what RFC-8166 calls the payload stream?
>
> And what kind of calls do the drivers provide to write and read DDP
> chunks?
>
> Also, if the above sounds way off the mark, please let me know.

What you need is, most likely, the InfiniBand API or KPI to handle RDMA.
It is driver-independent, in the same way that for NFS over IP you use
the system IP stack and do not call into the Ethernet drivers. In fact,
the transport used would most likely be not native IB, but IB over UDP
(RoCE v2).

IB verbs, which are the official interface for both kernel and user
mode, are not well documented. An overview is provided by the document
titled "RDMA Aware Networks Programming User Manual", which should be
google-able. Otherwise, the InfiniBand specification is the reference.

The IB implementation for us is still called OFED for historical
reasons, and it is located in sys/ofed.

>
> As for testing, I am planning on hacking away at one of the RDMA
> in software drivers in Linux to get it working well enough to use for
> testing. Whatever seems to be easiest to get kinda working.

Yes, the rxe driver is the software RoCE v2 implementation. We looked
at the amount of work to port it. Its size is ~12 kLoC, which is
compatible with libibverbs (userspace core infiniband interface).

>
> Anyhow, any comments would be appreciated, rick
> ps: I did a bunch of cc's trying to get to the people that might know
> how the RDMA drivers work and what calls would do the above for
> them.
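
To make the quoted send-side split concrete, here is a minimal userland sketch, assuming a simplified stand-in for the mbuf chain. The names struct seg, enum seg_kind, struct rpc_plan and plan_rpc() are hypothetical and exist neither in the krpc nor in sys/ofed; real code would walk struct mbuf and test M_EXTPG plus whatever marker flag is eventually chosen. It only computes how many bytes would go inline before the DDP data, how many would be offered for ddp-read, and how many would follow inline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one mbuf in the request chain. */
enum seg_kind {
	SEG_XDR,	/* ordinary mbuf/mbuf_cluster holding XDR */
	SEG_DDP_MARK,	/* the "start of ddp-read" marker mbuf */
	SEG_EXTPG	/* M_EXTPG-style mbuf holding data pages */
};

struct seg {
	enum seg_kind	 kind;
	size_t		 len;
	struct seg	*next;
};

/* Result of splitting one RPC request for the transport. */
struct rpc_plan {
	size_t	head_len;	/* inline bytes before the DDP data */
	size_t	ddp_len;	/* bytes to be offered for ddp-read */
	size_t	tail_len;	/* inline bytes after the DDP data */
	int	has_ddp;	/* non-zero if a marker was seen */
};

/*
 * Walk the chain once and fill in the plan.
 * Returns 0 on success, -1 if the chain does not follow the
 * head / marker / pages / tail layout assumed here.
 */
static int
plan_rpc(const struct seg *s, struct rpc_plan *p)
{
	enum { HEAD, DDP, TAIL } state = HEAD;

	assert(p != NULL);
	p->head_len = p->ddp_len = p->tail_len = 0;
	p->has_ddp = 0;

	for (; s != NULL; s = s->next) {
		switch (s->kind) {
		case SEG_DDP_MARK:
			if (state != HEAD)
				return (-1);	/* at most one marker */
			state = DDP;
			p->has_ddp = 1;
			break;
		case SEG_EXTPG:
			if (state != DDP)
				return (-1);	/* pages only after marker */
			p->ddp_len += s->len;
			break;
		case SEG_XDR:
			if (state == HEAD) {
				p->head_len += s->len;
			} else {
				if (p->ddp_len == 0)
					return (-1);	/* marker without pages */
				state = TAIL;
				p->tail_len += s->len;
			}
			break;
		}
	}
	if (state == DDP && p->ddp_len == 0)
		return (-1);	/* marker at end of chain */
	return (0);
}
```

For a chain of 120 bytes of XDR, a marker, two 4096-byte page mbufs and 8 bytes of trailing XDR, plan_rpc() yields head_len 120, ddp_len 8192 and tail_len 8. A single pass keeps the marker and page checks in one place; how the resulting lengths map onto actual verbs calls is left open here.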