RFC: NFS over RDMA

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Sat, 01 Nov 2025 20:11:02 UTC
Hi,

I've had NFS over RDMA on my todo list for a very loonnnggg
time. I've avoided it because I haven't had a way to test it,
but I'm now going to start working on it. (A bunch of this work
is already done for NFS-over-TLS which added code for handling
M_EXTPG mbufs.)

From RFC-8166, there appear to be 4 operations the krpc
needs to do:
send-rdma - Send on the payload stream (sending messages that
                    are kept in order).
recv-rdma - Receive the above.
ddp-write - Do a write of DDP data.
ddp-read - Do a read of DDP data.

So, here is how I see the krpc doing this.
An NFS write RPC for example:
- The NFS client code packages the Write RPC XDR as follows:
  - 1 or more mbufs/mbuf_clusters of XDR for the NFS arguments
     that precede the write data.
  - an mbuf that indicates "start of ddp-read". (Maybe use M_PROTO1?)
  - 1 or more M_EXTPG mbufs with page(s) loaded with the data to be
    written.
  - 0 or more mbufs/mbuf_clusters with additional RPC request XDR.

This would be passed to the krpc which would...
 - send the mbufs up to "start of ddp-read" in the payload stream.
 - specify a ddp-read for the pages from the M_EXTPG mbufs
   and send that in the payload stream.
 - send the remaining mbufs/mbuf_clusters in the payload stream

The NFS server end would process the received payload stream,
putting the non-ddp stuff in mbufs/mbuf_clusters.
It would do the ddp-read of the data into anonymous pages it allocates
and would associate these with M_EXTPG mbufs.
It would put any remaining payload stream stuff for the RPC message in
additional mbufs/mbuf_clusters.
--> Call the NFS server with the mbuf list for processing.
     - When the NFS server gets to the write data (in M_EXTPG mbufs)
       it would set up a uio/iovec for the pages and call VOP_WRITE().
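
As a rough illustration of that last step, here is a userland sketch of turning a page-backed buffer into an iovec array. It is not kernel code: `struct pgbuf`, `SK_PAGE_SIZE`, `SK_MAX_PGS` and `pgbuf_to_iov()` are hypothetical, and it pretends the pages are already mapped at ordinary addresses, which a real M_EXTPG mbuf would not give you for free. The only idea borrowed from M_EXTPG is that the first page can start at an offset and the last page can be partly filled.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define SK_PAGE_SIZE	4096	/* assumed page size for the sketch */
#define SK_MAX_PGS	8	/* arbitrary limit for the sketch */

/* Hypothetical stand-in for a page-backed mbuf. */
struct pgbuf {
	void	*pg[SK_MAX_PGS];	/* mapped page addresses */
	int	 npgs;			/* number of pages in use */
	size_t	 first_off;		/* offset of data in first page */
	size_t	 last_len;		/* bytes of data in last page */
};

/*
 * Fill iov[] with one entry per page and return the total byte count.
 * The caller must supply room for at least pb->npgs entries.
 */
static size_t
pgbuf_to_iov(const struct pgbuf *pb, struct iovec *iov, int *iovcnt)
{
	size_t total = 0;
	int i;

	assert(pb->npgs > 0 && pb->npgs <= SK_MAX_PGS);
	for (i = 0; i < pb->npgs; i++) {
		/* Only the first page starts at a non-zero offset. */
		size_t off = (i == 0) ? pb->first_off : 0;
		/* Only the last page may be short. */
		size_t len = (i == pb->npgs - 1) ?
		    pb->last_len : SK_PAGE_SIZE - off;

		iov[i].iov_base = (char *)pb->pg[i] + off;
		iov[i].iov_len = len;
		total += len;
	}
	*iovcnt = pb->npgs;
	return (total);
}
```

The resulting array and total are what would be loaded into the uio's iovec pointer, iovec count and residual count before the VOP_WRITE() call.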

Now, the above is straightforward for me, since I know the NFS and
krpc code fairly well.
But that is where my expertise ends.

So, what kind of calls do the drivers provide to send and receive
what RFC-8166 calls the payload stream?

And what kind of calls do the drivers provide to write and read DDP
chunks?

Also, if the above sounds way off the mark, please let me know.

As for testing, I am planning on hacking away at one of the RDMA
in software drivers in Linux to get it working well enough to use for
testing. Whatever seems to be easiest to get kinda working.

Anyhow, any comments would be appreciated, rick
ps: I did a bunch of cc's trying to get to the people that might know
      how the RDMA drivers work and what calls would do the above for
      them.