Re: 100Gb performance

From: Mark Saad <nonesuch_at_longcount.org>
Date: Mon, 23 Jun 2025 14:36:32 UTC
On Sat, Jun 21, 2025 at 9:31 PM Rodney W. Grimes <freebsd-rwg@gndrsh.dnsmgr.net> wrote:

> Have the TCP inflight buffers been tuned for the BDP of this link?
>
> When calculating delay, do not just assume the one-way link delay;
> what you really need is the delay until the sender gets around to
> processing an ack.
>
> I would be interested in the values of the following, and
> the effect that doubling or quadrupling them has on the
> performance.  (I believe I have shown the defaults)
>
> kern.ipc.maxsockbuf=2097152
> net.inet.tcp.recvbuf_max=2097152
> net.inet.tcp.sendbuf_max=2097152
> net.inet.tcp.recvspace=65536
> net.inet.tcp.sendspace=32768
>
> I understand you can't modify the server side, but I think you
> can extract the values being used there.
>
> What is the resulting Window Scale being used on the TCP three-way handshake?
>
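(A rough sketch of the buffer math behind this, assuming a 100Gb link and an
illustrative 1 ms delay-until-ack; the actual RTT on this setup is not known
here:)

```
# Bandwidth-delay product, back of the envelope (illustrative numbers only):
#   100 Gb/s ~= 12.5 GB/s; with ~1 ms until the ack is processed:
#   BDP ~= 12.5 GB/s * 0.001 s ~= 12.5 MB per TCP connection
# The 2 MB defaults above cap a single stream well below that, so the
# socket-buffer limits likely need to grow on both ends, e.g.:
sysctl kern.ipc.maxsockbuf=16777216
sysctl net.inet.tcp.recvbuf_max=16777216
sysctl net.inet.tcp.sendbuf_max=16777216
```
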
> > On Thu, Jun 19, 2025 at 2:34 PM Olivier Cochard-Labbé
> > <olivier@freebsd.org> wrote:
> > >
> > >
> > >
> > > On Thu, Jun 19, 2025 at 4:31 PM Rick Macklem <rick.macklem@gmail.com>
> wrote:
> > >
> > >>
> > >> There is the "nconnect" mount option. It might help here.
> > >>
> > >
> > > Interesting!
> > >
> > > Let's try:
> > >
> > > Server side:
> > > ```
> > > mkdir /tmp/nfs
> > > mount -t tmpfs tmpfs /tmp/nfs
> > > chmod 777 /tmp/nfs/
> > > cat > /etc/exports <<EOF
> > > V4: /tmp
> > > /tmp/nfs -network 1.1.1.0/24
> > > EOF
> > > sysrc nfs_server_enable=YES
> > > sysrc nfsv4_server_enable=YES
> > > sysrc nfsv4_server_only=YES
> > > service nfsd start
> > > ```
> > >
> > > Client side:
> > > ```
> > > mkdir /tmp/nfs
> > > sysrc nfs_client_enable=YES
> > > service nfsclient start
> > > ```
> > >
> > > Now testing standard speed:
> > > ```
> > > # mount -t nfs -o noatime,nfsv4 1.1.1.30:/nfs /tmp/nfs/
> > > # netstat -an -f inet -p tcp | grep 2049 | wc -l
> > >        1
> > > # dd if=/dev/zero of=/tmp/nfs/test bs=1G count=10
> > > 10+0 records in
> > > 10+0 records out
> > > 10737418240 bytes transferred in 8.526794 secs (1259256159 bytes/sec)
> > > # rm /tmp/nfs/test
> > > # umount /tmp/nfs
> > > ```
> > >
> > > And with nconnect=16:
> > > ```
> > > # mount -t nfs -o noatime,nfsv4,nconnect=16 1.1.1.30:/nfs /tmp/nfs/
> > > # dd if=/dev/zero of=/tmp/nfs/test bs=1G count=10
> > > 10+0 records in
> > > 10+0 records out
> > > 10737418240 bytes transferred in 8.633871 secs (1243638980 bytes/sec)
> > > # rm /tmp/nfs/test
> > > # netstat -an -f inet -p tcp | grep 2049 | wc -l
> > >       16
> > > ```
> > >
> > > => No difference here, but 16 output queues were correctly used with
> > > nconnect=16.
> > > How is load-sharing done with NFS nconnect?
> > > I've tested with benchmarks/fio using parallel jobs and I don't see
> > > any improvement either.
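
(One crude way to see how the Write traffic is spread over the extra
connections while dd runs; just a sketch, and mce0 is a placeholder for the
actual 100Gb interface name:)

```
# Count large packets heading to port 2049 per client source port; an even
# spread across 16 source ports would suggest the connections are being
# round-robined.  mce0 is a placeholder for the real interface.
tcpdump -ni mce0 -c 2000 'dst port 2049 and greater 1000' \
  | awk '{print $3}' | sort | uniq -c | sort -rn
```
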
> > Here are a few other things you can try:
> > On the server:
> > - add nfs_server_maxio=1048576 to /etc/rc.conf.
> >
> > On the client:
> > - put vfs.maxbcachebuf=1048576 in /boot/loader.conf
> > - use "wcommitsize=<some large value>" as an additional mount option.
> >
> > On both client and server, bump kern.ipc.maxsockbuf up a bunch.
> >
> > Once you do the mount do
> > # nfsstat -m
> > on the client and you should see the rsize/wsize set to 1048576
> > and a large value for wcommitsize.
> >
> > For reading, you should also use "readahead=8" as a mount option.
> >
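(Putting Rick's suggestions together, roughly what I would try; the
wcommitsize and maxsockbuf values below are only illustrative guesses, not
numbers from the thread:)

```
# Server:
sysrc nfs_server_maxio=1048576
service nfsd restart
sysctl kern.ipc.maxsockbuf=16777216      # bump this on both ends

# Client: put vfs.maxbcachebuf=1048576 in /boot/loader.conf, reboot, then:
sysctl kern.ipc.maxsockbuf=16777216
mount -t nfs -o noatime,nfsv4,nconnect=16,readahead=8,wcommitsize=16777216 \
  1.1.1.30:/nfs /tmp/nfs/
nfsstat -m    # should show rsize/wsize of 1048576 and the large wcommitsize
```
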
> > Also, if you can turn down (or turn off) interrupt moderation on the
> > NIC driver, try that. (Interrupt moderation is great for data streaming
> > in one direction but is not so good for NFS, which consists of bidirectional
> > traffic of mostly small RPC messages. Every Write gets a small reply message
> > in the server->client direction to complete the Write, and delaying the
> > processing of these small received messages will slow NFS down.)
> >
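(The interrupt-moderation knobs are driver-specific, so this is only a generic
way to find what a given NIC driver exposes; no particular sysctl name is
implied:)

```
# List whatever coalescing / interrupt-rate controls the driver exposes;
# names vary from driver to driver, so grep rather than guess.
sysctl dev | grep -iE 'coalesce|moderation|itr|interrupt_rate'
```
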
> > rick
> >
> > >
> > > Regards,
> > > Olivier
> >
> >
> >
>
> --
> Rod Grimes
> rgrimes@freebsd.org
>

Ok, I was thinking about this over the weekend, and I suspect we would still
need concurrency in whatever is running on top of the NFS mount for nconnect
to help. So what I think would be a good test is a make -j4 buildworld in
NFS-mounted /usr/src and /usr/obj, with the mounts using nconnect=4.
That should make the 4 make jobs run concurrently, which in turn should use
the 4 NFS connections (I am not sure what to call them), which should then
work out to 4 NIC queues servicing them, hopefully running on 4 free cores
in an 8-core box once the scheduler does its thing.

Time that, and monitor the network traffic for the duration of the build.
Then repeat the whole thing using mounts that do not use nconnect.
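
(Roughly what I have in mind, as a sketch; the /src and /obj exports and the
mount options mirror the earlier example and are placeholders for whatever the
real setup exports:)

```
# nconnect run (export paths are placeholders):
mount -t nfs -o noatime,nfsv4,nconnect=4 1.1.1.30:/src /usr/src
mount -t nfs -o noatime,nfsv4,nconnect=4 1.1.1.30:/obj /usr/obj
time make -j4 -C /usr/src buildworld
# while it runs, watch queues and cores with e.g.
#   systat -ifstat, top -P, and netstat -an -f inet -p tcp | grep 2049
umount /usr/obj /usr/src

# Baseline run: remount without nconnect, repeat the same make -j4,
# and compare wall-clock time and traffic.
```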

I have access to some 25G hardware, but only with VMs, so I can try this,
though I am not exactly sure it would be an apples-to-apples comparison.

Thoughts ?

-- 
mark saad | nonesuch@longcount.org