Re: 100Gb performance

From: Rodney W. Grimes <freebsd-rwg_at_gndrsh.dnsmgr.net>
Date: Thu, 26 Jun 2025 03:41:20 UTC
> On Tue, Jun 24, 2025 at 4:39 PM Rodney W. Grimes
> <freebsd-rwg@gndrsh.dnsmgr.net> wrote:
> >
> > > On Tue, Jun 24, 2025 at 4:45 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > >
> > > >
> > > > Here's how I'd configure a client (assuming it's a fairly beefy system):
> > > > In /boot/loader.conf:
> > > > vfs.maxbcachebuf=1048576
> > > >
> > > > In /etc/sysctl.conf:
> > > > kern.ipc.maxsockbuf=47370024 (or larger)
> > > > vfs.nfs.iodmax=64
> > > >
> > > > Then I'd use these mount options (along with whatever you normally use,
> > > > except don't specify rsize, wsize since it should use whatever the server
> > > > supports):
> > > > nconnect=8,nocto,readahead=8,wcommitsize=67108864 (or larger)
> > > >
> > > > To test write rate, I'd:
> > > > # dd if=/dev/zero of=<file on mount> bs=1M count=10240
> > > > for reading
> > > > # dd if=<file on mount> of=/dev/null bs=1M
> > > > (but umount/mount between the two "dd"s, so nothing is cached
> > > > in the client's buffer cache)
> > > >
> > > > If you are stuck at 1.2Gbytes/sec, there's some bottleneck, but
> > > > I can't say where.
> > > >
> > > > rick
> > > > ps: The newnfs threads do write-behind and read-ahead, so there
> > > >      is some parallelism for the "dd".
> > > >
> > > >
> > > Hi,
> > >
> > > Ok, let's try all those parameters (running June 2025 stableweek):
> > >
> > > On server and client, /etc/sysctl.conf configured with a:
> > > kern.ipc.maxsockbuf=33554432
> > > net.inet.tcp.recvbuf_max=33554432
> > > net.inet.tcp.sendbuf_max=33554432
> > > net.inet.tcp.recvspace=1048576
> > > net.inet.tcp.sendspace=524288
> > > vfs.nfs.iodmax=64
> >
> > I suggested doubling, or quadrupling, the defaults of 2MB,
> > why did you only try 1.5 times?
> I think he increased it to 32Mbytes and not 3?

You're correct, all those digits seem to have confused my parsing.
I was, however, interested in the data points that doing 1, 2, 4, 8x
would have given us.

I am also not sure if the recvspace is large enough.
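Just to make that concrete, the sweep I had in mind would be something
along these lines on both ends (2097152 is the stock kern.ipc.maxsockbuf
default, so these are the 1x, 2x, 4x and 8x points; treat the values as
a sketch, not tuned numbers):

    sysctl kern.ipc.maxsockbuf=2097152    # 1x the 2MB default
    sysctl kern.ipc.maxsockbuf=4194304    # 2x
    sysctl kern.ipc.maxsockbuf=8388608    # 4x
    sysctl kern.ipc.maxsockbuf=16777216   # 8x

re-running the iperf3 and dd tests after each bump.  I would also raise
net.inet.tcp.recvspace in step with it, so the initial receive window
is not what limits things instead.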

> 
> Here's my way of guesstimating it...
> - For a 1msec transit time (it's probably less):
>   100Mbits/msec * 1msec = 100Mbits, and 100Mbits / 8 = 12.5Mbytes
>   - This is what it takes to fill the bit pipe.
>   - It takes the same time for an ACK to transit in the
>     opposite direction, so double it for that.
> Put another way, get the RTT via ping and then:
> 100Mbits/msec * RTT(msec) / 8
> However, this doesn't account for delay in the server
> processing the TCP segment and sending an ACK,
> so I'd bump the above up by a bunch.
> 
> Does the above look reasonable? rick

Reasonable, except that your 1msec is probably far
too long for modern connections at 1Gb/s and above.
Even 1Gb/s ethernet has a <400uS ping (so RTT) when
doing 1500 byte packets, and <200uS when doing minimal
length packets.
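To put some numbers on that using Rick's formula (these RTTs are just
the 1Gb/s figures above; the 100Gb/s link will likely measure lower,
so the actual ping output is still the number to get):

    100Mbits/msec * 0.4msec / 8 = 5Mbytes
    100Mbits/msec * 0.2msec / 8 = 2.5Mbytes

so the raw bandwidth-delay product comes out closer to a few megabytes
than the ~12.5Mbytes the 1msec assumption gives, even before padding it
for the server's ACK turnaround.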

> 
> >
> > >
> > > Server side:
> > > nfs_server_enable="YES"
> > > nfsv4_server_enable="YES"
> > > nfsv4_server_only="YES"
> > > nfs_server_maxio="1048576"
> > > With correctly applied sysctl:
> > > root@server:~ # sysctl vfs.nfsd.srvmaxio
> > > vfs.nfsd.srvmaxio: 1048576
> > > root@server:~ # sysctl vfs.nfs.iodmax
> > > vfs.nfs.iodmax: 64
> > >
> > > First, just measuring the server disk speed to be used as a reference:
> > > root@server:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> > > 20480+0 records in
> > > 20480+0 records out
> > > 21474836480 bytes transferred in 3.477100 secs (6176076082 bytes/sec)
> > > root@server:~ # units -t '6176076082 bytes' gigabit
> > > 49.408609
> > >
> > > So here, reaching about 40Gb/s with NFS will be the target.
> > >
> > > But before the NFS test, a simple iperf3 test between client and server
> > > with 16 sessions (same as with nconnect):
> > > root@client:~ # iperf3 -c 1.1.1.30 --parallel 16
> >                                      ^^^^^^^^^^^^^^
> > > [SUM]   0.00-10.00  sec  99.1 GBytes  85.1 Gbits/sec  81693  sender
> > >
> > > The 100Gb/s link is in place and seems to be working fine with iperf3.
> >
> > I am going to assume you kept cranking up the parallel count until
> > you reached what you feel to be "working fine".  I would be very interested
> > in the data at 1, 2, 4, 8 and 16, and not just the final number
> > but the actual "test data" as output during the run of iperf3.
> > Especially of value is what the window sizes look like.
> >
> > Also what is the ping time between the client and server?
> >         ping -q -s 1500 -c 10 servername
> >
> > >
> > > On the client side, the NFS test now:
> > > root@client:~ # mount -t nfs -o
> > > noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> > > /tmp/nfs/
> > > root@client:~ # nfsstat -m
> > > 1.1.1.30:/nfs on /tmp/nfs
> > > nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
> > >
> > > => Notice here that negotiated rsize and wsize haven't improved since the
> > > bump of vfs.nfsd.srvmaxio on server side. Shouldn't those values be a lot
> > > bigger at this stage?
> > >
> > > root@client:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> > > 20480+0 records in
> > > 20480+0 records out
> > > 21474836480 bytes transferred in 9.591257 secs (2239001240 bytes/sec)
> > > root@client:~ # units -t '2239001240 bytes' gigabit
> > > 17.91201
> > > root@client:~ # umount /tmp/nfs/
> > > root@client:~ # mount -t nfs -o
> > > noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> > > /tmp/nfs/
> > > root@client:~ # dd of=/dev/zero if=/tmp/nfs/data bs=1M count=20480
> > > 20480+0 records in
> > > 20480+0 records out
> > > 21474836480 bytes transferred in 6.900937 secs (3111872643 bytes/sec)
> > > root@client:~ # units -t '3111872643 bytes' gigabit
> > > 24.894981
> > >
> > > So with NFS I'm able to read at about 25Gb/s and write at 18Gb/s.
> > >
> > > The output of a "pmcstat -TS cpu_clk_unhalted.thread_p -w1" on the client
> > > during this test shows a high level of invlop_handler:
> > >
> > > PMC: [cpu_clk_unhalted.thread_p] Samples: 9730 (100.0%) , 0 unresolved
> > >
> > > %SAMP IMAGE      FUNCTION                       CALLERS
> > >  31.2 kernel     invlop_handler
> > >  24.9 kernel     cpu_idle                       sched_idletd
> > >  11.4 kernel     Xinvlop
> > >   1.8 kernel     copyin_smap_erms               uiomove_faultflag
> > >   1.8 kernel     memmove_erms                   nfsm_uiombuf
> > >   1.5 kernel     cpu_search_highest             cpu_search_highest
> > >   1.3 kernel     mb_free_ext                    m_free
> > >
> > > And on the server:
> > >
> > > PMC: [cpu_clk_unhalted.thread_p] Samples: 4093 (100.0%) , 0 unresolved
> > >
> > > %SAMP IMAGE      FUNCTION                       CALLERS
> > >   7.8 zfs.ko     abd_cmp_zero_off_cb            abd_iterate_func
> > >   7.7 kernel     memmove_erms                   uiomove_faultflag
> > >   4.9 kernel     cpu_idle                       sched_idletd
> > >   4.8 kernel     mlx5e_rx_cq_comp               mlx5_cq_completion
> > >   3.4 kernel     cpu_search_highest             cpu_search_highest
> > >   3.4 kernel     memset_erms                    dbuf_read
> > >   3.0 kernel     mb_ctor_pack                   uma_zalloc_arg
> > >   2.6 kernel     soreceive_generic_locked       soreceive_generic
> > >   2.2 kernel     lock_delay                     dbuf_find
> > >
> > > Regards,
> > > Olivier
> >
> > --
> > Rod Grimes                                                 rgrimes@freebsd.org
> 
> 

-- 
Rod Grimes                                                 rgrimes@freebsd.org