Re: 100Gb performance

From: Rodney W. Grimes <freebsd-rwg_at_gndrsh.dnsmgr.net>
Date: Tue, 24 Jun 2025 23:38:57 UTC
> On Tue, Jun 24, 2025 at 4:45 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> 
> >
> > Here's how I'd configure a client (assuming it's a fairly beefy system):
> > In /boot/loader.conf:
> > vfs.maxbcachebuf=1048576
> >
> > In /etc/sysctl.conf:
> > kern.ipc.maxsockbuf=47370024 (or larger)
> > vfs.nfs.iodmax=64
> >
> > Then I'd use these mount options (along with whatever you normally use,
> > except don't specify rsize, wsize since it should use whatever the server
> > supports):
> > nconnect=8,nocto,readahead=8,wcommitsize=67108864 (or larger)
> >
> > To test write rate, I'd:
> > # dd if=/dev/zero of=<file on mount> bs=1M count=10240
> > for reading
> > # dd if=<file on mount> of=/dev/null bs=1M
> > (but umount/mount between the two "dd"s, so nothing is cached
> > in the client's buffer cache)
> >
> > If you are stuck at 1.2Gbytes/sec, there's some bottleneck, but
> > I can't say where.
> >
> > rick
> > ps: The newnfs threads do write-behind and read-ahead, so there
> >      is some parallelism for the "dd".
> >
> >
> Hi,
> 
> Ok, let's try all those parameters (running a June 2025 stable weekly snapshot):
> 
> On server and client, /etc/sysctl.conf configured with a:
> kern.ipc.maxsockbuf=33554432
> net.inet.tcp.recvbuf_max=33554432
> net.inet.tcp.sendbuf_max=33554432
> net.inet.tcp.recvspace=1048576
> net.inet.tcp.sendspace=524288
> vfs.nfs.iodmax=64

I suggested doubling or quadrupling the 2MB defaults;
why did you only try 1.5 times?
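For reference, quadrupled from the stock 2MB (2097152) that would look
something like this in /etc/sysctl.conf (the values are just an
illustration, not a recommendation for your hardware):
	kern.ipc.maxsockbuf=8388608
	net.inet.tcp.recvbuf_max=8388608
	net.inet.tcp.sendbuf_max=8388608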

> 
> Server side:
> nfs_server_enable="YES"
> nfsv4_server_enable="YES"
> nfsv4_server_only="YES"
> nfs_server_maxio="1048576"
> With correctly applied sysctl:
> root@server:~ # sysctl vfs.nfsd.srvmaxio
> vfs.nfsd.srvmaxio: 1048576
> root@server:~ # sysctl vfs.nfs.iodmax
> vfs.nfs.iodmax: 64
> 
> First, just generating the server disk speed to be used as reference:
> root@server:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 3.477100 secs (6176076082 bytes/sec)
> root@server:~ # units -t '6176076082 bytes' gigabit
> 49.408609
> 
> So here, reaching about 40Gb/s with NFS will be the target.
> 
> But before the NFS test, a simple iperf3 test between client and server
> with 16 sessions (same as with nconnect):
> root@client:~ # iperf3 -c 1.1.1.30 --parallel 16
                                     ^^^^^^^^^^^^^^
> [SUM]   0.00-10.00  sec  99.1 GBytes  85.1 Gbits/sec  81693  sender
> 
> The 100Gb/s link is here and seems to be working fine with iperf3.

I am going to assume you kept cranking up the parallel count until
you reached what you feel to be "working fine".  I would be very interested
in the data at 1, 2, 4, 8 and 16 parallel streams, and not just the final
number but the actual test data as output during the iperf3 runs.
Of particular value is what the window sizes look like.
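
Something along these lines would capture all of that (address taken
from your test above; adjust to taste):
	for p in 1 2 4 8 16; do
		iperf3 -c 1.1.1.30 --parallel $p | tee iperf3_p${p}.log
	done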

Also, what is the ping time between the client and the server?
	ping -q -s 1500 -c 10 servername

> 
> On the client side, the NFS test now:
> root@client:~ # mount -t nfs -o
> noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> /tmp/nfs/
> root@client:~ # nfsstat -m
> 1.1.1.30:/nfs on /tmp/nfs
> nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
> 
> => Notice here that the negotiated rsize and wsize haven't improved despite
> the bump of vfs.nfsd.srvmaxio on the server side. Shouldn't those values be
> a lot bigger at this stage?
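
Note that Rick's suggestion above also included setting
vfs.maxbcachebuf=1048576 in /boot/loader.conf on the client; I don't see
it in your sysctl list, and rsize/wsize stuck at 65536 is what I would
expect if that tunable is still at its default.  Worth checking on the
client before remounting:
	sysctl vfs.maxbcachebuf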
> 
> root@client:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 9.591257 secs (2239001240 bytes/sec)
> root@client:~ # units -t '2239001240 bytes' gigabit
> 17.91201
> root@client:~ # umount /tmp/nfs/
> root@client:~ # mount -t nfs -o
> noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> /tmp/nfs/
> root@client:~ # dd of=/dev/zero if=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 6.900937 secs (3111872643 bytes/sec)
> root@client:~ # units -t '3111872643 bytes' gigabit
> 24.894981
> 
> So with NFS I'm able to read at about 25Gb/s and write at 18Gb/s.
> 
> The output of a "pmcstat -TS cpu_clk_unhalted.thread_p -w1" on the client
> during this test shows a high level of invlop_handler:
> 
> PMC: [cpu_clk_unhalted.thread_p] Samples: 9730 (100.0%) , 0 unresolved
> 
> %SAMP IMAGE      FUNCTION                       CALLERS
>  31.2 kernel     invlop_handler
>  24.9 kernel     cpu_idle                       sched_idletd
>  11.4 kernel     Xinvlop
>   1.8 kernel     copyin_smap_erms               uiomove_faultflag
>   1.8 kernel     memmove_erms                   nfsm_uiombuf
>   1.5 kernel     cpu_search_highest             cpu_search_highest
>   1.3 kernel     mb_free_ext                    m_free
> 
> And on the server:
> 
> PMC: [cpu_clk_unhalted.thread_p] Samples: 4093 (100.0%) , 0 unresolved
> 
> %SAMP IMAGE      FUNCTION                       CALLERS
>   7.8 zfs.ko     abd_cmp_zero_off_cb            abd_iterate_func
>   7.7 kernel     memmove_erms                   uiomove_faultflag
>   4.9 kernel     cpu_idle                       sched_idletd
>   4.8 kernel     mlx5e_rx_cq_comp               mlx5_cq_completion
>   3.4 kernel     cpu_search_highest             cpu_search_highest
>   3.4 kernel     memset_erms                    dbuf_read
>   3.0 kernel     mb_ctor_pack                   uma_zalloc_arg
>   2.6 kernel     soreceive_generic_locked       soreceive_generic
>   2.2 kernel     lock_delay                     dbuf_find
> 
> Regards,
> Olivier

-- 
Rod Grimes                                                 rgrimes@freebsd.org