Re: 100Gb performance
Date: Tue, 24 Jun 2025 23:38:57 UTC
> On Tue, Jun 24, 2025 at 4:45 PM Rick Macklem <rick.macklem@gmail.com> wrote:
>
> >
> > Here's how I'd configure a client (assuming it's a fairly beefy system):
> > In /boot/loader.conf:
> > vfs.maxbcachebuf=1048576
> >
> > In /etc/sysctl.conf:
> > kern.ipc.maxsockbuf=47370024 (or larger)
> > vfs.nfs.iodmax=64
> >
> > Then I'd use these mount options (along with whatever you normally use,
> > except don't specify rsize, wsize since it should use whatever the server
> > supports):
> > nconnect=8,nocto,readahead=8,wcommitsize=67108864 (or larger)
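(For the archive: assembling those options into a single command, and assuming
the 1.1.1.30:/nfs export and /tmp/nfs mountpoint that appear later in this
thread, the mount line would look something like:

mount -t nfs -o nfsv4,nconnect=8,nocto,readahead=8,wcommitsize=67108864 \
    1.1.1.30:/nfs /tmp/nfs

with rsize/wsize deliberately left out so they negotiate up to whatever the
server supports.)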
> >
> > To test write rate, I'd:
> > # dd if=/dev/zero of=<file on mount> bs=1M count=10240
> > for reading
> > # dd if=<file on mount> of=/dev/null bs=1M
> > (but umount/mount between the two "dd"s, so nothing is cached
> > in the client's buffer cache)
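(A minimal sketch of that whole sequence as a script, assuming the same
hypothetical mount as above; the umount/mount pair between the two dd runs
is what guarantees a cold client cache for the read:

#!/bin/sh
OPTS=nfsv4,nconnect=8,nocto,readahead=8,wcommitsize=67108864
mount -t nfs -o $OPTS 1.1.1.30:/nfs /tmp/nfs
dd if=/dev/zero of=/tmp/nfs/data bs=1M count=10240    # write test
umount /tmp/nfs                                       # drop cached data
mount -t nfs -o $OPTS 1.1.1.30:/nfs /tmp/nfs
dd if=/tmp/nfs/data of=/dev/null bs=1M                # read test, cold cache
)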
> >
> > If you are stuck at 1.2Gbytes/sec, there's some bottleneck, but
> > I can't say where.
> >
> > rick
> > ps: The newnfs threads do write-behind and read-ahead, so there
> > is some parallelism for the "dd".
> >
> >
> Hi,
>
> Ok, let's try all those parameters (running June 2025 stableweek):
>
> On server and client, /etc/sysctl.conf configured with a:
> kern.ipc.maxsockbuf=33554432
> net.inet.tcp.recvbuf_max=33554432
> net.inet.tcp.sendbuf_max=33554432
> net.inet.tcp.recvspace=1048576
> net.inet.tcp.sendspace=524288
> vfs.nfs.iodmax=64
I suggested doubling or quadrupling the defaults of 2MB,
why did you only try 1.5 times?
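For reference, assuming the stock 2097152-byte (2MB) defaults for these
buffer limits are what is meant here, doubled and quadrupled would look like:

# doubled (2 x 2097152):
net.inet.tcp.recvbuf_max=4194304
net.inet.tcp.sendbuf_max=4194304
# or quadrupled (4 x 2097152):
net.inet.tcp.recvbuf_max=8388608
net.inet.tcp.sendbuf_max=8388608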
>
> Server side:
> nfs_server_enable="YES"
> nfsv4_server_enable="YES"
> nfsv4_server_only="YES"
> nfs_server_maxio="1048576"
> With the sysctls correctly applied:
> root@server:~ # sysctl vfs.nfsd.srvmaxio
> vfs.nfsd.srvmaxio: 1048576
> root@server:~ # sysctl vfs.nfs.iodmax
> vfs.nfs.iodmax: 64
>
> First, measuring the raw server disk speed to be used as a reference:
> root@server:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 3.477100 secs (6176076082 bytes/sec)
> root@server:~ # units -t '6176076082 bytes' gigabit
> 49.408609
>
> So here, reaching about 40Gb/s with NFS will be the target.
>
> But before the NFS test, a simple iperf3 test between client and server
> with 16 sessions (same as with nconnect):
> root@client:~ # iperf3 -c 1.1.1.30 --parallel 16
                                     ^^^^^^^^^^^^^
> [SUM] 0.00-10.00 sec 99.1 GBytes 85.1 Gbits/sec 81693 sender
>
> The 100Gb/s link is here and seems to be working fine with iperf3.
I am going to assume you kept cranking up the parallel count until
you reached what you feel to be "working fine". I would be very interested
in the data at 1, 2, 4, 8 and 16, and not just the final number
but the actual "test data" as output during the run of iperf3.
Especially of value is what the window sizes look like.
Also what is the ping time between the client and server?
ping -q -s 1500 -c 10 servername
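Something like this would collect the per-step data in one go (a sketch,
reusing the 1.1.1.30 address from your test; log names are arbitrary):

for n in 1 2 4 8 16; do
    iperf3 -c 1.1.1.30 --parallel $n | tee iperf3-p$n.log
done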
>
> On the client side, the NFS test now:
> root@client:~ # mount -t nfs -o
> noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> /tmp/nfs/
> root@client:~ # nfsstat -m
> 1.1.1.30:/nfs on /tmp/nfs
> nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
>
> => Notice here that the negotiated rsize and wsize haven't improved despite
> the bump of vfs.nfsd.srvmaxio on the server side. Shouldn't those values be
> a lot bigger at this stage?
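Worth checking: the negotiated rsize/wsize is also capped on the client side
by the vfs.maxbcachebuf tunable, whose stock default is 65536, which matches
exactly what nfsstat shows. That is what Rick's /boot/loader.conf suggestion
at the top addresses. Assuming that tunable was meant to be in effect here,
it should read:

root@client:~ # sysctl vfs.maxbcachebuf
vfs.maxbcachebuf: 1048576

If it still reports 65536, the loader.conf setting did not take effect
(it is a boot-time tunable, so it needs a reboot to apply).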
>
> root@client:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 9.591257 secs (2239001240 bytes/sec)
> root@client:~ # units -t '2239001240 bytes' gigabit
> 17.91201
> root@client:~ # umount /tmp/nfs/
> root@client:~ # mount -t nfs -o
> noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> /tmp/nfs/
> root@client:~ # dd of=/dev/zero if=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 6.900937 secs (3111872643 bytes/sec)
> root@client:~ # units -t '3111872643 bytes' gigabit
> 24.894981
>
> So with NFS I'm able to read at about 25Gb/s and write at 18Gb/s.
>
> The output of a "pmcstat -TS cpu_clk_unhalted.thread_p -w1" on the client
> during this test shows a high level of invlop_handler:
>
> PMC: [cpu_clk_unhalted.thread_p] Samples: 9730 (100.0%) , 0 unresolved
>
> %SAMP IMAGE FUNCTION CALLERS
> 31.2 kernel invlop_handler
> 24.9 kernel cpu_idle sched_idletd
> 11.4 kernel Xinvlop
> 1.8 kernel copyin_smap_erms uiomove_faultflag
> 1.8 kernel memmove_erms nfsm_uiombuf
> 1.5 kernel cpu_search_highest cpu_search_highest
> 1.3 kernel mb_free_ext m_free
>
> And on the server:
>
> PMC: [cpu_clk_unhalted.thread_p] Samples: 4093 (100.0%) , 0 unresolved
>
> %SAMP IMAGE FUNCTION CALLERS
> 7.8 zfs.ko abd_cmp_zero_off_cb abd_iterate_func
> 7.7 kernel memmove_erms uiomove_faultflag
> 4.9 kernel cpu_idle sched_idletd
> 4.8 kernel mlx5e_rx_cq_comp mlx5_cq_completion
> 3.4 kernel cpu_search_highest cpu_search_highest
> 3.4 kernel memset_erms dbuf_read
> 3.0 kernel mb_ctor_pack uma_zalloc_arg
> 2.6 kernel soreceive_generic_locked soreceive_generic
> 2.2 kernel lock_delay dbuf_find
>
> Regards,
> Olivier
--
Rod Grimes rgrimes@freebsd.org