Re: 100Gb performance

From: Olivier_Cochard-Labbé <olivier_at_freebsd.org>
Date: Tue, 24 Jun 2025 15:47:07 UTC
On Tue, Jun 24, 2025 at 4:45 PM Rick Macklem <rick.macklem@gmail.com> wrote:

>
> Here's how I'd configure a client (assuming it's a fairly beefy system):
> In /boot/loader.conf:
> vfs.maxbcachebuf=1048576
>
> In /etc/sysctl.conf:
> kern.ipc.maxsockbuf=47370024 (or larger)
> vfs.nfs.iodmax=64
>
> Then I'd use these mount options (along with whatever you normally use,
> except don't specify rsize, wsize since it should use whatever the server
> supports):
> nconnect=8,nocto,readahead=8,wcommitsize=67108864 (or larger)
>
> To test write rate, I'd:
> # dd if=/dev/zero of=<file on mount> bs=1M count=10240
> for reading
> # dd if=<file on mount> of=/dev/null bs=1M
> (but umount/mount between the two "dd"s, so nothing is cached
> in the client's buffer cache)
>
> If you are stuck at 1.2Gbytes/sec, there's some bottleneck, but
> I can't say where.
>
> rick
> ps: The newnfs threads do write-behind and read-ahead, so there
>      is some parallelism for the "dd".
>
>
Hi,

Ok, let’s try all those parameters (running a June 2025 stable-week snapshot):

On both server and client, /etc/sysctl.conf is configured with:
kern.ipc.maxsockbuf=33554432
net.inet.tcp.recvbuf_max=33554432
net.inet.tcp.sendbuf_max=33554432
net.inet.tcp.recvspace=1048576
net.inet.tcp.sendspace=524288
vfs.nfs.iodmax=64
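
(For anyone reproducing this: assuming sysctl(8) on this version supports the
-f flag, the values can be applied to the running system on both hosts by
re-reading the file rather than rebooting:
root@server:~ # sysctl -f /etc/sysctl.conf
)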

Server side:
nfs_server_enable="YES"
nfsv4_server_enable="YES"
nfsv4_server_only="YES"
nfs_server_maxio="1048576"
With the sysctls correctly applied:
root@server:~ # sysctl vfs.nfsd.srvmaxio
vfs.nfsd.srvmaxio: 1048576
root@server:~ # sysctl vfs.nfs.iodmax
vfs.nfs.iodmax: 64

First, measuring the server's local disk speed to use as a reference:
root@server:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 3.477100 secs (6176076082 bytes/sec)
root@server:~ # units -t '6176076082 bytes' gigabit
49.408609

So here, reaching about 40Gb/s with NFS will be the target.

But before the NFS test, a simple iperf3 run between client and server
with 16 parallel sessions (the same number as nconnect):
root@client:~ # iperf3 -c 1.1.1.30 --parallel 16
[SUM]   0.00-10.00  sec  99.1 GBytes  85.1 Gbits/sec  81693  sender

So the 100Gb/s link is in place and seems to work fine with iperf3.

On the client side, the NFS test now:
root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
root@client:~ # nfsstat -m
1.1.1.30:/nfs on /tmp/nfs
nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647

=> Notice here that the negotiated rsize and wsize haven't improved despite
the bump of vfs.nfsd.srvmaxio on the server side. Shouldn't those values be a
lot bigger at this stage?
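
(A check worth doing here, given the /boot/loader.conf suggestion quoted
above: as far as I understand, the client caps rsize/wsize at
vfs.maxbcachebuf, so if that tunable was not raised on the client, the 64k
values would be expected. Quick check:
root@client:~ # sysctl vfs.maxbcachebuf
If it still reports the default (65536, I believe), that would match the
negotiated sizes above.)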

root@client:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 9.591257 secs (2239001240 bytes/sec)
root@client:~ # units -t '2239001240 bytes' gigabit
17.91201
root@client:~ # umount /tmp/nfs/
root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
root@client:~ # dd of=/dev/zero if=/tmp/nfs/data bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 6.900937 secs (3111872643 bytes/sec)
root@client:~ # units -t '3111872643 bytes' gigabit
24.894981

So with NFS I’m able to read at about 25Gb/s and write at 18Gb/s.
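
(A follow-up test I could run, to see whether a single dd stream is the
limiting factor, would be a few writers in parallel; a rough sketch, file
names and sizes being arbitrary:
root@client:~ # for i in 1 2 3 4; do dd if=/dev/zero of=/tmp/nfs/data.$i bs=1M count=5120 & done; wait
)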

The output of a "pmcstat -TS cpu_clk_unhalted.thread_p -w1" on the client
during this test shows a high level of invlop_handler:

PMC: [cpu_clk_unhalted.thread_p] Samples: 9730 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION                       CALLERS
 31.2 kernel     invlop_handler
 24.9 kernel     cpu_idle                       sched_idletd
 11.4 kernel     Xinvlop
  1.8 kernel     copyin_smap_erms               uiomove_faultflag
  1.8 kernel     memmove_erms                   nfsm_uiombuf
  1.5 kernel     cpu_search_highest             cpu_search_highest
  1.3 kernel     mb_free_ext                    m_free

And on the server:

PMC: [cpu_clk_unhalted.thread_p] Samples: 4093 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION                       CALLERS
  7.8 zfs.ko     abd_cmp_zero_off_cb            abd_iterate_func
  7.7 kernel     memmove_erms                   uiomove_faultflag
  4.9 kernel     cpu_idle                       sched_idletd
  4.8 kernel     mlx5e_rx_cq_comp               mlx5_cq_completion
  3.4 kernel     cpu_search_highest             cpu_search_highest
  3.4 kernel     memset_erms                    dbuf_read
  3.0 kernel     mb_ctor_pack                   uma_zalloc_arg
  2.6 kernel     soreceive_generic_locked       soreceive_generic
  2.2 kernel     lock_delay                     dbuf_find
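
To dig into what is generating the invlop_handler/Xinvlop time on the client
(the TLB-shootdown IPI path, if I read it right), the next step could be to
record samples with callchains and build a callgraph, roughly like this
(sampling window and output paths are arbitrary, with the dd running in
another terminal meanwhile):
root@client:~ # pmcstat -S cpu_clk_unhalted.thread_p -O /tmp/samples.pmc sleep 30
root@client:~ # pmcstat -R /tmp/samples.pmc -G /tmp/callgraph.txt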

Regards,
Olivier