Re: 100Gb performance
Date: Thu, 26 Jun 2025 03:41:20 UTC
> On Tue, Jun 24, 2025 at 4:39 PM Rodney W. Grimes
> <freebsd-rwg@gndrsh.dnsmgr.net> wrote:
> >
> > > On Tue, Jun 24, 2025 at 4:45 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > > >
> > > > Here's how I'd configure a client (assuming it's a fairly beefy system):
> > > > In /boot/loader.conf:
> > > > vfs.maxbcachebuf=1048576
> > > >
> > > > In /etc/sysctl.conf:
> > > > kern.ipc.maxsockbuf=47370024 (or larger)
> > > > vfs.nfs.iodmax=64
> > > >
> > > > Then I'd use these mount options (along with whatever you normally use,
> > > > except don't specify rsize, wsize since it should use whatever the server
> > > > supports):
> > > > nconnect=8,nocto,readahead=8,wcommitsize=67108864 (or larger)
> > > >
> > > > To test write rate, I'd:
> > > > # dd if=/dev/zero of=<file on mount> bs=1M count=10240
> > > > for reading
> > > > # dd if=<file on mount> of=/dev/null bs=1M
> > > > (but umount/mount between the two "dd"s, so nothing is cached
> > > > in the client's buffer cache)
> > > >
> > > > If you are stuck at 1.2Gbytes/sec, there's some bottleneck, but
> > > > I can't say where.
> > > >
> > > > rick
> > > > ps: The newnfs threads do write-behind and read-ahead, so there
> > > > is some parallelism for the "dd".
> > > >
> > >
> > > Hi,
> > >
> > > Ok, let's try all those parameters (running June 2025 stableweek):
> > >
> > > On server and client, /etc/sysctl.conf configured with:
> > > kern.ipc.maxsockbuf=33554432
> > > net.inet.tcp.recvbuf_max=33554432
> > > net.inet.tcp.sendbuf_max=33554432
> > > net.inet.tcp.recvspace=1048576
> > > net.inet.tcp.sendspace=524288
> > > vfs.nfs.iodmax=64
> >
> > I suggested doubling, or quadrupling the defaults of 2MB,
> > why did you only try 1.5 times?
> I think he increased it to 32Mbytes and not 3?

You're correct; the long run of digits seems to have confused my parsing.
I was, however, interested in the data points that runs at 1x, 2x, 4x,
and 8x would have given us.

I am also not sure if the recvspace is large enough.

>
> Here's my way of guesstimating it...
> - For a 1msec transit time (it's probably less):
>   100Mbits/msec * 1msec = 100Mbits / 8 = 12.5Mbytes
> - This is what it takes to fill the bit pipe.
> - It takes the same time for an ACK to transit in the
>   opposite direction, so double it for that.
> Put another way, get the RTT via ping and then:
>   100Mbits/msec * RTT(msec) / 8
> However, this doesn't account for delay in the server
> processing the TCP segment and sending an ACK,
> so I'd bump the above up by a bunch.
>
> Does the above look reasonable? rick

Reasonable, except that your 1msec is probably far too long for modern
connections at 1Gb/s and above. Even 1Gb/s ethernet has a <400uS ping
(so RTT) when doing 1500 byte packets, and <200uS when doing
minimal-length packets.

> > >
> > > Server side:
> > > nfs_server_enable="YES"
> > > nfsv4_server_enable="YES"
> > > nfsv4_server_only="YES"
> > > nfs_server_maxio="1048576"
> > > With the sysctls correctly applied:
> > > root@server:~ # sysctl vfs.nfsd.srvmaxio
> > > vfs.nfsd.srvmaxio: 1048576
> > > root@server:~ # sysctl vfs.nfs.iodmax
> > > vfs.nfs.iodmax: 64
> > >
> > > First, just measuring the server disk speed to be used as a reference:
> > > root@server:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> > > 20480+0 records in
> > > 20480+0 records out
> > > 21474836480 bytes transferred in 3.477100 secs (6176076082 bytes/sec)
> > > root@server:~ # units -t '6176076082 bytes' gigabit
> > > 49.408609
> > >
> > > So here, reaching about 40Gb/s with NFS will be the target.
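Going back to the buffer-sizing guesstimate above, here it is as a small
script so others can plug in their own numbers. The 2x allowance for the
ACK return path is from Rick's description; the example link rate and
RTT values are just assumptions for illustration:

#!/bin/sh
# Guesstimate the socket buffer needed to fill the pipe:
# bytes in flight = rate(bits/usec) * RTT(usec) / 8,
# then doubled for the ACK transit in the opposite direction.
RATE_GBPS=100   # link rate in Gbit/s (assumed; 1 Gbit/s = 1000 bits/usec)
RTT_US=400      # round trip time in usec, e.g. from ping -q -s 1500
BDP=$(( RATE_GBPS * 1000 * RTT_US / 8 ))
SOCKBUF=$(( BDP * 2 ))
echo "bytes to fill the pipe: ${BDP}"
echo "suggested kern.ipc.maxsockbuf: at least ${SOCKBUF}, plus some slack"

At the <400uS RTT I mentioned, that works out to roughly 10Mbytes, so the
33554432 already configured should be plenty on that score.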
> > >
> > > But before the NFS test, a simple iperf3 test between client and server
> > > with 16 sessions (same as with nconnect):
> > > root@client:~ # iperf3 -c 1.1.1.30 --parallel 16
> >                                      ^^^^^^^^^^^^^^
> > > [SUM] 0.00-10.00 sec  99.1 GBytes  85.1 Gbits/sec  81693  sender
> > >
> > > The 100Gb/s link is here and seems to be working fine with iperf3.
> >
> > I am going to assume you kept cranking up the parallel count until
> > you reached what you feel to be "working fine". I would be very interested
> > in the data at 1, 2, 4, 8 and 16, and not just the final number
> > but the actual "test data" as output during the run of iperf3.
> > Especially of value is what the window sizes look like.
> >
> > Also what is the ping time between the client and server?
> > ping -q -s 1500 -c 10 servername
> >
> > >
> > > On the client side, the NFS test now:
> > > root@client:~ # mount -t nfs -o
> > > noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> > > /tmp/nfs/
> > > root@client:~ # nfsstat -m
> > > 1.1.1.30:/nfs on /tmp/nfs
> > > nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
> > >
> > > => Notice here that the negotiated rsize and wsize haven't improved since the
> > > bump of vfs.nfsd.srvmaxio on the server side. Shouldn't those values be a lot
> > > bigger at this stage?
> > >
> > > root@client:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> > > 20480+0 records in
> > > 20480+0 records out
> > > 21474836480 bytes transferred in 9.591257 secs (2239001240 bytes/sec)
> > > root@client:~ # units -t '2239001240 bytes' gigabit
> > > 17.91201
> > > root@client:~ # umount /tmp/nfs/
> > > root@client:~ # mount -t nfs -o
> > > noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs
> > > /tmp/nfs/
> > > root@client:~ # dd of=/dev/zero if=/tmp/nfs/data bs=1M count=20480
> > > 20480+0 records in
> > > 20480+0 records out
> > > 21474836480 bytes transferred in 6.900937 secs (3111872643 bytes/sec)
> > > root@client:~ # units -t '3111872643 bytes' gigabit
> > > 24.894981
> > >
> > > So with NFS I'm able to read at about 25Gb/s and write at 18Gb/s.
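To collect the per-stream-count data I asked about above in one go,
something like this would do (the address 1.1.1.30 and the commands are
taken from the tests above; the rest is just a sketch):

#!/bin/sh
# Gather the large-packet RTT, then iperf3 runs at 1, 2, 4, 8 and
# 16 streams, keeping the full per-interval output for each run.
SERVER=1.1.1.30
ping -q -s 1500 -c 10 ${SERVER}
for n in 1 2 4 8 16; do
    echo "=== iperf3 --parallel ${n} ==="
    iperf3 -c ${SERVER} --parallel ${n}
done

The per-interval output would show the window sizes for each run as well.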
> > >
> > > The output of a "pmcstat -TS cpu_clk_unhalted.thread_p -w1" on the client
> > > during this test shows a high level of invlop_handler:
> > >
> > > PMC: [cpu_clk_unhalted.thread_p] Samples: 9730 (100.0%) , 0 unresolved
> > >
> > > %SAMP IMAGE   FUNCTION                  CALLERS
> > >  31.2 kernel  invlop_handler
> > >  24.9 kernel  cpu_idle                  sched_idletd
> > >  11.4 kernel  Xinvlop
> > >   1.8 kernel  copyin_smap_erms          uiomove_faultflag
> > >   1.8 kernel  memmove_erms              nfsm_uiombuf
> > >   1.5 kernel  cpu_search_highest        cpu_search_highest
> > >   1.3 kernel  mb_free_ext               m_free
> > >
> > > And on the server:
> > >
> > > PMC: [cpu_clk_unhalted.thread_p] Samples: 4093 (100.0%) , 0 unresolved
> > >
> > > %SAMP IMAGE   FUNCTION                  CALLERS
> > >   7.8 zfs.ko  abd_cmp_zero_off_cb       abd_iterate_func
> > >   7.7 kernel  memmove_erms              uiomove_faultflag
> > >   4.9 kernel  cpu_idle                  sched_idletd
> > >   4.8 kernel  mlx5e_rx_cq_comp          mlx5_cq_completion
> > >   3.4 kernel  cpu_search_highest        cpu_search_highest
> > >   3.4 kernel  memset_erms               dbuf_read
> > >   3.0 kernel  mb_ctor_pack              uma_zalloc_arg
> > >   2.6 kernel  soreceive_generic_locked  soreceive_generic
> > >   2.2 kernel  lock_delay                dbuf_find
> > >
> > > Regards,
> > > Olivier
> >
> > --
> > Rod Grimes                                      rgrimes@freebsd.org

--
Rod Grimes                                      rgrimes@freebsd.org
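ps: To see who is triggering all that invlop_handler time on the client,
a callgraph capture might help. A rough sketch, assuming hwpmc(4) is
already loaded (as it must be for the -T run above) and with arbitrary
file paths:

# Sample system-wide into a log file while the dd test runs
# (pmcstat stops sampling when the given command exits):
pmcstat -S cpu_clk_unhalted.thread_p -O /tmp/samples.out sleep 30
# Then post-process the log into a callgraph:
pmcstat -R /tmp/samples.out -G /tmp/callgraph.txt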