Re: 100Gb performance

From: Daniel Braniss <danny_at_cs.huji.ac.il>
Date: Tue, 24 Jun 2025 16:55:13 UTC

> On 24 Jun 2025, at 19:15, Olivier Cochard-Labbé <olivier@freebsd.org> wrote:
> 
> 
> On Tue, Jun 24, 2025 at 5:56 PM Rick Macklem <rick.macklem@gmail.com> wrote:
>> On Tue, Jun 24, 2025 at 8:47 AM Olivier Cochard-Labbé
>> <olivier@freebsd.org> wrote:
>> >
>> >
>> > On Tue, Jun 24, 2025 at 4:45 PM Rick Macklem <rick.macklem@gmail.com> wrote:
>> >>
>> >>
>> >> Here's how I'd configure a client (assuming it's a fairly beefy system):
>> >> In /boot/loader.conf:
>> >> vfs.maxbcachebuf=1048576

Need to reboot for that one, I can’t now :-(

>> >>
>> >> In /etc/sysctl.conf:
>> >> kern.ipc.maxsockbuf=47370024 (or larger)
>> >> vfs.nfs.iodmax=64
>> >>
>> >> Then I'd use these mount options (along with whatever you normally use,
>> >> except don't specify rsize, wsize since it should use whatever the server
>> >> supports):
>> >> nconnect=8,nocto,readahead=8,wcommitsize=67108864 (or larger)
>> >>
>> >> To test write rate, I'd:
>> >> # dd if=/dev/zero of=<file on mount> bs=1M count=10240
>> >> for reading
>> >> # dd if=<file on mount> of=/dev/null bs=1M
>> >> (but umount/mount between the two "dd"s, so nothing is cached
>> >> in the client's buffer cache)
>> >>
>> >> If you are stuck at 1.2Gbytes/sec, there's some bottleneck, but
>> >> I can't say where.
>> >>
>> >> rick
>> >> ps: The newnfs threads do write-behind and read-ahead, so there
>> >>      is some parallelism for the "dd".
>> >>
>> >
>> > Hi,
>> >
>> > OK, let’s try all those parameters (running the June 2025 stableweek):
>> >
>> > On both server and client, /etc/sysctl.conf is configured with:
>> > kern.ipc.maxsockbuf=33554432
>> > net.inet.tcp.recvbuf_max=33554432
>> > net.inet.tcp.sendbuf_max=33554432
>> > net.inet.tcp.recvspace=1048576
>> > net.inet.tcp.sendspace=524288
>> > vfs.nfs.iodmax=64
>> >
>> > Server side:
>> > nfs_server_enable="YES"
>> > nfsv4_server_enable="YES"
>> > nfsv4_server_only="YES"
>> > nfs_server_maxio="1048576"
>> > With correctly applied sysctl:
>> > root@server:~ # sysctl vfs.nfsd.srvmaxio
>> > vfs.nfsd.srvmaxio: 1048576
>> > root@server:~ # sysctl vfs.nfs.iodmax
>> > vfs.nfs.iodmax: 64
>> >
>> > First, measuring the server's local disk write speed to use as a reference:
>> > root@server:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
>> > 20480+0 records in
>> > 20480+0 records out
>> > 21474836480 bytes transferred in 3.477100 secs (6176076082 bytes/sec)
>> > root@server:~ # units -t '6176076082 bytes' gigabit
>> > 49.408609
>> >
>> > So here, reaching about 40Gb/s with NFS will be the target.
>> >
>> > But before the NFS test, a simple iperf3 test between client and server with 16 sessions (same as with nconnect):
>> > root@client:~ # iperf3 -c 1.1.1.30 --parallel 16
>> > [SUM]   0.00-10.00  sec  99.1 GBytes  85.1 Gbits/sec  81693  sender
>> >
>> > The 100Gb/s link is here and seems to be working fine with iperf3.
>> >
>> > On the client side, the NFS test now:
>> > root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
>> > root@client:~ # nfsstat -m
>> > 1.1.1.30:/nfs on /tmp/nfs
>> > nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
>> >
>> > => Notice that the negotiated rsize and wsize have not increased despite the bump of vfs.nfsd.srvmaxio on the server side. Shouldn't those values be a lot bigger at this stage?
>> Yep. Did you reboot the client after putting
>> vfs.maxbcachebuf=1048576
>> in /boot/loader.conf?
>> (It's a tunable, so it needs to be set at boot time.)
>> The rsize, wsize should be 1048576.
>> 
>> rick
> 
> Indeed, I forgot about this sysctl!
> 
> With the correct settings:
> 
> root@client:~ # sysctl vfs.maxbcachebuf
> vfs.maxbcachebuf: 1048576
> root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
> root@client:~ # nfsstat -m
> 1.1.1.30:/nfs on /tmp/nfs
> nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=1048576,wsize=1048576,readdirsize=1048576,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
> 
> => The negotiated rsize and wsize are now 1048576, which fixes the read speed (but not the write speed):
> 
> root@client:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 7.574187 secs (2835266137 bytes/sec)
> root@client:~ # units -t '2835266137 bytes' gigabit
> 22.682129
> root@client:~ # umount /tmp/nfs/
> root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
> root@client:~ # dd of=/dev/zero if=/tmp/nfs/data bs=1M count=20480
> 20480+0 records in
> 20480+0 records out
> 21474836480 bytes transferred in 4.168176 secs (5152094642 bytes/sec)
> root@client:~ # units -t '5152094642 bytes' gigabit
> 41.216757
> 
> Thanks,
> Olivier
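
For anyone repeating these tests, here is a minimal sh sketch of the two checks done by hand above: reading the negotiated rsize/wsize out of the option line printed by `nfsstat -m`, and converting a dd rate in bytes/sec to gigabits/sec the same way `units -t 'N bytes' gigabit` does. The helper names nfs_opt and to_gbit are made up for illustration, and the usage in the trailing comment assumes the option list is on the second line of the `nfsstat -m` output, as in the listings quoted above.

```sh
#!/bin/sh
# Pull one option value (e.g. rsize) out of a comma-separated
# mount-option line such as the one printed by `nfsstat -m`.
# $1 = option name, $2 = option line
nfs_opt() {
    printf '%s\n' "$2" | tr ',' '\n' | sed -n "s/^$1=//p"
}

# Convert a rate in bytes/sec to decimal gigabits/sec
# (bytes * 8 / 10^9), rounded to two decimals.
to_gbit() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b * 8 / 1e9 }'
}

# Hypothetical usage on the client:
#   opts=$(nfsstat -m | sed -n '2p')
#   [ "$(nfs_opt rsize "$opts")" = 1048576 ] || \
#       echo "rsize is not 1M; was vfs.maxbcachebuf set at boot?"
#   to_gbit 5152094642
```

If rsize or wsize still reports 65536 after raising vfs.nfsd.srvmaxio on the server, the first thing to check is that vfs.maxbcachebuf was set in /boot/loader.conf on the client and that the client was rebooted, as Rick points out above.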