Re: 100Gb performance

From: Olivier Cochard-Labbé <olivier_at_freebsd.org>
Date: Tue, 24 Jun 2025 16:15:07 UTC
On Tue, Jun 24, 2025 at 5:56 PM Rick Macklem <rick.macklem@gmail.com> wrote:

> On Tue, Jun 24, 2025 at 8:47 AM Olivier Cochard-Labbé
> <olivier@freebsd.org> wrote:
> >
> >
> > On Tue, Jun 24, 2025 at 4:45 PM Rick Macklem <rick.macklem@gmail.com>
> wrote:
> >>
> >>
> >> Here's how I'd configure a client (assuming it's a fairly beefy system):
> >> In /boot/loader.conf:
> >> vfs.maxbcachebuf=1048576
> >>
> >> In /etc/sysctl.conf:
> >> kern.ipc.maxsockbuf=47370024 (or larger)
> >> vfs.nfs.iodmax=64
> >>
> >> Then I'd use these mount options (along with whatever you normally use,
> >> except don't specify rsize, wsize since it should use whatever the
> server
> >> supports):
> >> nconnect=8,nocto,readahead=8,wcommitsize=67108864 (or larger)
> >>
> >> To test write rate, I'd:
> >> # dd if=/dev/zero of=<file on mount> bs=1M count=10240
> >> for reading
> >> # dd if=<file on mount> of=/dev/null bs=1M
> >> (but umount/mount between the two "dd"s, so nothing is cached
> >> in the client's buffer cache)
> >>
> >> If you are stuck at 1.2Gbytes/sec, there's some bottleneck, but
> >> I can't say where.
> >>
> >> rick
> >> ps: The newnfs threads do write-behind and read-ahead, so there
> >>      is some parallelism for the "dd".
> >>
> >
> > Hi,
> >
> > OK, let's try all those parameters (running a June 2025 stableweek build):
> >
> > On both server and client, /etc/sysctl.conf is configured with:
> > kern.ipc.maxsockbuf=33554432
> > net.inet.tcp.recvbuf_max=33554432
> > net.inet.tcp.sendbuf_max=33554432
> > net.inet.tcp.recvspace=1048576
> > net.inet.tcp.sendspace=524288
> > vfs.nfs.iodmax=64
> >
> > Server side:
> > nfs_server_enable="YES"
> > nfsv4_server_enable="YES"
> > nfsv4_server_only="YES"
> > nfs_server_maxio="1048576"
> > With correctly applied sysctl:
> > root@server:~ # sysctl vfs.nfsd.srvmaxio
> > vfs.nfsd.srvmaxio: 1048576
> > root@server:~ # sysctl vfs.nfs.iodmax
> > vfs.nfs.iodmax: 64
> >
> > First, writing the test file locally to get the raw server disk speed as a reference:
> > root@server:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
> > 20480+0 records in
> > 20480+0 records out
> > 21474836480 bytes transferred in 3.477100 secs (6176076082 bytes/sec)
> > root@server:~ # units -t '6176076082 bytes' gigabit
> > 49.408609
> >
> > So here, reaching about 40Gb/s with NFS will be the target.
> >
> > But before the NFS test, a simple iperf3 run between client and server with 16 parallel sessions (the same count as used with nconnect):
> > root@client:~ # iperf3 -c 1.1.1.30 --parallel 16
> > [SUM]   0.00-10.00  sec  99.1 GBytes  85.1 Gbits/sec  81693  sender
> >
> > The 100Gb/s link is up and working fine with iperf3.
> >
> > Now the NFS test itself, on the client side:
> > root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
> > root@client:~ # nfsstat -m
> > 1.1.1.30:/nfs on /tmp/nfs
> >
> > nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
> >
> > => Notice that the negotiated rsize and wsize haven't improved despite
> > bumping vfs.nfsd.srvmaxio on the server side. Shouldn't those values be a
> > lot bigger at this stage?
> Yep. Did you reboot the client after putting
> vfs.maxbcachebuf=1048576
> in /boot/loader.conf?
> (It's a tunable, so it needs to be set at boot time.)
> The rsize, wsize should be 1048576.
>
> rick
>

Indeed, I had forgotten about this tunable!

With the correct settings:

root@client:~ # sysctl vfs.maxbcachebuf
vfs.maxbcachebuf: 1048576
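
For the record, the fix is this line in the client's /boot/loader.conf (as Rick noted, it is a boot-time tunable, so a reboot is required):

vfs.maxbcachebuf=1048576
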
root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
root@client:~ # nfsstat -m
1.1.1.30:/nfs on /tmp/nfs
nfsv4,minorversion=2,tcp,resvport,nconnect=16,hard,nocto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=1048576,wsize=1048576,readdirsize=1048576,readahead=8,wcommitsize=67108864,timeout=120,retrans=2147483647
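
As an aside, to pick out just the negotiated transfer sizes from that long option list, splitting the nfsstat -m output on commas works:

root@client:~ # nfsstat -m | tr ',' '\n' | grep -E '^(rsize|wsize)='
rsize=1048576
wsize=1048576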

=> The negotiated rsize and wsize are now 1MB, which fixes the read speed (but not the write speed):

root@client:~ # dd if=/dev/zero of=/tmp/nfs/data bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 7.574187 secs (2835266137 bytes/sec)
root@client:~ # units -t '2835266137 bytes' gigabit
22.682129
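
(units(1) here just converts bytes/s to decimal gigabits: 2835266137 B/s x 8 = 22.68 Gb/s, still well below the ~40Gb/s target.)
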
root@client:~ # umount /tmp/nfs/
root@client:~ # mount -t nfs -o noatime,nfsv4,nconnect=16,wcommitsize=67108864,readahead=8,nocto 1.1.1.30:/nfs /tmp/nfs/
root@client:~ # dd of=/dev/zero if=/tmp/nfs/data bs=1M count=20480
20480+0 records in
20480+0 records out
21474836480 bytes transferred in 4.168176 secs (5152094642 bytes/sec)
root@client:~ # units -t '5152094642 bytes' gigabit
41.216757
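
To summarize the numbers:

  local disk write (server): 49.4 Gb/s
  iperf3, 16 streams:        85.1 Gb/s
  NFS write (dd bs=1M):      22.7 Gb/s
  NFS read (dd bs=1M):       41.2 Gb/s

So reads now reach the ~40Gb/s disk-speed target, while writes still fall well short of it.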

Thanks,
Olivier