Re: zfs/nfsd performance limiter
Date: Sun, 22 May 2022 13:35:52 UTC
> What is your server system? Make/model/ram/etc.
Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (6 cores; a little starved on
clock speed, but the load is basically zero during this test)
128GB of memory
> top -aH
During the copy load (for brevity, only showing the main CPU consumers):
last pid: 15560;  load averages: 0.25, 0.39, 0.27    up 4+15:48:54  09:17:38
98 threads: 2 running, 96 sleeping
CPU: 0.0% user, 0.0% nice, 19.1% system, 5.6% interrupt, 75.3% idle
Mem: 12M Active, 4405M Inact, 8284K Laundry, 115G Wired, 1148M Buf, 4819M Free
ARC: 98G Total, 80G MFU, 15G MRU, 772K Anon, 1235M Header, 1042M Other
91G Compressed, 189G Uncompressed, 2.09:1 Ratio
Swap: 5120M Total, 5120M Free
  PID USERNAME  PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 3830 root       20    0    12M  2700K rpcsvc  2   1:16  53.26% nfsd: server (nfsd){nfsd: service}
 3830 root       20    0    12M  2700K CPU5    5   5:42  52.96% nfsd: server (nfsd){nfsd: master}
15560 adam       20    0    17M  5176K CPU2    2   0:00   0.12% top -aH
 1493 root       20    0    13M  2260K select  3   0:36   0.01% /usr/sbin/powerd
 1444 root       20    0    75M  2964K select  5   0:19   0.01% /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
 1215 uucp       20    0    13M  2820K select  5   0:27   0.01% /usr/local/libexec/nut/usbhid-ups -a cyberpower
93424 adam       20    0    21M  9900K select  0   0:00   0.01% sshd: adam@pts/0 (sshd)
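(Side note: only the nfsd master and a single service thread are doing real
work in that snapshot, so it may be worth double-checking how many nfsd
threads are configured. A sketch, assuming the usual knobs; the "64" is an
arbitrary example, not a recommendation:)

sysctl vfs.nfsd.minthreads vfs.nfsd.maxthreads   # current thread limits
# or set the thread count at boot via /etc/rc.conf:
nfs_server_flags="-u -t -n 64"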
> ifconfig -vm
mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
ether 00:02:c9:35:df:20
inet 10.5.5.1 netmask 0xffffff00 broadcast 10.5.5.255
media: Ethernet autoselect (40Gbase-CR4 <full-duplex,rxpause,txpause>)
status: active
supported media:
media autoselect
media 40Gbase-CR4 mediaopt full-duplex
media 10Gbase-CX4 mediaopt full-duplex
media 10Gbase-SR mediaopt full-duplex
media 1000baseT mediaopt full-duplex
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
plugged: QSFP+ 40GBASE-CR4 (No separable connector)
vendor: Mellanox PN: MC2207130-002 SN: MT1419VS07971 DATE: 2014-06-06
module temperature: 0.00 C voltage: 0.00 Volts
lane 1: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
lane 2: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
lane 3: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
lane 4: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
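LRO and TSO are both enabled there; if it would help isolate things, I can
toggle them per-interface and re-test, e.g.:

ifconfig mlxen0 -tso -lro   # disable TSO and LRO for a test run
ifconfig mlxen0 tso lro     # re-enable afterwards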
> - What are your values for:
>
> -- kern.ipc.maxsockbuf
> -- net.inet.tcp.sendbuf_max
> -- net.inet.tcp.recvbuf_max
>
> -- net.inet.tcp.sendspace
> -- net.inet.tcp.recvspace
>
> -- net.inet.tcp.delayed_ack
kern.ipc.maxsockbuf: 16777216
net.inet.tcp.sendbuf_max: 16777216
net.inet.tcp.recvbuf_max: 16777216
net.inet.tcp.sendspace: 32768   # interesting; not sure why this one is so
                                # much lower than recvspace
net.inet.tcp.recvspace: 4194304
net.inet.tcp.delayed_ack: 0
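If that 32K sendspace default is the limiter, I suppose I could try raising
it toward the receive side; a minimal sketch (the 4MB value just mirrors
recvspace, untested):

sysctl net.inet.tcp.sendspace=4194304                       # match recvspace
echo 'net.inet.tcp.sendspace=4194304' >> /etc/sysctl.conf   # persist at boot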
> netstat -i
Name    Mtu Network         Address              Ipkts Ierrs Idrop    Opkts Oerrs Coll
igb0   9000 <Link#1>        ac:1f:6b:b0:60:bc 18230625     0     0 24178283     0    0
igb1   9000 <Link#2>        ac:1f:6b:b0:60:bc 14341213     0     0  8447249     0    0
lo0   16384 <Link#3>        lo0                 367691     0     0   367691     0    0
lo0       - localhost       localhost               68     -     -       68     -    -
lo0       - fe80::%lo0/64   fe80::1%lo0              0     -     -        0     -    -
lo0       - your-net        localhost           348944     -     -   348944     -    -
mlxen  9000 <Link#4>        00:02:c9:35:df:20 13138046     0    12 26308206     0    0
mlxen     - 10.5.5.0/24     10.5.5.1          11592389     -     - 24345184     -    -
vm-pu  9000 <Link#6>        56:3e:55:8a:2a:f8     7270     0     0   962249   102    0
lagg0  9000 <Link#5>        ac:1f:6b:b0:60:bc 31543941     0     0 31623674     0    0
lagg0     - 192.168.0.0/2   nasbox            27967582     -     - 41779731     -    -
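I haven't captured netstat -s / netstat -m yet; something like the following
would pull out the counters you asked about (the grep patterns are just my
guess at the relevant lines):

netstat -s -p tcp | grep -Ei 'retrans|out-of-order|dup'
netstat -m | grep -Ei 'denied|delayed'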
> What threads/irq are allocated to your NIC? 'vmstat -i'
It doesn't seem perfectly balanced, but not terribly imbalanced either:
interrupt total rate
irq9: acpi0 3 0
irq18: ehci0 ehci1+ 803162 2
cpu0:timer 67465114 167
cpu1:timer 65068819 161
cpu2:timer 65535300 163
cpu3:timer 63408731 157
cpu4:timer 63026304 156
cpu5:timer 63431412 157
irq56: nvme0:admin 18 0
irq57: nvme0:io0 544999 1
irq58: nvme0:io1 465816 1
irq59: nvme0:io2 487486 1
irq60: nvme0:io3 474616 1
irq61: nvme0:io4 452527 1
irq62: nvme0:io5 467807 1
irq63: mps0 36110415 90
irq64: mps1 112328723 279
irq65: mps2 54845974 136
irq66: mps3 50770215 126
irq68: xhci0 3122136 8
irq70: igb0:rxq0 1974562 5
irq71: igb0:rxq1 3034190 8
irq72: igb0:rxq2 28703842 71
irq73: igb0:rxq3 1126533 3
irq74: igb0:aq 7 0
irq75: igb1:rxq0 1852321 5
irq76: igb1:rxq1 2946722 7
irq77: igb1:rxq2 9602613 24
irq78: igb1:rxq3 4101258 10
irq79: igb1:aq 8 0
irq80: ahci1 37386191 93
irq81: mlx4_core0 4748775 12
irq82: mlx4_core0 13754442 34
irq83: mlx4_core0 3551629 9
irq84: mlx4_core0 2595850 6
irq85: mlx4_core0 4947424 12
Total 769135944 1908
> Are the above threads floating or mapped? 'cpuset -g ...'
I suspect I was supposed to run this against a specific pid, maybe nfsd's?
Here's the output without an argument:
pid -1 mask: 0, 1, 2, 3, 4, 5
pid -1 domain policy: first-touch mask: 0
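If running it against nfsd is what you meant, I can do that instead, e.g.
(pid and irq numbers taken from the top and vmstat -i output above):

cpuset -g -p 3830   # mask for the nfsd process
cpuset -g -x 81     # binding for one of the mlx4_core0 interrupt vectors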
> Disable nfs tcp drc
This is the first I've seen a duplicate request cache mentioned. It seems
counter-intuitive that disabling it would help, but maybe I'll try it.
What exactly is the benefit?
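For reference, I'm assuming the knob in question is vfs.nfsd.cachetcp, i.e.
something like:

sysctl vfs.nfsd.cachetcp=0   # disable the DRC for TCP mounts (my assumption
                             # of the suggested knob)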
> What is your atime setting?
Disabled at both the file system and the client mounts.
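(Checked roughly like this, with "pool/share" standing in for the actual
dataset name:)

zfs get atime pool/share    # server side: atime=off
grep noatime /proc/mounts   # Linux client side: mounted noatime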
> You also state you are using a Linux client. Are you using the MLX affinity scripts, buffer sizing suggestions, etc, etc. Have you swapped the Linux system for a fbsd system?
I've not, though I vaguely recall Mellanox supplying some scripts in their
documentation that pinned interrupt handling to specific cores at one
point. Is that what you're referring to? I could give it a try. I don't at
present have a FreeBSD client system with enough PCI Express bandwidth to
swap in for a FreeBSD vs. Linux client comparison.
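If it's useful, I can redo by hand what I remember those scripts doing,
i.e. spreading the mlx4 interrupts across cores on the Linux client; a
rough sketch (IRQ numbers vary per system):

# on the Linux client: spread each mlx4 IRQ across successive cores
cpu=0
for irq in $(awk -F: '/mlx4/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
    echo $cpu > /proc/irq/$irq/smp_affinity_list
    cpu=$((cpu + 1))
done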
> You mention iperf. Please post the options you used when invoking iperf and its output.
Setting up the NFS client as the iperf "server" (the terminology is a bit
flipped with iperf relative to NFS), here's the output:
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 10.5.5.1, port 11534
[ 5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 3.81 GBytes 32.7 Gbits/sec
[ 5] 1.00-2.00 sec 4.20 GBytes 36.1 Gbits/sec
[ 5] 2.00-3.00 sec 4.18 GBytes 35.9 Gbits/sec
[ 5] 3.00-4.00 sec 4.21 GBytes 36.1 Gbits/sec
[ 5] 4.00-5.00 sec 4.20 GBytes 36.1 Gbits/sec
[ 5] 5.00-6.00 sec 4.21 GBytes 36.2 Gbits/sec
[ 5] 6.00-7.00 sec 4.10 GBytes 35.2 Gbits/sec
[ 5] 7.00-8.00 sec 4.20 GBytes 36.1 Gbits/sec
[ 5] 8.00-9.00 sec 4.21 GBytes 36.1 Gbits/sec
[ 5] 9.00-10.00 sec 4.20 GBytes 36.1 Gbits/sec
[ 5] 10.00-10.00 sec 7.76 MBytes 35.3 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 41.5 GBytes 35.7 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
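(Nothing exotic in the invocation; roughly the following, though treat it as
a sketch of the sort of run that produces the output above rather than the
exact command line:)

iperf3 -s                  # on the Linux client (10.5.5.4), the receiver
iperf3 -c 10.5.5.4 -t 10   # on the FreeBSD box (10.5.5.1), the sender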
On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
>
> ----- Adam Stylinski's Original Message -----
> > Hello,
> >
> > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > mode. They have their MTUs maxed at 9000, their ring buffers maxed
> > at 8192, and I can hit around 36 gbps with iperf.
> >
> > When using an NFS client (client = linux, server = freebsd), I see a
> > maximum rate of around 20gbps. The test file is fully in ARC. The
> > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > 1MB.
> >
> > Here's the flame graph of the kernel of the system in question, with
> > idle stacks removed:
> >
> > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> >
> > The longest functions seems like maybe it's the ERMS aware memcpy
> > happening from the ARC? Is there maybe a missing fast path that could
> > take fewer copies into the socket buffer?
>
> Hi Adam -
>
> Some items to look at and possibly include for more responses....
>
> - What is your server system? Make/model/ram/etc. What is your
> overall 'top' cpu utilization 'top -aH' ...
>
> - It looks like you're using a 40gb/s card. Posting the output of
> 'ifconfig -vm' would provide additional information.
>
> - Are the interfaces running cleanly? 'netstat -i' is helpful.
>
> - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
>
> - Inspect 'netstat -m'. Denied? Delayed?
>
>
> - You mention iperf. Please post the options you used when
> invoking iperf and its output.
>
> - You appear to be looking for throughput vs. low latency. Have
> you looked at window size vs. the amount of memory allocated to the
> streams? These values vary based on the bit rate of the connection.
> TCP connections require outstanding un-ack'd data to be held, which
> affects the values below.
>
>
> - What are your values for:
>
> -- kern.ipc.maxsockbuf
> -- net.inet.tcp.sendbuf_max
> -- net.inet.tcp.recvbuf_max
>
> -- net.inet.tcp.sendspace
> -- net.inet.tcp.recvspace
>
> -- net.inet.tcp.delayed_ack
>
> - What threads/irq are allocated to your NIC? 'vmstat -i'
>
> - Are the above threads floating or mapped? 'cpuset -g ...'
>
> - Determine best settings for LRO/TSO for your card.
>
> - Disable nfs tcp drc
>
> - What is your atime setting?
>
>
> If you really think you have a ZFS/kernel issue, and your
> data fits in cache, dump ZFS, create a memory-backed file system,
> and repeat your tests. This will purge a large portion of your
> graph. LRO/TSO changes may do so also.
>
> You also state you are using a Linux client. Are you using
> the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> Have you swapped the Linux system for a fbsd system?
>
> And as a final note, I regularly use Chelsio T62100 cards
> in dual-homed and/or LACP environments in Supermicro boxes with hundreds
> of NFS-boot (bhyve, QEMU, and physical system) clients per server
> with no network starvation or cpu bottlenecks. Clients boot, perform
> their work, and then remotely request image rollback.
>
>
> Hopefully the above will help and provide pointers.
>
> Cheers
>