Re: 100Gb performance
- Reply: Mark Saad : "Re: 100Gb performance"
- In reply to: Olivier Cochard-Labbé : "Re: 100Gb performance"
Date: Thu, 19 Jun 2025 13:12:51 UTC
Hi Olivier,

The problem I have is that one end is an Isilon storage, so there is not much I can do on that side. We are running several rsyncs in parallel and getting somewhat better throughput, but nowhere near 10GB/s; actually just barely touching 1GB/s.

Thanks,
	Danny

> On 19 Jun 2025, at 14:55, Olivier Cochard-Labbé <olivier@freebsd.org> wrote:
>
> On Thu, Jun 19, 2025 at 8:20 AM Daniel Braniss <danny@cs.huji.ac.il> wrote:
>> hi,
>>
>> i am running 14.2 on a DELL PowerEdge R750 with a mellanox/nvidia 100Gb nic mlx5en:
>>
>> mce0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
>> options=66ef07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,NV,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,HWRXTSTMP,MEXTPG,VXLAN_HWCSUM,VXLAN_HWTSO>
>> ether ...
>> inet ... netmask 0xfffffc00 broadcast …
>> media: Ethernet 100GBase-KR4 <full-duplex,rxpause,txpause>
>> status: active
>> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>>
>> I’m doing an rsync from an Isilon mounted via this mce0; the best throughput is about 1GB/s, which is a bit depressing.
>>
>
> Regarding this 1GB/s (8Gb/s): it is what I get on my side with a very simple netcat transfer.
> By a simple transfer, I mean one TCP flow with a single netcat process:
>
> On the receiver host:
> nc -l 12345 > /dev/null
> On the sender host:
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12345
> 107374182400 bytes transferred in 77.772515 secs (1380618626 bytes/sec)
>
> Which is about 1.3GB/s, so close to your 1GB/s.
>
> Let’s dig a little more on the sender, displaying the stats for each NIC queue.
> How many queues did the driver configure on my sender system?
> # sysctl dev.mce.0.conf.channels
> dev.mce.0.conf.channels: 40
> So 40 queues (matching my nproc output), great.
> But how many were actually used during this test?
> # sysctl dev.mce.0 | awk '/txstat.*\.bytes/ && $NF != 0'
> dev.mce.0.txstat26tc0.bytes: 74
> dev.mce.0.txstat20tc0.bytes: 120
> dev.mce.0.txstat4tc0.bytes: 112564632016
> dev.mce.0.txstat0tc0.bytes: 60
>
> => Only one queue (number 4 in my example) is used.
> And it is the same problem on the receiver: one queue, one core.
>
> Let’s improve this by running 8 nc processes in parallel; we need 8 different TCP sessions so that RSS selects 8 different queues:
> On the receiver host:
> nc -l 12341 > /dev/null &
> nc -l 12342 > /dev/null &
> nc -l 12343 > /dev/null &
> nc -l 12344 > /dev/null &
> nc -l 12345 > /dev/null &
> nc -l 12346 > /dev/null &
> nc -l 12347 > /dev/null &
> nc -l 12348 > /dev/null
>
> On the sender host:
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12341 &
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12342 &
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12343 &
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12344 &
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12345 &
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12346 &
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12347 &
> dd if=/dev/zero bs=1G count=100 | nc 1.1.1.30 12348
>
> Then we need to add up the output of all those dd:
> 107374182400 bytes transferred in 103.937552 secs (1033064374 bytes/sec)
> 107374182400 bytes transferred in 104.474689 secs (1027753071 bytes/sec)
> 107374182400 bytes transferred in 104.939627 secs (1023199578 bytes/sec)
> 107374182400 bytes transferred in 105.002306 secs (1022588806 bytes/sec)
> 107374182400 bytes transferred in 105.674894 secs (1016080345 bytes/sec)
> 107374182400 bytes transferred in 105.687319 secs (1015960885 bytes/sec)
> 107374182400 bytes transferred in 106.480994 secs (1008388239 bytes/sec)
> 107374182400 bytes transferred in 106.837954 secs (1005019084 bytes/sec)
>
> For a total of 8152054382 bytes/sec (8.15 GBytes/s, or 65Gb/s).
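[Editor's note: the per-flow rates above can be added up with a short awk one-liner instead of by hand. A sketch, assuming the eight dd summary lines have been captured to a file, here called dd-results.txt (an illustrative name, not from the thread):]

```shell
# Sum the "(NNN bytes/sec)" figure from each dd summary line and
# print the aggregate throughput. dd-results.txt is a placeholder
# for wherever the dd stderr output was collected.
awk '/bytes\/sec/ {
       rate = $(NF-1)          # e.g. "(1033064374"
       gsub(/\(/, "", rate)    # strip the opening parenthesis
       total += rate
     }
     END { printf "%.0f bytes/sec (%.2f GB/s)\n", total, total / 1e9 }' dd-results.txt
```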
> You can check the per-queue stats again, and you will notice that 8 of them should have been used.
> So you need a multi-threaded/parallel rsync equivalent (on both sides) to fill your link.
>
>> but tcpdump -i mce0 says:
>> store-09# tcpdump -i mce0 host <same net as mce0>
>> tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
>> listening on mce0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
>> **********
>>
>
> Don’t worry about that libpcap label; from contrib/libpcap/pcap/dlt.h:
> #define DLT_NULL	0	/* BSD loopback encapsulation */
> #define DLT_EN10MB	1	/* Ethernet (10Mb) */
> #define DLT_EN3MB	2	/* Experimental Ethernet (3Mb) */
> #define DLT_AX25	3	/* Amateur Radio AX.25 */
> #define DLT_PRONET	4	/* Proteon ProNET Token Ring */
> #define DLT_CHAOS	5	/* Chaos */
> #define DLT_IEEE802	6	/* 802.5 Token Ring */
> #define DLT_ARCNET	7	/* ARCNET, with BSD-style header */
> #define DLT_SLIP	8	/* Serial Line IP */
> #define DLT_PPP	9	/* Point-to-point Protocol */
> #define DLT_FDDI	10	/* FDDI */
>
> So EN10MB is simply the term libpcap uses for "Ethernet".
>
> Regards,
> Olivier