Re: FreeBSD TCP (with iperf3) comparison with Linux

From: Cheng Cui <cc_at_freebsd.org>
Date: Mon, 03 Jul 2023 20:24:00 UTC
I see. Sorry for the rather terse description in my previous email.

If the iperf3 report shows poor throughput and an increasing count in the "Retr"
column, and "netstat -sp tcp" shows retransmitted packets but no SACK recovery
episodes (SACK is enabled by default), then you are likely hitting the problem I
described, and the root cause is TX queue drops. The tcpdump trace file won't
show any packet retransmissions and the peer won't be aware of packet loss, as
this is a local problem.

cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
139 data packets (300416 bytes) retransmitted       <<
0 data packets unnecessarily retransmitted
3 retransmit timeouts
0 retransmitted
0 SACK recovery episodes                                      <<
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK retransmissions lost
0 SACK scoreboard overflow
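
To cross-check what actually makes it onto the wire, you can also capture the
iperf3 traffic during the run and inspect the trace afterwards, for example (run
as root; the interface name is from my testbed and the output file name is just
an example):

tcpdump -i bce4 -s 100 -w /tmp/iperf3.pcap port 5201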

Local packet drops due to the TX queue being full show up in the interface
statistics, for example:
cc@s1:~ % netstat -i -I bce4 -nd
Name   Mtu Network       Address               Ipkts Ierrs Idrop  Opkts Oerrs Coll Drop
bce4  1500 <Link#5>      00:10:18:56:94:d4    286184     0     0 148079     0    0   54  <<
bce4     - 10.1.1.0/24   10.1.1.2             286183     -     - 582111     -    -    -
cc@s1:~ %

Hope the above stats help with your root-cause analysis. Also, note that
increasing the TX queue size is a workaround, and the tunables are specific to a
particular NIC driver. But you get the idea.
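
For example, for the bce NIC in my testbed the workaround is the pair of loader
tunables from my earlier email below, set in /boot/loader.conf and picked up
after a reboot:

hw.bce.tx_pages=4
hw.bce.rx_pages=4

Other drivers (vmx included) may or may not expose an equivalent knob under a
different name, so check the driver's man page and its loader/sysctl tunables.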

Best Regards,
Cheng Cui


On Mon, Jul 3, 2023 at 11:34 AM Murali Krishnamurthy <muralik1@vmware.com>
wrote:

> Cheng,
>
>
>
> Thanks for your inputs.
>
>
>
> Sorry, I am not familiar with this area.
>
>
>
> A few queries:
>
>
>
> “I believe the default values for the bce TX/RX pages are 2. I happened to find
> this problem before: when the TX queue was full, the driver would not enqueue
> packets and started returning errors, and the TCP layer treated the failed send
> as a loss, leading to retransmissions.”
>
>
>
> Could you please elaborate on what is misunderstood by TCP here? Loss of
> packets should lead to retransmissions anyway.
>
>
>
> Could you point to some stats where I can see such drops due to the queue
> getting full?
>
>
>
> I have a vmx interface in my VM and I have attached a screenshot of the
> ifconfig output for it.
>
> Can anything be understood from that?
>
> Will your suggestion of increasing tx_pages=4 and rx_pages=4 work here? If so,
> I assume the names would be hw.vmx.tx_pages=4 and hw.vmx.rx_pages=4?
>
>
>
> Regards
>
> Murali
>
>
>
>
>
> *From: *Cheng Cui <cc@freebsd.org>
> *Date: *Friday, 30 June 2023 at 10:02 PM
> *To: *Murali Krishnamurthy <muralik1@vmware.com>
> *Cc: *Scheffenegger, Richard <rscheff@freebsd.org>, FreeBSD Transport <
> freebsd-transport@freebsd.org>
> *Subject: *Re: FreeBSD TCP (with iperf3) comparison with Linux
>
>
> I used an emulation testbed from Emulab.net with a Dummynet traffic shaper
> adding 100 ms of RTT between the two nodes. The link capacity is 1 Gbps and
> both nodes run FreeBSD 13.2.
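>
> The delay node does something roughly equivalent to the following ipfw/dummynet
> setup (a sketch only, not the exact Emulab configuration; 50 ms in each
> direction gives the 100 ms RTT, and the addresses are the two testbed hosts):
>
> kldload ipfw dummynet
> ipfw pipe 1 config bw 1000Mbit/s delay 50
> ipfw pipe 2 config bw 1000Mbit/s delay 50
> ipfw add 100 pipe 1 ip from 10.1.1.2 to 10.1.1.3
> ipfw add 200 pipe 2 ip from 10.1.1.3 to 10.1.1.2
> ipfw add 65000 allow ip from any to any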
>
> cc@s1:~ % ping -c 3 r1
> PING r1-link1 (10.1.1.3): 56 data bytes
> 64 bytes from 10.1.1.3: icmp_seq=0 ttl=64 time=100.091 ms
> 64 bytes from 10.1.1.3: icmp_seq=1 ttl=64 time=99.995 ms
> 64 bytes from 10.1.1.3: icmp_seq=2 ttl=64 time=99.979 ms
>
> --- r1-link1 ping statistics ---
> 3 packets transmitted, 3 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 99.979/100.022/100.091/0.049 ms
>
>
> cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
> Connecting to host r1, port 5201
> [  5] local 10.1.1.2 port 56089 connected to 10.1.1.3 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.19 MBytes  35.2 Mbits/sec    0   1.24 MBytes
> [  5]   1.00-2.00   sec  56.5 MBytes   474 Mbits/sec    6   2.41 MBytes
> [  5]   2.00-3.00   sec  58.6 MBytes   492 Mbits/sec   18   7.17 MBytes
> [  5]   3.00-4.00   sec  65.6 MBytes   550 Mbits/sec   14    606 KBytes
> [  5]   4.00-5.00   sec  60.8 MBytes   510 Mbits/sec   18   7.22 MBytes
> [  5]   5.00-6.00   sec  62.1 MBytes   521 Mbits/sec   12   7.86 MBytes
> [  5]   6.00-7.00   sec  60.9 MBytes   512 Mbits/sec   14   3.43 MBytes
> [  5]   7.00-8.00   sec  62.8 MBytes   527 Mbits/sec   16    372 KBytes
> [  5]   8.00-9.00   sec  59.3 MBytes   497 Mbits/sec   14   1.77 MBytes
> [  5]   9.00-10.00  sec  57.0 MBytes   477 Mbits/sec   18   7.13 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   548 MBytes   459 Mbits/sec  130            sender
> [  5]   0.00-10.10  sec   540 MBytes   449 Mbits/sec                 receiver
>
> iperf Done.
>
> cc@s1:~ % ifconfig bce4
> bce4: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>
> options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
> ether 00:10:18:56:94:d4
> inet 10.1.1.2 netmask 0xffffff00 broadcast 10.1.1.255
> media: Ethernet 1000baseT <full-duplex>
> status: active
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>
> I believe the default values for the bce TX/RX pages are 2. I happened to find
> this problem before: when the TX queue was full, the driver would not enqueue
> packets and started returning errors, and the TCP layer treated the failed send
> as a loss, leading to retransmissions.
>
> After adding hw.bce.tx_pages=4 and hw.bce.rx_pages=4 to /boot/loader.conf
> and rebooting:
>
> cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
> Connecting to host r1, port 5201
> [  5] local 10.1.1.2 port 20478 connected to 10.1.1.3 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.15 MBytes  34.8 Mbits/sec    0   1.17 MBytes
> [  5]   1.00-2.00   sec  83.1 MBytes   697 Mbits/sec    0   12.2 MBytes
> [  5]   2.00-3.00   sec   112 MBytes   939 Mbits/sec    0   12.2 MBytes
> [  5]   3.00-4.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
> [  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0   12.2 MBytes
> [  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec    0   12.2 MBytes
> [  5]   6.00-7.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
> [  5]   7.00-8.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
> [  5]   8.00-9.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
> [  5]   9.00-10.00  sec   113 MBytes   947 Mbits/sec    0   12.2 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   985 MBytes   826 Mbits/sec    0            sender
> [  5]   0.00-10.11  sec   982 MBytes   815 Mbits/sec                 receiver
>
> iperf Done.
>
>
>
> Best Regards,
>
> Cheng Cui
>
>
>
>
>
> On Fri, Jun 30, 2023 at 12:26 PM Murali Krishnamurthy <muralik1@vmware.com>
> wrote:
>
> Richard,
>
>
>
> Appreciate the useful inputs you have shared so far. I will try to figure out
> where the packet drops are happening.
>
>
>
> Regarding HyStart, I see the FreeBSD code base has support for this as well.
> May I know when we can expect it in a release, if it is not already available?
>
>
>
> Regarding this point: *“Switching to other cc modules may give some more
> insights. But again, I suspect that momentary (microsecond) burstiness of
> BSD may be causing this significantly higher loss rate.”*
>
> Is there some documentation where I can understand this in more detail?
>
>
>
> Regards
>
> Murali
>
>
>
>
>
> On 30/06/23, 9:35 PM, "owner-freebsd-transport@freebsd.org" <
> owner-freebsd-transport@freebsd.org> wrote:
>
>
>
> Hi Murali,
>
>
>
> > Q. Since you mention two hypervisors - what is the physical network
> > topology in between these two servers? What theoretical link rates would
> > be attainable?
>
> >
>
> > Here is the topology
>
> >
>
> > Iperf end points are on 2 different hypervisors.
>
> >
>
> >  _________________________________        _________________________________
> > |        ESX Hypervisor 1         |      |        ESX Hypervisor 2         |
> > |  ___________    _____________   |      |  ___________    _____________   |
> > | | Linux VM1 |  | BSD 13 VM 1 |  |      | | Linux VM2 |  | BSD 13 VM 2 |  |
> > | |___________|  |_____________|  |      | |___________|  |_____________|  |
> > |_________________________________|      |_________________________________|
> >                  |                                          |
> >                  +--- 10G link connected via L2 Switch ---+
>
> >
>
> >
>
> > The NIC is of 10G capacity on both ESX servers and it has the below config.
>
>
>
>
>
> So, when both VMs run on the same Hypervisor, maybe with another VM to
> simulate the 100ms delay, can you attain a lossless baseline scenario?
>
>
>
>
>
> > BDP for a 16 MB socket buffer: 16 MB * (1000 ms / 100 ms latency) * 8 bits /
> > 1024 = 1.25 Gbps
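> >
> > Spelling out the arithmetic: the 16 MB buffer can be delivered at most once
> > per RTT, so 16 MB * 10 RTTs/s = 160 MB/s = 1280 Mbit/s, and 1280 / 1024 =
> > 1.25 Gbit/s.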
>
> >
>
> > So theoretically we should see close to 1.25 Gbps of bitrate, and we see
> > Linux reaching close to this number.
>
>
>
> Under no loss, yes.
>
>
>
>
>
> > But BSD is not able to do that.
>
> >
>
> >
>
> > Q. Did you run iperf3? Did the transmitting endpoint report any
> retransmissions between Linux or FBSD hosts?
>
> >
>
> > Yes, we used iperf3. I see Linux doing far fewer retransmissions than BSD.
>
> > On BSD, the best performance was around 600 Mbps of bitrate, with roughly
> > 32K retransmissions reported for that run.
>
> > On Linux, the best performance was around 1.15 Gbps of bitrate, with only
> > about 2K retransmissions reported for that run.
>
> > So, as you pointed out, the number of retransmissions in BSD could be the
> > real issue here.
>
>
>
> There are other cc modules available, but I believe one major deviation is
> that Linux performs mechanisms like HyStart, ACKs every packet while slow
> start is detected, and paces packets to achieve more uniform transmissions.
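>
> On the FreeBSD side it is at least easy to experiment with the available cc
> modules, e.g. (stock module and algorithm names; htcp is just one example):
>
> sysctl net.inet.tcp.cc.available
> kldload cc_htcp
> sysctl net.inet.tcp.cc.algorithm=htcp
>
> iperf3's -C option also selects the algorithm per connection.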
>
>
>
> I think the next step would be to find out at which queue those packet
> discards are happening (external switch? delay generator? vSwitch? Ethernet
> stack inside the VM?).
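>
> Inside the guest, the per-interface error/drop counters are a cheap first
> check (the interface name below is only a guess for your vmx device):
>
> netstat -i -nd -I vmx0
> netstat -s -p tcp | egrep "retrans|SACK"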
>
>
>
> Or alternatively, provide your ESX hypervisors with vastly more link
> speed, to rule out any L2 induced packet drops - provided your delay
> generator is not the source when momentarily overloaded.
>
>
>
> > Is there a way to reduce this packet loss by fine-tuning some parameters
> > w.r.t. the ring buffer or other areas?
>
>
>
> Finding where these arise (looking at queue and port counters) would be the
> next step. But this is not really my specific area of expertise beyond
> high-level, vendor-independent observations.
>
>
>
> Switching to other cc modules may give some more insights. But again, I
> suspect that momentary (microsecond) burstiness of BSD may be causing this
> significantly higher loss rate.
>
>
>
> TCP RACK would be another option. That stack has pacing, more fine-grained
> timing, the RACK loss recovery mechanisms, etc. Maybe that helps reduce the
> packet drops observed by iperf and, consequently, yields a higher overall
> throughput.
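>
> A rough sketch of switching a 13.x host to the RACK stack (this assumes a
> kernel built with the extra TCP stacks, i.e. "options TCPHPTS" plus
> "makeoptions WITH_EXTRA_TCP_STACKS=1"):
>
> kldload tcp_rack
> sysctl net.inet.tcp.functions_available
> sysctl net.inet.tcp.functions_default=rack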
>
>