Re: low TCP speed, wrong rtt measurement

From: Cheng Cui <cc_at_freebsd.org>
Date: Sun, 09 Apr 2023 14:58:43 UTC
First of all, we need to confirm that there are TCP retransmissions caused
by packet loss. Otherwise, TCP congestion control (and thus cwnd) is
irrelevant.

Tests like the ones below, using iperf3 or "netstat -s", can report TCP
retransmissions.

For example, over a 20ms link, the theoretical max cwnd size is determined
by the Bandwidth Delay Product (BDP):
20ms x 10Mb/s = 25000 Bytes (around 25KB)
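
Spelled out, the same arithmetic as a tiny C program (only a sketch of
the calculation above, using the numbers from this example):

#include <stdio.h>

int
main(void)
{
	double rtt_s = 0.020;		/* 20 ms round-trip time */
	double bw_bps = 10e6;		/* 10 Mb/s link bandwidth */
	double bdp_bytes = bw_bps * rtt_s / 8.0;	/* bits -> bytes */

	printf("BDP = %.0f bytes\n", bdp_bytes);	/* BDP = 25000 bytes */
	return (0);
}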

cc@s1:~ % ping -c 3 r1
PING r1-link1 (10.1.1.3): 56 data bytes
64 bytes from 10.1.1.3: icmp_seq=0 ttl=64 time=19.807 ms
64 bytes from 10.1.1.3: icmp_seq=1 ttl=64 time=19.387 ms
64 bytes from 10.1.1.3: icmp_seq=2 ttl=64 time=19.488 ms

--- r1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 19.387/19.561/19.807/0.179 ms

before test:
cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 retransmit timeouts
0 retransmitted
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK retransmissions lost
0 SACK scoreboard overflow

cc@s1:~ % iperf3 -c r1 -t 5 -i 1
Connecting to host r1, port 5201
[  5] local 10.1.1.2 port 49487 connected to 10.1.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.58 MBytes  21.7 Mbits/sec    7   11.3 KBytes
[  5]   1.00-2.00   sec  1.39 MBytes  11.7 Mbits/sec    2   31.0 KBytes
[  5]   2.00-3.00   sec  1.14 MBytes  9.59 Mbits/sec    4   24.1 KBytes
[  5]   3.00-4.00   sec  1.01 MBytes  8.48 Mbits/sec    3   30.4 KBytes
[  5]   4.00-5.00   sec  1.33 MBytes  11.2 Mbits/sec    4   23.0 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.00   sec  7.46 MBytes  12.5 Mbits/sec   20             sender
[  5]   0.00-5.02   sec  7.23 MBytes  12.1 Mbits/sec                  receiver

iperf Done.

after test:
cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
20 data packets (28960 bytes) retransmitted            <<
0 data packets unnecessarily retransmitted
0 retransmit timeouts
0 retransmitted
18 SACK recovery episodes
20 segment rexmits in SACK recovery episodes                   <<
28960 byte rexmits in SACK recovery episodes
598 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK retransmissions lost
0 SACK scoreboard overflow

> I've tried various transfer protocols: ftp, scp, rcp, http: results
> are similar for all.  Ping times for the closest WAN link is 2.3ms,
> furthest is 60ms.  On the furthest link, we get around 15%
> utilisation. Transfer between
> 2 Windows hosts on the furthest link yields ~80% utilisation.

Thus, the theoretical max cwnd the sender can grow up to is:
2.3ms x 2Mb/s = 575 Bytes
60ms  x 2Mb/s = 15000 Bytes (around 15KB)
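
As a quick sanity check (a sketch only, assuming cwnd is the limiting
factor), the throughput a connection can reach is roughly cwnd / RTT:

#include <stdio.h>

int
main(void)
{
	double rtt_s = 0.060;		/* 60 ms round-trip time */
	double cwnd_bytes = 15000.0;	/* cwnd capped around the BDP */

	/* cwnd / RTT in Mb/s: ~2 Mb/s, i.e. the link can be filled. */
	printf("max ~%.2f Mb/s\n", cwnd_bytes * 8.0 / rtt_s / 1e6);
	return (0);
}

Conversely, a cwnd stuck well below the BDP caps throughput by the same
ratio, which is what low utilisation on the long links looks like.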

Best Regards,
Cheng Cui


On Sun, Apr 9, 2023 at 5:31 AM Scheffenegger, Richard <
Richard.Scheffenegger@netapp.com> wrote:

> Hi,
>
> Adding fbsd-transport too.
>
> For stable-12, I believe all relevant (algorithm) improvements went in.
>
> However, 12.2 is missing D26807 and D26808 - improvements to Cubic's
> handling of retransmission timeouts (but these are not material).
>
> 12.1, meanwhile, has none of the improvements made to the Cubic module in
> 2020 - D18954, D18982, D19118, D23353, D23655, D25065, D25133, D25744,
> D24657, D25746, D25976, D26060, D26807, D26808.
>
> These should fix numerous issues in Cubic - issues which would very likely
> make it perform poorly, particularly on longer-duration sessions.
>
> However, Cubic relies heavily on a valid measurement of the RTT and of the
> epoch since the last congestion response (measured in units of RTT). A
> problem in getting the RTT measured properly would certainly derail Cubic
> (most likely Cubic would inflate cwnd much too fast, run into significant
> packet loss and very likely loss of retransmissions, followed by
> retransmission timeouts and a shrinking of ssthresh to small values).
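
To illustrate how sensitive this is, here is a sketch of the RFC 8312
cubic window function (illustration only, not the FreeBSD cc_cubic
code): if the time since the last congestion event is overestimated
because the RTT/epoch bookkeeping is off, the computed window explodes.

#include <math.h>
#include <stdio.h>

#define	CUBIC_C		0.4	/* RFC 8312 scaling constant */
#define	CUBIC_BETA	0.7	/* multiplicative decrease factor */

/* Congestion window (in segments) t seconds after the last loss event. */
static double
cubic_cwnd_segs(double t, double wmax_segs)
{
	double k = cbrt(wmax_segs * (1.0 - CUBIC_BETA) / CUBIC_C);

	return (CUBIC_C * pow(t - k, 3.0) + wmax_segs);
}

int
main(void)
{
	/* 2 s after a loss at wmax = 20 segments, versus a bogus t of 20 s. */
	printf("t =  2 s: %6.1f segments\n", cubic_cwnd_segs(2.0, 20.0));
	printf("t = 20 s: %6.1f segments\n", cubic_cwnd_segs(20.0, 20.0));
	return (0);
}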
>
>
> I haven't looked into cc_vegas or the ertt module though.
>
> One more initial question: are you using timestamps on that long, thin
> pipe, or is net.inet.tcp.rfc1323 disabled? (More recent versions allow
> enabling/disabling window scaling and timestamps independently of each
> other, but I don't think that is in any 12 release; see D36863.)
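
(For reference, the usual check is just "sysctl net.inet.tcp.rfc1323";
the snippet below is merely a programmatic equivalent, a sketch using
sysctlbyname(3).)

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int val;
	size_t len = sizeof(val);

	if (sysctlbyname("net.inet.tcp.rfc1323", &val, &len, NULL, 0) == 0)
		printf("net.inet.tcp.rfc1323 = %d\n", val);	/* 1 = enabled */
	return (0);
}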
>
> Finally, you could use SIFTR to track the evolution of the minrtt value
> over the course of the session.
>
> Ultimately, though, I suspect a tcpdump including the TCP header (-s 80),
> together with the evolution of the SIFTR internal state, would be optimal
> for understanding when and why the RTT values go off the rails.
>
>
> At first glance, the ertt module may be prone to miscalculations when
> retransmissions are in play - no special precautions appear to be present
> to distinguish between the originally sent packet and any retransmission,
> nor any filtering of ACKs which come in as duplicates. Thus there could be
> a scenario where an ACK for a spurious retransmission, e.g. due to
> reordering, could lead to a wrong baseline RTT measurement that is
> physically impossible on such a long-distance connection...
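
The precaution I would expect is essentially Karn's algorithm: an ACK
that covers a segment which has been retransmitted is ambiguous and
should not feed the RTT estimator. A sketch (hypothetical bookkeeping,
not the h_ertt.c structures):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-segment bookkeeping. */
struct txseg {
	uint32_t	tx_ts;		/* ticks at (re)transmission */
	int		rexmitted;	/* segment has been retransmitted */
};

/* Karn's algorithm: only segments never retransmitted give RTT samples. */
static int
rtt_sample_usable(const struct txseg *seg)
{
	return (seg->rexmitted == 0);
}

int
main(void)
{
	struct txseg clean = { 100000, 0 }, rexmit = { 100056, 1 };

	printf("clean: %d, retransmitted: %d\n",
	    rtt_sample_usable(&clean), rtt_sample_usable(&rexmit));
	return (0);
}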
>
> But again, I haven't looked into the ertt module so far at all.
>
> How do the base stack's RTT-related values look on these misbehaving
> sessions?
> tcpcb->t_rttmin, t_srtt, t_rttvar, t_rxtcur, t_rtttime, t_rtseq,
> t_rttlow, t_rttupdated
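
Some of these can also be watched from userland while a transfer runs,
via the TCP_INFO socket option (a sketch; it assumes an already
connected socket descriptor s):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>

/* Print RTT-related state for an already-connected TCP socket s. */
static void
print_tcp_rtt(int s)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("srtt %u us, rttvar %u us, cwnd %u, ssthresh %u\n",
		    ti.tcpi_rtt, ti.tcpi_rttvar, ti.tcpi_snd_cwnd,
		    ti.tcpi_snd_ssthresh);
}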
>
> Best regards,
>   Richard
>
>
>
>
> -----Original Message-----
> From: Rodney W. Grimes <freebsd-rwg@gndrsh.dnsmgr.net>
> Sent: Sonntag, 9. April 2023 02:59
> To: Richard Perini <rpp@ci.com.au>
> Cc: freebsd-hackers@FreeBSD.org; rscheff@FreeBSD.org
> Subject: Re: low TCP speed, wrong rtt measurement
>
> > On Tue, Apr 04, 2023 at 02:46:34PM -0000, Peter 'PMc' Much wrote:
> > > ** maybe this should rather go to the -net list, but then
> > > ** there are only bug messages
> > >
> > > Hi,
> > >   I'm trying to transfer backup data via WAN; the link bandwidth is
> > > only ~2 Mbit, but this can well run for days and just saturate the
> > > spare bandwidth.
> > >
> > > The problem is, it doesn't saturate the bandwidth.
> > >
> > > I found that the backup application opens the socket in this way:
> > >       if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> > >
> > > Apparently that doesn't work well. So I patched the application to
> > > do it this way:
> > > -      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> > > +      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM,
> > > + IPPROTO_TCP)) < 0) {
> > >
> > > The result, observed with tcpdump, was now noticeably different, but
> > > rather worse than better.
> > >
> > > I tried various cc algorithms; all behaved very badly, with the
> > > exception of cc_vegas. Vegas, after tuning the alpha and beta, gave
> > > satisfying results with less than 1% tradeoff.
> > >
> > > But only for a time. After transferring for a couple of hours the
> > > throughput went bad again:
> > >
> > > # netstat -aC
> > > Proto Recv-Q Send-Q Local Address    Foreign Address    (state)      CC     cwin  ssthresh   MSS ECN
> > > tcp6       0  57351 edge-jo.26996    pole-n.22          ESTABLISHED  vegas  22203     10392  1311 off
> > > tcp4       0 106305 edge-e.62275     pole-n.bacula-sd   ESTABLISHED  vegas  11943      5276  1331 off
> > >
> > > The first connection is freshly created. The second one has been running
> > > for a day already, and it is obviously hosed - it doesn't recover.
> > >
> > > # sysctl net.inet.tcp.cc.vegas
> > > net.inet.tcp.cc.vegas.beta: 14
> > > net.inet.tcp.cc.vegas.alpha: 8
> > >
> > > 8 (alpha) x 1331 (mss) = 10648
> > >
> > > The cwin is adjusted to precisely one tick above the alpha, and
> > > doesn't rise further. (Increasing the alpha further does solve the
> > > issue for this connection - but that is not how things are supposed
> > > to
> > > work.)
> > >
> > > Now I tried to look into the data that vegas would use for its
> > > decisions, and found this:
> > >
> > > # dtrace -n 'fbt:kernel:vegas_ack_received:entry { printf("%s %u %d %d %d %d", execname,\
> > > (*((struct tcpcb **)(arg0+24)))->snd_cwnd,\
> > > ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->minrtt,\
> > > ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->marked_snd_cwnd,\
> > > ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->bytes_tx_in_marked_rtt,\
> > > ((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->markedpkt_rtt);\
> > > }'
> > > CPU     ID                    FUNCTION:NAME
> > >   6  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> > >  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >   3  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> > >   5  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  17  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> > >  11  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
> > >  15  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  13  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >  16  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
> > >   3  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> > >
> > > One can see that the "minrtt" value for the freshly created
> > > connection is 56 (which is very plausible).
> > > But the old and hosed connection shows minrtt = 1, which explains
> > > the observed cwin.
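
To spell out why a bogus minrtt pins the cwin there, here is a rough
sketch of the classic Vegas estimator (not a copy of cc_vegas.c): diff
estimates how many segments the connection keeps queued in the network,
and Vegas grows cwnd only while diff < alpha and shrinks it when
diff > beta.

long
vegas_diff_segs(long cwnd, long mss, long base_rtt, long cur_rtt)
{
	long expected = cwnd / base_rtt;	/* bytes per tick */
	long actual = cwnd / cur_rtt;		/* bytes per tick */

	return ((expected - actual) * base_rtt / mss);
}

With a correct base_rtt equal to the real RTT the diff stays small and
cwnd can grow. With base_rtt = 1 tick (and a real RTT of around 56
ticks, as the fresh connection suggests) diff is roughly cwnd / mss, so
growth stops as soon as cwnd reaches about alpha * mss: plugging in
cwnd = 11943 and mss = 1331 gives diff ~ 8, i.e. already at alpha.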
> > >
> > > The minrtt gets calculated in sys/netinet/khelp/h_ertt.c:
> > >               e_t->rtt = tcp_ts_getticks() - txsi->tx_ts + 1;
> > > There is a "+1", so this was apparently zero.
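
A toy illustration of how that zero can come about (the tick values are
made up): if the ACK ends up matched against a tx_ts recorded for a
retransmission sent in the same tick, rather than for the original
transmission one RTT earlier, the difference is 0 and the "+1" turns it
into an rtt of 1.

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t now = 100056;		/* tcp_ts_getticks() at ACK arrival */
	uint32_t ts_original = 100000;	/* original transmission, 56 ticks ago */
	uint32_t ts_rexmit = 100056;	/* retransmission sent this tick */

	printf("rtt vs. original: %u ticks\n", now - ts_original + 1);	/* 57 */
	printf("rtt vs. rexmit:   %u ticks\n", now - ts_rexmit + 1);	/* 1 */
	return (0);
}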
> > >
> > > But source and destination are at least 1000 km apart. So either we
> > > have had one of the rare occasions of hyperspace tunnelling, or
> > > something is going wrong in the ertt measurement code.
> > >
> > > For now this is a one-time observation, but it might also explain
> > > why the other cc algorithms behaved badly. These algorithms are
> > > widely in use and should work - the ertt measurement however is the
> > > same for all of them.
> >
> > I can confirm I am seeing similar problems transferring files to our
> > various production sites around Australia. Various types/sizes of links
> > and bandwidths.
> > I can saturate the nearby links, but the link utilisation/saturation
> > decreases with distance.
> >
> > I've tried various transfer protocols: ftp, scp, rcp, http: results
> > are similar for all.  Ping times for the closest WAN link is 2.3ms,
> > furthest is 60ms.  On the furthest link, we get around 15%
> > utilisation. Transfer between
> > 2 Windows hosts on the furthest link yields ~80% utilisation.
>
> Windows should be using cc_cubic; you say above that you had tried all the
> congestion algorithms, and only cc_vegas after tuning gave good results.
>
> >
> > FreeBSD versions involved are 12.1 and 12.2.
>
> I wonder if cc_cubic is broken in 12.X; it should give similar results to
> Windows if things are working correctly.
>
> I am adding Richard Scheffenegger as he is the most recent expert on the
> congestion control code in FreeBSD.
>
> > --
> > Richard Perini
> > Ramico Australia Pty Ltd   Sydney, Australia   rpp@ci.com.au  +61 2 9552 5500
> > -----------------------------------------------------------------------------
> > "The difference between theory and practice is that in theory there is no
> > difference, but in practice there is"
>
> --
> Rod Grimes
> rgrimes@freebsd.org
>