From: Cheng Cui
Date: Sun, 9 Apr 2023 10:58:43 -0400
Subject: Re: low TCP speed, wrong rtt measurement
To: "Scheffenegger, Richard"
Cc: "Rodney W. Grimes", Richard Perini, freebsd-hackers@FreeBSD.org, rscheff@FreeBSD.org, tuexen@freebsd.org

First of all, we need to make sure there are TCP retransmissions that are caused by packet loss.
Otherwise, TCP congestion control or cwnd is irrelevant.

Tests like the iperf3 run below, or "netstat -s", can report TCP retransmissions.

For example, over a 20ms link, the theoretical max cwnd size is determined by the
Bandwidth Delay Product (BDP):
20ms x 10Mb/s = 25000 Bytes (around 25KB)
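
As a quick sanity check, the same arithmetic with bc(1) (ms x bit/s, divided by 1000 and then by 8 to get bytes):

% echo "20 * 10000000 / 1000 / 8" | bc
25000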

cc@s1:~ % ping -c 3 r1
PING r1-link1 (10.1.1.3): 56 data bytes
64 bytes from 10.1.1.3: icmp_seq=0 ttl=64 time=19.807 ms
64 bytes from 10.1.1.3: icmp_seq=1 ttl=64 time=19.387 ms
64 bytes from 10.1.1.3: icmp_seq=2 ttl=64 time=19.488 ms

--- r1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 19.387/19.561/19.807/0.179 ms

before test:
cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 retransmit timeouts
0 retransmitted
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK retransmissions lost
0 SACK scoreboard overflow

cc@s1:~ % iperf3 -c r1 -t 5 -i 1
Connecting to host r1, port 5201
[  5] local 10.1.1.2 port 49487 connected to 10.1.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.58 MBytes  21.7 Mbits/sec    7   11.3 KBytes
[  5]   1.00-2.00   sec  1.39 MBytes  11.7 Mbits/sec    2   31.0 KBytes
[  5]   2.00-3.00   sec  1.14 MBytes  9.59 Mbits/sec    4   24.1 KBytes
[  5]   3.00-4.00   sec  1.01 MBytes  8.48 Mbits/sec    3   30.4 KBytes
[  5]   4.00-5.00   sec  1.33 MBytes  11.2 Mbits/sec    4   23.0 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-5.00   sec  7.46 MBytes  12.5 Mbits/sec   20             sender
[  5]   0.00-5.02   sec  7.23 MBytes  12.1 Mbits/sec                  receiver

iperf Done.

after test:
cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
20 data packets (28960 bytes) retransmitted            <<
0 data packets unnecessarily retransmitted
0 retransmit timeouts
0 retransmitted
18 SACK recovery episodes
20 segment rexmits in SACK recovery episodes            <<
28960 byte rexmits in SACK recovery episodes
598 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK retransmissions lost
0 SACK scoreboard overflow

> I've tried various transfer protocols: ftp, scp, rcp, http: results
> are similar for all.  Ping times for the closest WAN link is 2.3ms,
> furthest is 60ms.  On the furthest link, we get around 15%
> utilisation. Transfer between
> 2 Windows hosts on the furthest link yields ~80% utilisation.

Thus, theoretical max cwnd the sender can grow up to is:
2.3ms x 2Mb/s = 575 Bytes
60ms  x 2Mb/s = 15000 Bytes (around 15KB)
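
The same bc(1) check for these two paths:

% echo "2.3 * 2000000 / 1000 / 8" | bc
575
% echo "60 * 2000000 / 1000 / 8" | bc
15000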

Best Regards,
Cheng Cui


On Sun, Apr 9, 2023 at 5:31 AM Scheffenegger, Richard <Richard.Scheffenegger@netapp.com> wrote:
Hi,

Adding fbsd-transport too.

For stable-12, I believe all relevant (algorithm) improvements went in.

However, 12.2 is missing D26807 and D26808 - improvements in Cubic to retransmission timeouts (but these are not material).

While 12.1 has none of the improvements done in 2020 to the Cubic module - D18954, D18982, D19118, D23353, D23655, D25065, D25133, D25744, D24657, D25746, D25976, D26060, D26807, D26808.

These should fix numerous issues in cubic - issues which would very likely make it perform poorly, particularly on longer duration sessions.

However, Cubic is heavily reliant on a valid measurement of RTT and of the epoch since the last congestion response (measured in units of RTT). An issue in getting RTT measured properly would derail Cubic for sure (most likely Cubic would inflate cwnd much faster, then run into significant packet loss, very likely loss of retransmissions, followed by retransmission timeouts and shrinking of ssthresh to small values).


I haven't looked into cc_vegas or the ertt module though.

One more initial question: are you using timestamps on that long, thin pipe, or is net.inet.tcp.rfc1323 disabled? (More recent versions allow the selective enabling/disabling of window scaling and timestamps independent of each other, but I don't think this is in any 12 release; see D36863.)
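
On 12.x that is still a single knob covering both window scaling and timestamps, so a quick check would be (1 = both on, 0 = both off):

% sysctl net.inet.tcp.rfc1323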

Finally, you could be using SIFTR to track the evolution of the minrtt value over the course of the session.
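
A rough sketch of that, using the knobs documented in siftr(4) (the log path below is only a placeholder):

# kldload siftr
# sysctl net.inet.siftr.logfile=/var/tmp/siftr.log
# sysctl net.inet.siftr.enabled=1
  ... reproduce the slow transfer ...
# sysctl net.inet.siftr.enabled=0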

Although I suspect that ultimately a tcpdump including the TCP header (-s 80), together with the SIFTR internal state evolution, would be optimal for understanding when and why the RTT values go off the rails.
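
For example (interface, host filter and file name below are only placeholders):

# tcpdump -i em0 -s 80 -w /var/tmp/rtt.pcap 'tcp and host pole-n'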


At first glance, the ertt module may be prone to miscalculations when retransmissions are in play - no special precautions appear to be present to distinguish between the originally sent packet and any retransmission, nor any filtering of ACKs which come in as duplicates. Thus there could be a scenario where an ACK for a spurious retransmission, e.g. due to reordering, leads to a wrong baseline RTT measurement - one that is physically impossible on such a long-distance connection...

But again, I haven't looked into the ertt module so far at all.

How do the base stack RTT-related values look on these misbehaving sessions?
tcpcb-> t_rttmin, t_srtt, t_rttvar, t_rxtcur, t_rtttime, t_rtseq, t_rttlow, t_rttupdated

Best regards,
  Richard




-----Original Message-----
From: Rodney W. Grimes <freebsd-rwg@gndrsh.dnsmgr.net>
Sent: Sonntag, 9. April 2023 02:59
To: Richard Perini <rpp@ci.com.au>
Cc: freebsd-hackers@FreeBSD.org; rscheff@FreeBSD.org
Subject: Re: low TCP speed, wrong rtt measurement





> On Tue, Apr 04, 2023 at 02:46:34PM -0000, Peter 'PMc' Much wrote:
> > ** maybe this should rather go the -net list, but then
> > ** there are only bug messages
> >
> > Hi,
> >   I'm trying to transfer backup data via WAN; the link bandwidth is
> > only ~2 Mbit, but this can well run for days and just saturate the
> > spare bandwidth.
> >
> > The problem is, it doesn't saturate the bandwidth.
> >
> > I found that the backup application opens the socket in this way:
> >       if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> >
> > Apparently that doesn't work well. So I patched the application to
> > do it this way:
> > -      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
> > +      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM,
> > + IPPROTO_TCP)) < 0) {
> >
> > The result, observed with tcpdump, was now noticeably different, but
> > rather worse than better.
> >
> > I tried various cc algorithms; all behaved very badly, with the
> > exception of cc_vegas. Vegas, after tuning the alpha and beta, gave
> > satisfying results with less than 1% tradeoff.
> >
> > But only for a time. After transferring for a couple of hours the
> > throughput went bad again:
> >
> > # netstat -aC
> > Proto Recv-Q Send-Q Local Address          Foreign Address        (state)     CC          cwin   ssthresh   MSS ECN
> > tcp6       0  57351 edge-jo.26996          pole-n.22              ESTABLISHED vegas      22203      10392  1311 off
> > tcp4       0 106305 edge-e.62275           pole-n.bacula-sd       ESTABLISHED vegas      11943       5276  1331 off
> >
> > The first connection is freshly created. The second one runs for a
> > day already, and it is obviously hosed - it doesn't recover.
> >
> > # sysctl net.inet.tcp.cc.vegas
> > net.inet.tcp.cc.vegas.beta: 14
> > net.inet.tcp.cc.vegas.alpha: 8
> >
> > 8 (alpha) x 1331 (mss) = 10648
> >
> > The cwin is adjusted to precisely one tick above the alpha, and
> > doesn't rise further. (Increasing the alpha further does solve the
> > issue for this connection - but that is not how things are supposed
> > to
> > work.)
> >
> > Now I tried to look into the data that vegas would use for its
> > decisions, and found this:
> >
> > # dtrace -n 'fbt:kernel:vegas_ack_received:entry { printf("%s %u %d
> > %d %d %d", execname,\ (*((struct tcpcb **)(arg0+24)))->snd_cwnd,\
> > ((struct ertt *)((*((struct tcpcb
> > **)(arg0+24)))->osd->osd_slots[0]))->minrtt,\
> > ((struct ertt *)((*((struct tcpcb
> > **)(arg0+24)))->osd->osd_slots[0]))->marked_snd_cwnd,\
> > ((struct ertt *)((*((struct tcpcb
> > **)(arg0+24)))->osd->osd_slots[0]))->bytes_tx_in_marked_rtt,\
> > ((struct ertt *)((*((struct tcpcb
> > **)(arg0+24)))->osd->osd_slots[0]))->markedpkt_rtt);\
> > }'
> > CPU     ID                    FUNCTION:NAME
> >   6  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> >  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >   3  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> >   5  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  17  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
> >  11  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
> >  15  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  13  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >  16  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
> >   3  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
> >
> > One can see that the "minrtt" value for the freshly created
> > connection is 56 (which is very plausible).
> > But the old and hosed connection shows minrtt = 1, which explains
> > the observed cwin.
> >
> > The minrtt gets calculated in sys/netinet/khelp/h_ertt.c:
> >         e_t->rtt = tcp_ts_getticks() - txsi->tx_ts + 1; There
> > is a "+1", so this was apparently zero.
> >
> > But source and destination are at least 1000 km apart. So either we
> > have had one of the rare occasions of hyperspace tunnelling, or
> > something is going wrong in the ertt measurement code.
> >
> > For now this is a one-time observation, but it might also explain
> > why the other cc algorithms behaved badly. These algorithms are
> > widely in use and should work - the ertt measurement however is the
> > same for all of them.
>
> I can confirm I am seeing similar problems transferring files to our
> various production sites around Australia. Various types/sizes of links and bandwidths.
> I can saturate the nearby links, but the link utilisation/saturation
> decreases with distance.
>
> I've tried various transfer protocols: ftp, scp, rcp, http: results
> are similar for all.  Ping times for the closest WAN link is 2.3ms,
> furthest is 60ms.  On the furthest link, we get around 15%
> utilisation. Transfer between
> 2 Windows hosts on the furthest link yields ~80% utilisation.

Windows should be using cc_cubic; you say above that you had tried all the congestion algorithms, and only cc_vegas after tuning gave good results.

>
> FreeBSD versions involved are 12.1 and 12.2.

I wonder if cc_cubic is broken in 12.X; it should give similar results to Windows if things are working correctly.
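
For reference, a minimal way to try cubic on a 12.x box, using the standard mod_cc(4)/cc_cubic(4) knobs:

# kldload cc_cubic
# sysctl net.inet.tcp.cc.available
# sysctl net.inet.tcp.cc.algorithm=cubic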

I am adding Richard Scheffenegger as he is the most recent expert on the congestion control code in FreeBSD.

> --
> Richard Perini
> Ramico Australia Pty Ltd   Sydney, Australia   rpp@ci.com.au  +61 2 9552 5500
> ----------------------------------------------------------------------
> ------- "The difference between theory and practice is that in theory
> there is no  difference, but in practice there is"

--
Rod Grimes                                                 rgrimes@freebsd.org