low TCP speed, wrong rtt measurement

From: Peter 'PMc' Much <pmc_at_citylink.dinoex.sub.org>
Date: Tue, 04 Apr 2023 14:46:34 UTC
** maybe this should rather go to the -net list, but then
** there are only bug messages there

Hi,
  I'm trying to transfer backup data via WAN; the link bandwidth is
only ~2 Mbit/s, but the transfer may well run for days and just
saturate the spare bandwidth.

The problem is, it doesn't saturate the bandwidth.

I found that the backup application opens the socket in this way:
      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {

Apparently that doesn't work well. So I patched the application to do
it this way:
-      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, 0)) < 0) {
+      if ((fd = socket(ipaddr->GetFamily(), SOCK_STREAM, IPPROTO_TCP)) < 0) {

The result, observed with tcpdump, was now noticeably different, but
worse rather than better.

I tried various cc algorithms; all behaved very badly with the exception
of cc_vegas. Vegas, after tuning the alpha and beta, gave satisfying
results with less than 1% tradeoff.

But only for a time. After transferring for a couple of hours the
throughput went bad again:

# netstat -aC
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)     CC          cwin   ssthresh   MSS ECN
tcp6       0  57351 edge-jo.26996          pole-n.22              ESTABLISHED vegas      22203      10392  1311 off
tcp4       0 106305 edge-e.62275           pole-n.bacula-sd       ESTABLISHED vegas      11943       5276  1331 off

The first connection is freshly created. The second one has been running
for a day already, and it is obviously hosed - it doesn't recover.

# sysctl net.inet.tcp.cc.vegas
net.inet.tcp.cc.vegas.beta: 14
net.inet.tcp.cc.vegas.alpha: 8

8 (alpha) x 1331 (mss) = 10648

The cwin is adjusted to precisely one tick above the alpha, and
doesn't rise further. (Increasing the alpha further does solve the
issue for this connection - but that is not how things are supposed to
work.)

Now I tried to look into the data that vegas would use for its
decisions, and found this:

# dtrace -n 'fbt:kernel:vegas_ack_received:entry { printf("%s %u %d %d %d %d", execname,\
(*((struct tcpcb **)(arg0+24)))->snd_cwnd,\
((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->minrtt,\
((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->marked_snd_cwnd,\
((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->bytes_tx_in_marked_rtt,\
((struct ertt *)((*((struct tcpcb **)(arg0+24)))->osd->osd_slots[0]))->markedpkt_rtt);\
}'
CPU     ID                    FUNCTION:NAME
  6  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
 17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
 17  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
  3  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
  5  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
 17  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 131
 11  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
 15  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
 13  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261
 16  17478         vegas_ack_received:entry ng_queue 11943 1 11943 10552 106
  3  17478         vegas_ack_received:entry ng_queue 22203 56 22203 20784 261

One can see that the "minrtt" value for the freshly created connection
is 56 (which is very plausible).
But the old and hosed connection shows minrtt = 1, which explains the
observed cwin.
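
To make the link between minrtt and the stuck cwin concrete, here is a
minimal sketch of the classic Vegas calculation, fed with the numbers
from the dtrace output above. The struct and function names are made up
for illustration, and the integer arithmetic is an assumption about how
the difference gets computed, not a copy of cc_vegas.c. It also assumes
minrtt and markedpkt_rtt are nonzero.

```c
#include <stdint.h>

/* One line of the dtrace output above, plus the MSS from netstat -aC. */
struct vegas_sample {
	uint32_t marked_snd_cwnd;        /* cwnd when the marked packet left, bytes */
	uint32_t bytes_tx_in_marked_rtt; /* bytes sent during that round trip */
	uint32_t minrtt;                 /* smallest RTT seen so far, ticks */
	uint32_t markedpkt_rtt;          /* RTT of the marked packet, ticks */
	uint32_t mss;                    /* segment size, bytes */
};

/*
 * Classic Vegas estimate of how many segments sit queued in the network:
 * (expected rate - actual rate) * base RTT, expressed in segments.
 * ASSUMPTION: truncating integer division throughout.
 */
long
vegas_diff(const struct vegas_sample *s)
{
	long expected = s->marked_snd_cwnd / s->minrtt;
	long actual = s->bytes_tx_in_marked_rtt / s->markedpkt_rtt;

	return ((expected - actual) * (long)s->minrtt / (long)s->mss);
}

/* Returns +1 to grow cwnd, -1 to shrink it, 0 to leave it alone. */
int
vegas_decision(long diff, long alpha, long beta)
{
	if (diff < alpha)
		return (1);
	if (diff > beta)
		return (-1);
	return (0);
}
```

Under these assumptions the old connection (11943, 10552, minrtt 1,
rtt 131, mss 1331) yields a difference of 8, which is not below alpha = 8,
so the window is held. With minrtt = 1 the expected rate is simply the
whole cwin per tick, so the difference is roughly cwin / mss and the
window parks just above alpha segments. The same inputs with a plausible
minrtt of 56 would give 5, and the window would grow.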

The minrtt gets calculated in sys/netinet/khelp/h_ertt.c:
              e_t->rtt = tcp_ts_getticks() - txsi->tx_ts + 1;
There is a "+1", so the measured tick difference was apparently zero.
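
The arithmetic of that line can be sketched in isolation. The helper
names below are hypothetical. The 32-bit tick counters, and the idea that
minrtt is kept as a running minimum of the samples, are assumptions about
h_ertt.c rather than something verified here.

```c
#include <stdint.h>

/*
 * Same arithmetic as the quoted h_ertt.c line, on 32-bit tick counters.
 * Unsigned subtraction keeps the result correct across a counter wrap.
 */
uint32_t
ertt_rtt_sample(uint32_t now_ticks, uint32_t tx_ts)
{
	return (now_ticks - tx_ts + 1);
}

/*
 * ASSUMPTION: minrtt is a running minimum, with 0 meaning "no sample yet".
 * A single bogus low sample then stays for the life of the connection.
 */
uint32_t
min_rtt_update(uint32_t minrtt, uint32_t sample)
{
	if (minrtt == 0 || sample < minrtt)
		return (sample);
	return (minrtt);
}
```

A sample of 1 can only come out of this if the current tick equals the
recorded send timestamp. If minrtt is indeed a running minimum, one such
sample would be enough to pin it at 1, which would fit a connection that
never recovers.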

But source and destination are at least 1000 km apart. So either we
have had one of the rare occasions of hyperspace tunnelling, or
something is going wrong in the ertt measurement code.

For now this is a one-time observation, but it might also explain why
the other cc algorithms behaved badly. These algorithms are widely in
use and should work - the ertt measurement however is the same for all of
them.