Re: [Differential] D26807: move cwnd and ssthresh updates into individual congestion control module

Scheffenegger, Richard Richard.Scheffenegger at netapp.com
Tue Nov 24 11:24:07 UTC 2020


Hi Lawrence,

>>> First up, it seems buggy to me that the sender side accounts for **//received//** SACK blocks in the "effective" MSS computed by `tcp_maxseg()` for outbound segments. (Not your bug, but your change amplifies the effect of the bug).
>>
>> Not quite sure I get your point here; are you saying that tcp_maxseg() is erroneously including received SACK blocks from the previous ACK? Maybe you can send an annotated code snippet per direct mail?
>
>  It keys off `tp->rcv_numsacks` per the implementation in tcp_subr.c <https://svnweb.freebsd.org/base/head/sys/netinet/tcp_subr.c?revision=367122&view=markup#l3026>. I'm pretty sure this is an unintentional bug in the implementation of `tcp_maxseg()`. 
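
For reference, the deduction there looks roughly like the following (my paraphrase of the linked tcp_subr.c revision, not a verbatim copy):

------
u_int
tcp_maxseg(const struct tcpcb *tp)
{
	u_int optlen;

	if (tp->t_flags & TF_NOOPT)
		return (tp->t_maxseg);

	optlen = 0;
	if (tp->t_flags & TF_RCVD_TSTMP)
		optlen = TCPOLEN_TSTAMP_APPA;		/* 12 bytes */
	/* ... always-present options such as MD5 signature elided ... */
	if ((tp->t_flags & TF_SACK_PERMIT) && tp->rcv_numsacks > 0) {
		optlen += TCPOLEN_SACKHDR +
		    tp->rcv_numsacks * TCPOLEN_SACK;	/* 2 + n*8 bytes */
		optlen = roundup2(optlen, 4);		/* pad to 32-bit words */
	}
	return (tp->t_maxseg - optlen);
}
------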


I tried to elicit an erroneous response using packetdrill for this, but was not really able to...

If I understand it correctly: once the sender has received a number of SACK blocks, the next segment it sends should be shorter by the size of those SACK option bytes, when tcp_maxseg() drives the segment size.
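
As a quick sanity check of what that would mean for the script below (my arithmetic, under the assumption that the received SACK ranges really were deducted):

------
#include <stdio.h>

int
main(void)
{
	/* numbers from the script: mss 1012, TSopt costs 12 bytes */
	int mss = 1012, ts = 12;
	/* 3 SACK blocks: 2 header + 3*8 bytes, padded to 32-bit words */
	int sackopt = ((2 + 3 * 8) + 3) & ~3;	/* 26 -> 28 */

	printf("payload with TS only:      %d\n", mss - ts);		/* 1000 */
	printf("payload if SACKs deducted: %d\n", mss - ts - sackopt);	/* 972 */
	return (0);
}
------

The script therefore expects 1000-byte segments in the second burst; if the received SACK blocks were deducted, those segments should come out at 972 bytes and packetdrill would flag the mismatch.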

This is the script I tried to see this:

------
// Testing tcp_maxseg() vs. SACK

--tolerance_usecs=2000000

// Load and enable DCTCP module and flush hostcache
0.0 `sync` // in case of crash
//+0.1 `tcpdump -i tun0 -w dsack-test.trc tcp &`
+0.1 `sysctl net.inet.tcp.cc.algorithm=newreno`
+0.1 `sysctl net.inet.tcp.initcwnd_segments=10`
+0.1 `sysctl net.inet.tcp.ecn.enable=1`
+0.1 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a listening TCP socket.
+0.50 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.01 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
+0.01 bind(3, ..., ...) = 0
+0.01 listen(3, 1) = 0


// Establish a TCP connection.
+0.04 <[noecn] SEW  0:0(0) win 65535 <mss 1012, sackOK, wscale 10, TS val 100 ecr 0, eol>
+0.00 >[noecn] SE. 0:0(0) ack 1 win 65535 <mss 1460,nop,wscale 6,sackOK,TS val 1 ecr 100>
+0.00 <[noecn]  . 1:1(0) ack 1 win 65535 <nop, nop, TS val 101 ecr 1>
+0.00 accept(3, ..., ...) = 4
+0.01 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

+0 write(4, ..., 5000) = 5000
// Send 10 packets in two bursts of 5, with simulated "drops"
+0    >[ect0]  .      1:1001(1000) ack 1 <nop, nop, TS val 2 ecr 101>
+0    >[ect0]  .   1001:2001(1000) ack 1 <nop, nop, TS val 3 ecr 101>
+0    >[ect0]  .   2001:3001(1000) ack 1 <nop, nop, TS val 4 ecr 101>
+0    >[ect0]  .   3001:4001(1000) ack 1 <nop, nop, TS val 5 ecr 101>
+0    >[ect0]  P.  4001:5001(1000) ack 1 <nop, nop, TS val 6 ecr 101>

+0    <[noecn] .   1:1(0) ack 1 win 65535 <nop, nop, TS val 102 ecr 1, nop, nop, sack 4001:5001 2001:3001 1:1001>

+0.01 write(4, ..., 5000) = 5000
+0    >[ect0]  .   5001:6001(1000) ack 1 <nop, nop, TS val 7 ecr 102>
+0    >[ect0]  .   6001:7001(1000) ack 1 <nop, nop, TS val 8 ecr 102>
+0    >[ect0]  .   7001:8001(1000) ack 1 <nop, nop, TS val 9 ecr 102>
+0    >[ect0]  .   8001:9001(1000) ack 1 <nop, nop, TS val 10 ecr 102>
+0    >[ect0]  P.  9001:10001(1000) ack 1 <nop, nop, TS val 11 ecr 102>
------


But rcv_numsacks is not the number of SACK option blocks received; rather, it is the number of distinct ranges currently held in sackblks[], a capped number from the scoreboard. In other words, it is the number of SACK option blocks that may be sent with the next ACK (the maximum in the TCP options is 4 blocks, or 3 with TS in use; the tcpcb stores a maximum of 6).
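
Roughly, the flow as I understand it (a sketch, not verbatim kernel code):

------
/*
 * Receive path: out-of-order data arrives and the range is recorded
 * on the scoreboard, so it can be reported back in our next ACK
 * (see tcp_sack.c).
 */
tcp_update_sack_list(tp, th->th_seq, th->th_seq + tlen);
	/* updates tp->sackblks[] and tp->rcv_numsacks */

/*
 * Send path: tcp_output() appends up to rcv_numsacks SACK blocks to
 * the options of the ACK being transmitted, i.e. the very bytes that
 * tcp_maxseg() deducts.
 */
------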

IMHO, that code in tcp_maxseg() is correct...




Richard Scheffenegger
Consulting Solution Architect
NAS & Networking

NetApp
+43 1 3676 811 3157 Direct Phone
+43 664 8866 1857 Mobile Phone
Richard.Scheffenegger at netapp.com

https://ts.la/richard49892


-----Original Message-----
From: lstewart (Lawrence Stewart) <phabric-noreply at FreeBSD.org> 
Sent: Tuesday, 24 November 2020 06:25
To: rscheff at freebsd.org
Subject: [Differential] D26807: move cwnd and ssthresh updates into individual congestion control module


lstewart added a comment.


  In D26807#610464 <https://reviews.freebsd.org/D26807#610464>, @rscheff wrote:

  > In D26807#610449 <https://reviews.freebsd.org/D26807#610449>, @lstewart wrote:
  >
  >> In D26807#604494 <https://reviews.freebsd.org/D26807#604494>, @rscheff wrote:
  >>
  >>> incrementing by t_maxseg (w/o TSopt) vs. tcp_maxseg() (w/ TSopt/MPTCP) does not lead to an unexpected transmission of 2 segments, when only a single segment would be expected (tcp_output prefers to send full segments but considers option overhead).
  >>>
  >>> Incrementing by tcp_maxseg() is slightly less aggressive during CA, not?
  >>
  >> First up, it seems buggy to me that the sender side accounts for **//received//** SACK blocks in the "effective" MSS computed by `tcp_maxseg()` for outbound segments. (Not your bug, but your change amplifies the effect of the bug).
  >
  > Not quite sure I get your point here; are you saying that tcp_maxseg() is erroneously including received SACK blocks from the previous ACK? Maybe you can send an annotated code snippet per direct mail?

  It keys off `tp->rcv_numsacks` per the implementation in tcp_subr.c <https://svnweb.freebsd.org/base/head/sys/netinet/tcp_subr.c?revision=367122&view=markup#l3026>. I'm pretty sure this is an unintentional bug in the implementation of `tcp_maxseg()`.

  >> I'll try ground the remainder of my thoughts with my interpretation of the relevant "first principles" and practical considerations...
  >>
  >> The congestion window maintained at the sender attempts to track an estimate of pipe capacity, but does so without knowledge of whether the capacity constraint applies at a particular TCP/IP layer or includes/excludes certain sources of non-application-payload overhead. TCPs typically take the shortcut of approximating pipe capacity purely in terms of TCP segment payload, but most IP paths I'm aware of track bits in terms of IP payload, i.e., it is likely more correct for most if not all paths to consider the congestion window as representing pipe capacity inclusive of all TCP segment overhead (fixed header and options). Doing so ensures that, for a given congestion window, connections with different TCP segment sizes account for their sent TCP-layer bits the same. It is also worth noting that paths often account based on number of packets in addition to or instead of bits, i.e., tracking the congestion window in terms of both bits and number of segments is potentially of value.
  >>
  >> So, in a high level sense, accounting for TCP option bits in a given send window worth of data is both justified and arguably more correct. However, using `tcp_maxseg()` in congestion control accounting effectively moves us further away from this and towards accounting purely based on payload bits.
  >
  > I agree with you on the first principles approach. We have had this kind of discussion @IETF during reECN (ConEx), and the fact that many don't appreciate this subtle difference was a major hurdle in the process of getting traction for AccECN (which is why there is a higher-fidelity bytes-based signal to begin with, and, additionally, the question whether it shall include various header bytes or not).
  > However, such a discussion is certainly well beyond the scope of this Diff (which I've committed already).

  Agreed, the broader discussion is not specific to this changeset (probably should move it to freebsd-transport@) but I wanted to touch on some of my thinking to provide some background for my concerns.

  What is directly relevant to this changeset is the subtle but potentially impactful change to newreno's cwnd accounting by switching to `tcp_maxseg()` in `newreno_cong_signal()`, which I flagged previously because it is a functional change that also introduces inconsistency within the module, given that t_maxseg is still used elsewhere. FYI, I opted to exclude this changeset from our internal Netflix tree until we can either quantify the effects or figure out a different way of addressing the underlying problem, given my concern that this may pessimise connections with small send windows. That's what prompted me to circle back to this review to better understand the underlying issue being addressed by the change, and to publicly flag my concern that there is in fact a subtle functional change being introduced here.
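
  To make the change concrete, the switch in `newreno_cong_signal()` amounts to roughly the following (a paraphrase of the pattern in cc_newreno.c, not the verbatim diff):

    u_int mss;

    mss = tcp_maxseg(ccv->ccvc.tcp);	/* was: mss = CCV(ccv, t_maxseg); */
    cwin = max(cwin / 2 / mss, 2) * mss;

  i.e. the halved window is now quantized to option-adjusted segment sizes, which is the functional difference being discussed.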

  > The concrete issue addressed here is that, during loss recovery, incrementing cwnd by t_maxseg when TSopt is in use leaves 12 payload bytes "unused" in a transmission opportunity, which may be sent out right away. The cost of transmitting two segments (with far more header bytes, and potentially Ethernet padding bytes, than the 12 bytes of payload) certainly makes it more attractive to NOT incorrectly include these header option bytes.

  Ah, I see. My initial reaction to such an issue is that we should consider addressing it by tuning the runt transmission avoidance checks in tcp_output() <https://svnweb.freebsd.org/base/head/sys/netinet/tcp_output.c?revision=367122&view=markup#l578> rather than marginally gaming cwnd, which may still not prevent a runt transmission if, for example, the effective MSS varied over the preceding window of data for some reason.
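
  For reference, the decision logic I mean looks roughly like this (paraphrasing the linked tcp_output.c, not a verbatim copy):

    if (len) {
            if (len >= tp->t_maxseg)
                    goto send;		/* full-sized segment: always send */
            /*
             * Sub-MSS ("runt") lengths only go out under conditions
             * such as flushing the send buffer while idle or with
             * TF_NODELAY set; this is where a 12-byte runt could be
             * suppressed.
             */
            if (!(tp->t_flags & TF_MORETOCOME) &&
                (idle || (tp->t_flags & TF_NODELAY)) &&
                (uint32_t)len + (uint32_t)off >= sbavail(&so->so_snd) &&
                (tp->t_flags & TF_NOPUSH) == 0)
                    goto send;
            /* ... further checks elided ... */
    }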

  > Revamping cwnd to include various headers, some of which are not known to the sender at transmission time, is certainly beyond the scope of a simple bugfix. (E.g. TCP over IP over GRE over IPsec over VxLAN over MPLS may be a valid chain of headers in the middle of a path, where congestion occurs. Most of these technologies try to be as transparent to an endhost as possible, though...)

  Sure, IMO the only reasonable thing TCP can do is fully account for its bits to most accurately derive its pipe capacity estimate.

  >> I also have a concern that tcp_maxseg() pessimises calculations for connections which use TCP options that do not appear on every segment (SACK being the most common case). It's a rounding error and unlikely to be of concern for sufficiently large send windows, but it's a different story at the small-window end of things, and even more so for connections using a smaller-than-typical MSS.
  >
  > IIRC, tcp_maxseg() should take into account those options which would be present "currently". As you note, the SACK option would be the only such TCP option that FreeBSD would be using. However, once we tackle this issue for other options which are only infrequently present (e.g. MP-TCP options, the AccECN option, AckCC(?)), we can address this aspect then.

  It attempts to do so, but crudely. It doesn't track the actual option-related bits per send, doesn't keep a rolling tally per send window or over any other timescale, nor are any retrospective adjustments made if the cumulative effect of slightly incorrect `tcp_maxseg()`-based adjustments begins to add up.

REPOSITORY
  rS FreeBSD src repository

CHANGES SINCE LAST ACTION
  https://reviews.freebsd.org/D26807/new/

REVISION DETAIL
  https://reviews.freebsd.org/D26807

EMAIL PREFERENCES
  https://reviews.freebsd.org/settings/panel/emailpreferences/

To: chengc_netapp.com, rscheff, tuexen, rgrimes, #netapp, #transport, rrs
Cc: lstewart, imp, melifaro, rscheff

