[tcpm] ECN++

Bob Briscoe in at bobbriscoe.net
Sat Feb 15 09:18:38 UTC 2020

Richard, (pls cc tsvwg as the WG equally responsible for TCP ECN 
maintenance, if you think appropriate)

On 14/02/2020 19:06, Scheffenegger, Richard wrote:
> Hi Bob,
>> Interesting.
>> See inline tagged [BB]...
>>> On 25/01/2020 10:41, Scheffenegger, Richard wrote:
>>> Hi Group, Marcelo, Bob,
>>> Another update in this context, which IMHO may be discussed as an
>>> actual change of mechanism with ECN++.
>>> I was looking into the very poor interaction of ECN between a Linux
>>> client and a BSD server, with request-response workload. That is,
>>> where each side sends out less than MSS data, before the application
>>> waits for the other side to respond to this.
>>> Neal pointed out this statement in RFC3168:
>>>     ...the TCP sender sets the CWR flag in
>>>     the TCP header of the first new data packet sent after the window
>>>     reduction.  ...
>>>     When the TCP data sender is ready to set the CWR bit after reducing
>>>     the congestion window, it SHOULD set the CWR bit only on the first
>>>     new data packet that it transmits.
>>> However, BSD is sending out the CWR as soon as possible - while Linux
>>> interprets the SHOULD overly strictly (IMHO) and ignores CWR unless it
>>> is received with (new) data.
>> [BB] Assuming the word '(new)' is optional, I think you're implying that
>> BSD would set CWR on a pure ACK if that were its first packet after it
>> received ECE feedback. Does BSD also set CWR on the first data packet
>> after that (if any)?
> Currently, FBSD is sending CWR with the next packet (pure ACK, Window
> Probe, Retransmission or New Data). But this should be fixed soon(ish).
> I used the bracket because RFC3168 is very clear, that CWR should *only*
> be sent with new data, but Linux would also accept it on any packet with
> data (this includes retransmissions and window probes).
>> I think RFC3168 expects CWR only on data packets - so that the sender
>> of the CWR can distinguish between ECE that the receiver sends before
>> vs after the receiver got the CWR (by whether the ackno of the ECE
>> covers the CWR data packet or not).
> Yes, effectively, CWR is bound to snd_max+1 in the sequence space on
> the first transmission with RFC3168. Unfortunately, it's not stated that
> clearly.
>> Consider 4 data packet exchanges of A>B, B>A, B>A, A>B with CWR
>> on pure ACKs.
>>      A>>>B Data#1 <CE in transit>
>>      A<<<B ACK#1 ECE
>>                              ...potentially quiet for a time...
>>      A<<<B Data#101 ECE ACK#1
>>      A>>>B ACK#101 *CWR*
>>                              ...potentially quiet for a time...
>>      A<<<B Data#102 ECE ACK#1
>>      A>>>B ACK#102
>>                              ...potentially quiet for a time...
>>      A>>>B Data#2 *CWR* ACK#102
>>      A<<<B ACK#2
>> 'A' doesn't know whether the ECE on Data#102 was sent before or
>> after B received the CWR on A's pure ACK, so 'A' doesn't know
>> whether to reduce its window again or not.
>> I can't find anything in RFC3168 that explicitly spells out when the sender
>> considers ECE to be in a new round. It jumps straight to describing the
>> exceptional case of a CWR packet being dropped, and omits the 'normal'
>> case of it being delivered.
> Well, if the ECE do not stop with the ACK covering the (former) snd_max+1,
> the CWR may have been lost (or delivered with a CE mark).
[BB] Which end are you talking about? As sender or as receiver? Which 
stage in the sequence are you talking about? Can you be more specific 
relative to the above example, where there are two transactions in a row 
initiated from B without any data from A in between. I deliberately 
constructed this so that A will not be able to tell whether B sent the 
ECE on its second volley (Data#102) before or after B received the CWR 
on A's pure ACK of B's first volley (Data#101).
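
The ambiguity can be put as a small sketch. It is illustrative only: `classify_ece` is a hypothetical helper, not taken from any stack. It identifies packets by the simple packet numbers used in the example above rather than by byte sequence numbers, and it assumes the receiver stops sending ECE as soon as it receives a CWR.

```python
from typing import Optional


def classify_ece(ack_no: int, cwr_data_no: Optional[int]) -> str:
    """Classify an arriving ECE as seen by the host that sent CWR.

    ack_no      -- highest data packet number acknowledged by the packet
                   that carries the ECE
    cwr_data_no -- number of the data packet that carried CWR, or None if
                   CWR went out on a pure ACK (which occupies no sequence
                   space and is never acknowledged)
    """
    if cwr_data_no is None:
        # Nothing ties the CWR to the ACK stream, so the peer may or may
        # not have seen it when it sent this ECE.
        return "ambiguous"
    if ack_no >= cwr_data_no:
        # The peer acknowledged the CWR packet, so it had received CWR
        # before sending this ECE: the ECE reports congestion again.
        return "new"
    # The peer had not yet acknowledged the CWR packet.
    return "repeat"


# CWR on a pure ACK (ACK#101 above): the ECE on Data#102 carries ACK#1.
print(classify_ece(ack_no=1, cwr_data_no=None))
# CWR on Data#2: an ECE that still carries ACK#1 predates the CWR.
print(classify_ece(ack_no=1, cwr_data_no=2))
# An ECE on a packet that carries ACK#2 was sent after the CWR arrived.
print(classify_ece(ack_no=2, cwr_data_no=2))
```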

My purpose was to prove that, if A sends CWR on a pure ACK, it makes 
itself unable to tell whether to reduce its window in response to any 
ECE it receives subsequently, because it cannot tell whether they are 
new or repeats. Therefore, RFC3168 must have meant 'the TCP sender MUST 
set the CWR flag in the TCP header of the first new data packet...' when 
it said:

    "the TCP sender sets the CWR flag in
    the TCP header of the first new data packet sent after the window
    reduction."

I suspect no-one thought to preclude the opposite, i.e. to say 'The 
sender MUST NOT set CWR on packets without new data'. Instead, it went 
straight to clarifying whether '/first/ new data' needed to be mandatory.

    "it SHOULD set the CWR bit only on the first
    new data packet that it transmits"

I reviewed RFC3168 at the time, as did many others, but no-one noticed 
these omissions.

> In either case,
> another congestion response is appropriate.
[BB] Surely not, if the CWR was on a pure ACK?

>> This seems to be missing - perhaps it ought to be added by an erratum.
>>> But binding the CWR flag to a new data segment delays the ECN
>>> signaling loop artificially (for long runs of unidirectional
>>> transmitted data), and it is not clear what the benefit there would
>>> be, as the CWR flag is not retransmitted anyway (thus not bound to a
>>> point in the sequence number space).
>> [BB] Surely long runs of unidirectional transmitted data don't exhibit
>> this problem, 'cos there's plenty of new data to carry the CWR. Or
>> have I misunderstood?
> Transactional data has frequent changes in data direction. But that behavior
> (delaying CWR to only send it with new data) IMHO makes ECN unusable for Ack
> Moderation (AckCC). Just wondering, if a different way of tracking the packets
> in flight during a window would allow the immediate transmission of CWR
> with ECN++, to make ECN useful for Ack Moderation or other future purposes.
[BB] CWR is only used if RFC3168 feedback has been negotiated.
The WG consensus was that we shouldn't put ECT on pure ACKs if RFC3168 
feedback has been negotiated; only if AccECN has been negotiated.

So, once you have AccECN and ECN++, you can do what you are wanting, 
'cos AccECN allows both the data receiver to set ECT on pure ACKs and 
the data sender to count CEs on arriving pure ACKs. The data sender then 
includes this count in the ACE counter it feeds back on subsequent 
packets it sends (probably data packets, but could be on pure ACKs as 
well when data is bi-directional).
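
A rough sketch of that counting, for illustration only. `AccEcnCounter` and its method names are hypothetical. The 3-bit ACE width and the initial counter value of 5 are my reading of the AccECN draft and are assumptions to be checked against it. The point is just that a CE mark is counted whether or not the packet carries data.

```python
class AccEcnCounter:
    """Counts CE-marked packets arriving at one end of a connection."""

    ACE_BITS = 3  # assumed width of the ACE field

    def __init__(self, initial: int = 5) -> None:
        # Assumed initial value; pass initial=0 to count from zero.
        self.cep = initial

    def on_packet(self, ce_marked: bool, payload_len: int) -> None:
        # payload_len is deliberately ignored: a CE mark on a pure ACK
        # (payload_len == 0) counts exactly like one on a data packet.
        if ce_marked:
            self.cep += 1

    def ace_field(self) -> int:
        # Only the low-order bits fit in the header, so the value wraps.
        return self.cep % (1 << self.ACE_BITS)


counter = AccEcnCounter(initial=0)
counter.on_packet(ce_marked=True, payload_len=0)      # CE on a pure ACK
counter.on_packet(ce_marked=False, payload_len=1448)  # unmarked data
counter.on_packet(ce_marked=True, payload_len=1448)   # CE on data
print(counter.ace_field())
```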

>> In fact, the problem I see with RFC3168 is the opposite case. It seems
>> there was an assumption that a data sender would be continually sending
>> data, so that, once ECE feedback appeared at the sender, it would conveniently
>> always have some data to send, on which CWR could be carried.
> Correct. With transactional data, this assumption breaks and long flights of
> packets will continuously have ECE set.
[BB] I have a terminology blockage here. I interpret 'long flights of 
packets' to be a long-running flow - the opposite of 'transactional 
data'. But you seem to be using 'long flights of data' to describe 
'transactional data'. Not sure what the disconnect is.

>> For instance, in the sequence above, host A might not send Data#2 for ages
>> or perhaps never (a typical case if 'A' is a client requesting a large object). In
>> the intervening time, B might send far more than the two packets shown. If 'A'
>> does not set CWR on pure ACKs, all B's data packets would have to carry ECE,
>> perhaps for many hours, until A has some more data to send (if ever).
>> Nonetheless, I think ECE continuing for hours is fine within the logic of RFC3168.
>> While A isn't sending anything, it only reduces cwnd once, and it's not
>> measuring any round trips, so it's not increasing cwnd either.
>> Can you describe your case more precisely, so I can understand what caused
>> the performance hit?
> The problem comes from FBSD sending CWR (typically) on pure ACKs - where
> Linux ignores the CWR and keeps ECE set; thus FBSD observes "new" ECE in the
> following window, resulting in another congestion response - while no further
> CE mark was actually present.
[BB] I think you are solely talking about the case where sending of each 
transactional data volley strictly alternates between A and B.

I am trying to get you to think about a case where one end sends more 
than one volley in succession, in order to show that setting CWR on a 
pure ACK creates problems. That's why I asked, how does an FBSD receiver 
(or Linux receiver for that matter) know what a 'following window' is if 
it's only sending pure ACKs? They all have the same seqno and they don't 
get ACK'd. So the pure receiver can't know what an RTT is.

I also asked whether, if the FBSD host (that has been receiving data) 
does send some data between its pure ACKs, does it repeat the CWR on the 
data even tho it already set it on an earlier pure ACK?

> This leads to a race to the bottom for FBSD, where cwnd continuously shrinks
> (due to another bug, down to a size of 0 [zero]). Eventually, FBSD clocks out
> 1 byte every 4 seconds with a timer normally used for Window Probes (but
> unlike Window Probes, that one byte is actually new data).
> Eventually CWR may be sent with a data packet (that 1-byte probe), accepted
> by Linux, ECE unlatched, and cwnd can start growing again. But that event may
> happen only many minutes into entering this situation (got a trace, where this
> effect lasted nearly 10 minutes).
> Obviously, the reverse data direction doesn’t suffer the same problem, nor does
> a typical FBSD-FBSD ECN session suffer that fate (even though CWR are sent mostly
> on pure ACKs)
>>> I therefore propose a change in the Generalized ECN draft, to lift the
>>> above restriction (while it is "only" a SHOULD, this is one more
>>> example of an overly strict receiving-side implementation), and no
>>> longer artificially delay the CWR signal - to become also more useful
>>> for passive measurements.
>> [BB] I'm not yet convinced that this CWR behaviour is anything to do with
>> the ECN++ draft. But that might be because I've misunderstood your
>> description. As I said above, it might be possible to rectify omissions
>> with an erratum to RFC3168.
> In RFC3168, both ECT-codepoints and the CWR flag are defined to only be set on
> new data segments.
[BB] Not strictly true for the CWR flag, as discussed earlier, this part 
of RFC3168 seems not to have been fully tied up.

> Thus setting both can be simplified into a single branch; getting the CWR indication
> slightly faster (not binding it to snd_max+1), that is, even with the next pure ACK, probe
> or retransmission, would possibly provide faster feedback (passive measurement
> of the window), and could enable the use of ECN signals for ACK moderation.
> The drawback would be additional congestion responses, if the packet carrying the CWR
> is dropped (congestion on the path carrying the pure ACKs). But IIRC, reacting to congestion
> on the (relatively) low bandwidth reverse direction by moderating the transmission speed
> (or asking for fewer ACKs) may be not too inappropriate...
[BB] I don't believe the CWR mechanism is suitable for faster congestion 
notification or for Ack CC. The way it's designed is tied in so much to 
reliable delivery that it actually breaks if you send it unreliably on 
pure ACKs (as I've tried to prove, and as you have discovered).
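
A toy model of the failure discussed in this thread, as a sketch under stated assumptions rather than a description of either stack's code. `simulate` is a hypothetical function. It assumes one CE mark before the first round, a receiver that keeps ECE latched until it sees CWR on a data packet, a sender that halves cwnd once per round while ECE persists and otherwise adds one segment, and a cwnd floor of 1 (the trace described above went to 0 because of a separate bug).

```python
from typing import List


def simulate(cwr_on_pure_ack: bool, rounds: int, cwnd: int = 64) -> List[int]:
    """Return the sender's cwnd (in segments) after each round."""
    ece_latched = True  # assumption: a single CE mark arrived earlier
    history = []
    for _ in range(rounds):
        if ece_latched:
            # Sender sees ECE in this window and reduces once.
            cwnd = max(cwnd // 2, 1)
            # Sender then signals CWR. The receiver only honours it
            # when it arrives on a data packet.
            if not cwr_on_pure_ack:
                ece_latched = False
        else:
            cwnd += 1  # no ECE this round: additive increase
        history.append(cwnd)
    return history


print(simulate(cwr_on_pure_ack=True, rounds=6))
print(simulate(cwr_on_pure_ack=False, rounds=6))
```

With these assumptions the first call halves cwnd every round until it hits the floor, although only one CE mark ever occurred. The second call halves it once and then grows again.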

>> more...
>>> For those interested: The effect of ignoring the CWR on non-new-data
>>> segments by Linux is, that the ECE flag is left latched. Thus BSD
>>> continues window-after-window with cwnd reductions,
>> [BB] If it's not sending new data, how does the BSD host consider that
>> windows are starting or completing?
> It still tracks snd_recover (snd_max when the first ECE is received). I have not
> looked into this in detail yet, though. It may even reduce cwnd while being
> the passive receiver...
[BB] (repeating what I've said already) I don't see how it can know what 
an RTT is when it's continually sending pure ACKs with the same seqno.


>> Bob

Bob Briscoe                               http://bobbriscoe.net/
