[tcpm] ECN++

Scheffenegger, Richard Richard.Scheffenegger at netapp.com
Fri Feb 14 19:06:31 UTC 2020

Hi Bob,

> Interesting.
> See inline tagged [BB]...
>> On 25/01/2020 10:41, Scheffenegger, Richard wrote:
>> Hi Group, Marcelo, Bob,
>> Another update in this context, which IMHO may be discussed as an 
>> actual change of mechanism with ECN++.
>> I was looking into the very poor interaction of ECN between a Linux 
>> client and a BSD server, with request-response workload. That is, 
>> where each side sends out less than MSS data, before the application 
>> waits for the other side to respond to this.
>> Neal pointed out this statement in RFC3168:
>>    ...the TCP sender sets the CWR flag in
>>    the TCP header of the first new data packet sent after the window
>>    reduction.  ...
>>    When the TCP data sender is ready to set the CWR bit after reducing
>>    the congestion window, it SHOULD set the CWR bit only on the first
>>    new data packet that it transmits.
>> However, BSD is sending out the CWR as soon as possible - while Linux 
>> interprets the SHOULD overly strictly (IMHO) and ignores CWR unless it 
>> is received with (new) data.
> [BB] Assuming the word '(new)' is optional, I think you're implying that 
> BSD would set CWR on a pure ACK if that were its first packet after it 
> received ECE feedback. Does BSD also set CWR on the first data packet
> after that (if any)?

Currently, FBSD is sending CWR with the next packet (pure ACK, Window 
Probe, Retransmission or New Data). But this should be fixed soon(ish).

I used the brackets because RFC3168 is very clear that CWR should *only* 
be sent with new data, but Linux also accepts it on any packet with 
data (this includes retransmissions and window probes).
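
To make the difference concrete, here is a minimal sketch of the three ways
a receiver could decide whether an incoming CWR unlatches ECE. The Segment
type and the function names are hypothetical, made up for illustration; this
is not code from Linux or FreeBSD, only the behaviour as described in this
thread.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    cwr: bool = False          # CWR flag in the TCP header
    payload_len: int = 0       # bytes of data carried; 0 for a pure ACK
    is_new_data: bool = False  # False for retransmissions and window probes


def clears_ece_strict(seg: Segment) -> bool:
    """Strict RFC 3168 reading: CWR counts only on a new data segment."""
    return seg.cwr and seg.payload_len > 0 and seg.is_new_data


def clears_ece_any_data(seg: Segment) -> bool:
    """Behaviour described for Linux here: any segment that carries data."""
    return seg.cwr and seg.payload_len > 0


def clears_ece_any_packet(seg: Segment) -> bool:
    """Hypothetical relaxed receiver: CWR is honoured on every packet."""
    return seg.cwr
```

A pure ACK with CWR set, which is what FBSD typically sends, passes only the
last check, so a receiver using either of the first two keeps ECE latched.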

> I think RFC3168 expects CWR only on data packets - so that the sender 
> of the CWR can distinguish between ECE that the receiver sends before 
> vs after the receiver got the CWR (by whether the ackno of the ECE 
> covers the CWR data packet or not).

Yes, effectively, CWR is bound to snd_max+1 in the sequence space on 
the first transmission with RFC3168. Unfortunately, it's not stated that 

> Consider 4 data packet exchanges of A>B, B>A, B>A, A>B with CWR 
> on pure ACKs.
>     A>>>B Data#1 <CE in transit>
>     A<<<B ACK#1 ECE
>                             ...potentially quiet for a time...
>     A<<<B Data#101 ECE ACK#1
>     A>>>B ACK#101 *CWR*
>                             ...potentially quiet for a time...
>     A<<<B Data#102 ECE ACK#1
>     A>>>B ACK#102
>                             ...potentially quiet for a time...
>     A>>>B Data#2 *CWR* ACK#102
>     A<<<B ACK#2
> 'A' doesn't know whether the ECE on Data#102 was sent before or 
> after B received the CWR on A's pure ACK, so 'A' doesn't know 
> whether to reduce its window again or not.
> I can't find anything in RFC3168 that explicitly spells out when the sender 
> considers ECE to be in a new round. It jumps straight to describing the 
> exceptional case of a CWR packet being dropped, and omits the 'normal' 
> case of it being delivered.

Well, if ECE does not stop with the ACK covering the (former) snd_max+1,
the CWR may have been lost (or delivered with a CE mark). In either case, 
another congestion response is appropriate.
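
As a rough sketch of that check on the sender side (the function and
parameter names are assumptions for illustration, and sequence number
wraparound is ignored):

```python
def needs_new_response(ack_no: int, ece: bool, cwr_seq_end: int) -> bool:
    """Decide whether ECE on an incoming ACK calls for another response.

    cwr_seq_end is the sequence number just past the data segment that
    carried CWR (hypothetical bookkeeping, not an actual stack variable).
    An ACK that covers that segment and still carries ECE means the CWR
    was lost, or that the segment itself arrived with a CE mark.
    """
    return ece and ack_no >= cwr_seq_end
```

An ECE on an ACK below cwr_seq_end was sent before the receiver could have
seen the CWR, so it belongs to the congestion event already handled.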

> This seems to be missing - perhaps it ought to be added by an erratum.
>> But binding the CWR flag to a new data segment delays the ECN 
>> signaling loop artificially (for long runs of unidirectional 
>> transmitted data), and it is not clear what the benefit there would 
>> be, as the CWR flag is not retransmitted anyway (thus not bound to a 
>> point in the sequence number space).
> [BB] Surely long runs of unidirectional transmitted data don't exhibit 
> this problem, 'cos there's plenty of new data to carry the CWR. Or 
> have I misunderstood?

Transactional data has frequent changes in data direction. But that behavior
(delaying CWR so that it is only sent with new data) IMHO makes ECN unusable for ACK
moderation (AckCC). I am just wondering whether a different way of tracking the packets
in flight during a window would allow the immediate transmission of CWR 
with ECN++, to make ECN useful for ACK moderation or other future purposes.

> In fact, the problem I see with RFC3168 is the opposite case. It seems
> there was an assumption that a data sender would be continually sending
> data, so that, once ECE feedback appeared at the sender, it would conveniently
> always have some data to send, on which CWR could be carried.

Correct. With transactional data, this assumption breaks and long flights of 
packets will continuously have ECE set.

> For instance, in the sequence above, host A might not send Data#2 for ages 
> or perhaps never (a typical case if 'A' is a client requesting a large object). In
> the intervening time, B might send far more than the two packets shown. If 'A'
> does not set CWR on pure ACKs, all B's data packets would have to carry ECE, 
> perhaps for many hours, until A has some more data to send (if ever).
> Nonetheless, I think ECE continuing for hours is fine within the logic of RFC3168. 
> While A isn't sending anything, it only reduces cwnd once, and it's not 
> measuring any round trips, so it's not increasing cwnd either.
> Can you describe your case more precisely, so I can understand what caused 
> the performance hit?

The problem comes from FBSD sending CWR (typically) on pure ACKs, where
Linux ignores the CWR and keeps ECE set. Thus FBSD observes "new" ECE in the 
following window, resulting in another congestion response, while no further
CE mark was actually present.

This leads to a race to the bottom for FBSD, where cwnd continuously shrinks 
(due to another bug, down to a size of 0 [zero]). Eventually, FBSD clocks out
1 byte every 4 seconds with a timer normally used for Window Probes (but
unlike Window Probes, that one byte is actually new data).

Eventually CWR may be sent with a data packet (that 1-byte probe), accepted
by Linux, ECE unlatched, and cwnd can start growing again. But that event may
happen only many minutes after entering this situation (I have a trace where this 
effect lasted nearly 10 minutes).
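
The shrinking can be illustrated with a toy model. Everything here is an
assumption for illustration: one congestion response per window is modelled
as halving cwnd, the sender only ever emits pure ACKs, and there is no lower
bound on cwnd (standing in for the other bug mentioned above). It is not a
model of the real FBSD code.

```python
def shrink_history(cwnd: int, receiver_accepts_cwr_on_pure_ack: bool,
                   max_windows: int = 64) -> list[int]:
    """Return cwnd (in bytes) after each window while ECE stays latched."""
    history = [cwnd]
    ece_latched = True  # the receiver saw a single CE mark
    for _ in range(max_windows):
        if not ece_latched or cwnd == 0:
            break
        cwnd //= 2  # congestion response to the ECE seen in this window
        history.append(cwnd)
        # The sender sets CWR on its next packet, assumed to be a pure ACK.
        if receiver_accepts_cwr_on_pure_ack:
            ece_latched = False
    return history
```

With a receiver that ignores CWR on pure ACKs, the single CE mark produces a
reduction in every window until cwnd reaches zero; with a receiver that
accepts it, the history stops after one reduction.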

Obviously, the reverse data direction doesn’t suffer the same problem, nor does
a typical FBSD-FBSD ECN session suffer that fate (even though CWR is sent mostly
on pure ACKs).

>> I therefore propose a change in the Generalized ECN draft, to lift the 
>> above restriction (while it is "only" a SHOULD, this is one more 
>> example of an overly strict receiving-side implementation), and no 
>> longer artificially delay the CWR signal - to become also more useful 
>> for passive measurements.
> [BB] I'm not yet convinced that this CWR behaviour is anything to do with 
> the ECN++ draft. But that might be because I've misunderstood your 
> description. As I said above, it might be possible to rectify omissions 
> with an erratum to RFC3168.

In RFC3168, both ECT-codepoints and the CWR flag are defined to only be set on
new data segments.

Thus setting both can be simplified into a single branch. Getting the CWR indication
slightly faster (not binding it to snd_max+1), that is, even with the next pure ACK, probe
or retransmission, would possibly provide faster feedback (passive measurement
of the window), and could enable the use of ECN signals for ACK moderation.

The drawback would be additional congestion responses if the packet carrying the CWR
is dropped (congestion on the path carrying the pure ACKs). But IIRC, reacting to congestion
on the (relatively) low bandwidth reverse direction by moderating the transmission speed 
(or asking for fewer ACKs) may not be too inappropriate...
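
A minimal sketch of the two sender policies, with hypothetical names. The
immediate_cwr switch stands for the proposed relaxation; the sketch leaves
non-data packets Not-ECT as in RFC3168 and does not try to show what ECN++
does with ECT on control packets.

```python
def mark_outgoing(is_new_data: bool, cwr_pending: bool,
                  immediate_cwr: bool = False) -> tuple[bool, bool]:
    """Return (ect, cwr) for the next outgoing packet."""
    if is_new_data:
        # RFC 3168: ECT and a pending CWR share the same branch.
        return True, cwr_pending
    # Pure ACK, window probe or retransmission.
    return False, cwr_pending and immediate_cwr
```

With immediate_cwr=False a pending CWR waits for the next new data segment;
with immediate_cwr=True it goes out on whatever packet is sent next.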


>> For those interested: The effect of ignoring the CWR on non-new-data 
>> segments by Linux is, that the ECE flag is left latched. Thus BSD 
>> continues window-after-window with cwnd reductions,
> [BB] If it's not sending new data, how does the BSD host consider that 
> windows are starting or completing?

It still tracks snd_recover (snd_max when the first ECE is received). I have not
looked into this in detail yet, though. It may even reduce cwnd while being
the passive receiver...

> Bob
