The trouble with sack..

Wed Oct 7 16:18:24 UTC 2015

Greetings all:

Hiren and I have been poking a little bit at the TCP SACK implementation
in FreeBSD, and I think we have pretty much determined it’s sub-optimal, to phrase
it nicely :-)

All the SACK scoreboard stuff works, but what we do with the scoreboard and
how we handle SACKs really does not match what the TCP RFCs say we should do.

Here are a few examples (there are probably more that we have yet to discover):

1) When we finally recognize it’s time to Fast Retransmit we shut the cwnd to 1 MTU.  The
    SACK RFCs tell us to go to 1/2 of the previous cwnd (which is also stored in ssthresh).

2) When we receive a dup-ack we *will not* recognize it as one if, for example, the rwnd changes, even
    if new SACK information is reported in the SACK blocks. This is due to the fact that in non-SACK you don’t
    (on purpose) recognize ACKs where the window changed (since you can’t really tell if it’s a
    plain window update or a dup-ack). This means we occasionally miss out
    on stroking the dup-ack counter and getting out of recovery...

3) When we have more than one hole the goal of SACK was to retransmit every time that
    a hole had 3 dup-acks so that one could recover multiple blocks that were lost. We just
    plain don’t track dup-acks per hole. We do continue to count, but we will wait to retransmit
    anything until after we have drained 1/2 the data in flight from the network at a minimum. And only then
    do we start incrementing cwnd (remember we crashed it to 1 MTU) so that we can retransmit. There
    may be some other twists in the code that we are missing, but this is what we believe (this code could
    probably win the C obfuscation contest if someone were willing to enter it :-D)

4) The way we calculate what is in flight with SACK is wrong. Basically we don’t arrive at
     what’s really in flight, which with SACK you can know if you have a properly maintained
     scoreboard (which we do have).
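
To make (1) and (4) concrete, here is a minimal sketch of the two calculations. It is
not FreeBSD code: struct hole, sack_pipe() and recovery_ssthresh() are made-up names, and
the scoreboard layout is an assumption. The in-flight estimate is in the spirit of the
"pipe" computation in RFC 6675, and the threshold follows the max(FlightSize/2, 2*SMSS)
rule from RFC 5681.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scoreboard entry: one gap the receiver has not SACKed. */
struct hole {
	uint32_t start;	/* first missing byte */
	uint32_t end;	/* one past the last missing byte */
	uint32_t rxmit;	/* bytes [start, rxmit) have been retransmitted */
	int	 lost;	/* nonzero once we have judged the hole lost */
};

/*
 * Estimate how many bytes are still in the network.
 * sack_high is the highest sequence number the receiver has SACKed and
 * snd_max is the highest sequence number sent; every hole is assumed to
 * lie below sack_high. Unsigned subtraction keeps this wrap-safe.
 */
static uint32_t
sack_pipe(const struct hole *holes, size_t nholes,
    uint32_t sack_high, uint32_t snd_max)
{
	uint32_t pipe = snd_max - sack_high;	/* data sent above the last SACK block */
	size_t i;

	for (i = 0; i < nholes; i++) {
		/* The original copy still counts unless the hole was declared lost. */
		if (!holes[i].lost)
			pipe += holes[i].end - holes[i].start;
		/* Whatever we retransmitted from this hole is in flight again. */
		pipe += holes[i].rxmit - holes[i].start;
	}
	return (pipe);
}

/* Threshold on entering recovery: half the flight, never below 2 segments. */
static uint32_t
recovery_ssthresh(uint32_t flight, uint32_t mss)
{
	uint32_t half = flight / 2;

	return (half > 2 * mss ? half : 2 * mss);
}
```

As a worked case: with 3000 bytes sent above the highest SACK block, one 1000-byte hole
that is lost and fully retransmitted, and one 1000-byte hole not yet judged lost,
sack_pipe() comes out at 5000 bytes, and recovery_ssthresh(10000, 1000) is also 5000
rather than a single MTU.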

Hiren and I have a few ideas on how to fix some of these, but I think we may want to discuss
first what Gleb talked about doing at BSD-Canada, at least so I am told, which is to
have each inpcb have a set of function pointers so we can create “new” versions of, say,
tcp_do_segment and tcp_output without changing the original ones.

This way, as we develop fixes and improvements, we can keep the old code in place without
disrupting everyone, and then after everyone has vetted and played with the “new” code we can
switch things out :-)
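
A rough sketch of what the function-pointer idea could look like, purely for discussion.
Everything here (struct tcp_stack_ops, struct conn, conn_input(), conn_output(), the two
tables) is a hypothetical stand-in, not an existing FreeBSD interface; the handlers only
record which stack ran so the dispatch can be seen.

```c
#include <stddef.h>

struct conn;

enum { STACK_DEFAULT = 1, STACK_NEW = 2 };

/* Hypothetical per-connection dispatch table. */
struct tcp_stack_ops {
	int (*do_segment)(struct conn *c, const void *seg, size_t len);
	int (*output)(struct conn *c);
};

/* Stand-in for the per-connection state (the inpcb in the real stack). */
struct conn {
	const struct tcp_stack_ops *ops;	/* stack used by this connection */
	int handled_by;				/* which stack ran last */
};

/* "Old" code paths, left untouched. */
static int
old_do_segment(struct conn *c, const void *seg, size_t len)
{
	(void)seg; (void)len;
	c->handled_by = STACK_DEFAULT;
	return (0);
}

static int
old_output(struct conn *c)
{
	c->handled_by = STACK_DEFAULT;
	return (0);
}

/* "New" code paths under development. */
static int
new_do_segment(struct conn *c, const void *seg, size_t len)
{
	(void)seg; (void)len;
	c->handled_by = STACK_NEW;
	return (0);
}

static int
new_output(struct conn *c)
{
	c->handled_by = STACK_NEW;
	return (0);
}

static const struct tcp_stack_ops stack_default = { old_do_segment, old_output };
static const struct tcp_stack_ops stack_new = { new_do_segment, new_output };

/* Callers go through the table, so they never name a specific stack. */
static int
conn_input(struct conn *c, const void *seg, size_t len)
{
	return (c->ops->do_segment(c, seg, len));
}

static int
conn_output(struct conn *c)
{
	return (c->ops->output(c));
}
```

A connection starts with ops pointing at stack_default; pointing it at stack_new opts
that one connection into the new code while everything else keeps running the old paths.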

By the way, just looking around at NF and doing some quick surveys of SACK, about 99% of
NF connections seem to have SACK enabled, so it’s pretty much widely deployed now, and it’s rare
that we are *not* using the SACK cases in our TCP stack.

Best wishes

R
--------
Randall Stewart
rrs at netflix.com
803-317-4952