SACK + RTO interaction
Richard.Scheffenegger at netapp.com
Mon Jan 13 23:39:59 UTC 2020
I believe Cheng has uncovered another long-lurking bug, this time in the interaction between RTO and SACK.
Since its inception, tcp_sack_partialack has apparently stopped the RTO timer, which doesn't seem right.
What he observed is this: if you run into a lost retransmission twice during the same SACK loss recovery episode (a lost retransmission is currently only recoverable by RTO), the initial loss is recovered by the RTO (unless a partial ACK disabled the timer before it fired), and the second twice-lost segment is at the mercy of whatever other TCP timer is hopefully still active (keepalive, persist, ...).
I strongly suspect that this code should never have cancelled the RTO, but should instead restart it after a partial ACK. That would at least be more logical: move the timeout forward if you are making some forward progress, rather than stop it completely because one (of possibly many) retransmissions went through. If SACK loss recovery doesn't complete within an RTO (which is many more RTTs than the single RTT a SACK loss recovery should take), it would be prudent to give up and fall back to RTO, no?
This effect may also explain some of the other sporadic, very lengthy SACK recoveries we couldn't really pin down so far...
The patch should be easy enough:

-	tcp_timer_activate(tp, TT_REXMT, 0);
+	tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur);
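To make the difference concrete, here is a toy model of the two behaviours. It is only a sketch, not the kernel code: struct toy_conn, toy_timer_activate, toy_partialack and toy_rto_rescue_tick are hypothetical names invented for illustration, and the model assumes (as I read the stack) that a delta of 0 passed to tcp_timer_activate stops the timer. Time is counted in abstract ticks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the retransmit timer state; NOT the kernel's struct tcpcb. */
struct toy_conn {
	bool rexmt_armed;   /* is the RTO timer running? */
	int  rexmt_expires; /* absolute tick at which it fires, if armed */
	int  rxtcur;        /* current RTO in ticks (models tp->t_rxtcur) */
};

/* Models tcp_timer_activate(tp, TT_REXMT, delta); assumption: delta 0 stops it. */
static void
toy_timer_activate(struct toy_conn *c, int now, int delta)
{
	if (delta == 0) {
		c->rexmt_armed = false;
	} else {
		c->rexmt_armed = true;
		c->rexmt_expires = now + delta;
	}
}

enum partialack_policy {
	STOP_TIMER,    /* current behaviour: partial ACK cancels the RTO */
	RESTART_TIMER  /* proposed behaviour: partial ACK re-arms the RTO */
};

/* Models only the timer handling done when a partial ACK arrives. */
static void
toy_partialack(struct toy_conn *c, int now, enum partialack_policy p)
{
	toy_timer_activate(c, now, p == STOP_TIMER ? 0 : c->rxtcur);
}

/*
 * Recovery starts at tick 0 with the RTO armed; a partial ACK arrives at
 * partialack_tick. Returns the tick at which the RTO could still rescue a
 * retransmission lost after that point, or -1 if no RTO is pending.
 */
static int
toy_rto_rescue_tick(int rxtcur, int partialack_tick, enum partialack_policy p)
{
	struct toy_conn c = { false, 0, rxtcur };

	assert(rxtcur > 0);
	toy_timer_activate(&c, 0, c.rxtcur);
	toy_partialack(&c, partialack_tick, p);
	return (c.rexmt_armed ? c.rexmt_expires : -1);
}
```

With an RTO of 100 ticks and a partial ACK at tick 30, toy_rto_rescue_tick(100, 30, STOP_TIMER) returns -1 (nothing left to rescue a further lost retransmission), while toy_rto_rescue_tick(100, 30, RESTART_TIMER) returns 130.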
BTW, found the same in Darwin.
More information about the freebsd-transport