Fast recovery ssthresh value

Liang Tian l.tian.email at gmail.com
Tue Sep 15 20:22:19 UTC 2020


Hi Richard,

Thanks. It works well now. We also observed that the majority of the
tinygrams were caused by the use of t_maxseg in
tcp_sack_partialack(). With PRR we now rarely see tinygrams, because 1)
tcp_sack_partialack() is no longer called, and 2) the updated PRR patch
uses the effective maxseg from tcp_maxseg(tp).

Just one comment on the patch: line 2546 is redundant, because maxseg
is already defined and computed at line 2477.

Regards,
Liang

On Tue, Sep 15, 2020 at 10:35 AM Scheffenegger, Richard
<Richard.Scheffenegger at netapp.com> wrote:
>
> Hi Liang,
>
> I was about to send out this email notifying you of the changes to the patch, after you uncovered the issues with TSopt-enabled TCP flows.
>
> https://reviews.freebsd.org/D18892
>
> Can you please re-patch your test machine with this updated version? (I also fixed one merge issue caused by a recent whitespace cleanup, so it should apply cleanly to HEAD now.)
>
> Please let us know and share any comments and criticism about this patch!
>
> Thanks again for testing - and finding the overlooked combination with timestamps.
>
>
> Richard Scheffenegger
>
> -----Original Message-----
> From: Liang Tian <l.tian.email at gmail.com>
> Sent: Freitag, 11. September 2020 19:02
> To: Scheffenegger, Richard <Richard.Scheffenegger at netapp.com>
> Cc: FreeBSD Transport <freebsd-transport at freebsd.org>
> Subject: Re: Fast recovery ssthresh value
>
> NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
>
>
>
> Hi Richard,
>
> Initial tests show PRR is doing quite well. See trace below showing response to TSval 2713381916 and 2713381917.
> I have a comment on the patch: I think all of the tp->t_maxseg uses should be replaced with maxseg in the diff (https://reviews.freebsd.org/D18892),
> where maxseg = tcp_maxseg(tp). This takes TCP options (timestamps, in this case) into account and avoids sending the tinygrams with Len=120 and Len=36 seen in the trace below.
> Interestingly, we were also chasing another issue where we saw a lot of
> 12-byte segments when retransmission happens (before applying the PRR patch). We suspect the mixed usage of t_maxseg and maxseg =
> tcp_maxseg(tp) in the TCP code is causing this: the CC modules all use t_maxseg for cwnd increase instead of the effective SMSS.
>
> [TCP Dup ACK 41541#3] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381914 TSecr=1636604730 SLE=46785317 SRE=46790869
> [TCP Dup ACK 41541#4] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381916 TSecr=1636604730 SLE=46785317 SRE=46804749
> [TCP Dup ACK 41541#5] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381917 TSecr=1636604730 SLE=46785317 SRE=46808913
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44597853 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44599241 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44600629 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44602017 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44603405 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44604793 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44606181 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44607569 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44608957 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44610345 Ack=156 Win=1048576 Len=1388 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44611733 Ack=156 Win=1048576 Len=120 TSval=1636604904 TSecr=2713381916
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44611853 Ack=156 Win=1048576 Len=1388 TSval=1636604905 TSecr=2713381917
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44613241 Ack=156 Win=1048576 Len=1388 TSval=1636604905 TSecr=2713381917
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44614629 Ack=156 Win=1048576 Len=1388 TSval=1636604905 TSecr=2713381917
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44616017 Ack=156 Win=1048576 Len=36 TSval=1636604905 TSecr=2713381917
> [TCP Dup ACK 41541#6] 52466 > 80 [ACK] Seq=156 Ack=44596441 Win=3144704 Len=0 TSval=2713381925 TSecr=1636604730 SLE=46785317 SRE=46867209
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44616053 Ack=156 Win=1048576 Len=1388 TSval=1636604912 TSecr=2713381925
> [TCP Out-Of-Order] 80 > 52466 [ACK] Seq=44617441 Ack=156 Win=1048576 Len=1388 TSval=1636604912 TSecr=2713381925
>
> Thanks,
> Liang
> ...
>
> On Fri, Sep 11, 2020 at 3:40 AM Scheffenegger, Richard <Richard.Scheffenegger at netapp.com> wrote:
> >
> > Perfect!
> >
> > Please share your findings then, as reviews (including informal ones) are needed prior to me committing this patch.
> >
> > Note that it builds upon D18624, which is currently in stable/12 and head, but not any released branches. So you may need to apply that too if you aren't using head.
> >
> > Best regards,
> >
> >
> > Richard Scheffenegger
> >
> > -----Original Message-----
> > From: Liang Tian <l.tian.email at gmail.com>
> > Sent: Freitag, 11. September 2020 06:06
> > To: Scheffenegger, Richard <Richard.Scheffenegger at netapp.com>; FreeBSD
> > Transport <freebsd-transport at freebsd.org>
> > Subject: Re: Fast recovery ssthresh value
> >
> > Hi Richard,
> >
> > Thanks! I'm able to apply the patches. I'll test it.
> >
> > Regards,
> > Liang
> >
> >
> >
> > On Thu, Sep 10, 2020 at 5:49 AM Scheffenegger, Richard <Richard.Scheffenegger at netapp.com> wrote:
> > >
> > > Hi Liang,
> > >
> > > Yes, you are absolutely correct about this observation. Right now, SACK loss recovery will only send one MSS per received ACK - and when ACK thinning is present, it will fail to recover all the missing packets in time, eventually receiving no more ACKs to clock out further retransmissions...
> > >
> > > I have a Diff in review, to implement Proportional Rate Reduction:
> > >
> > > https://reviews.freebsd.org/D18892
> > >
> > > It should address not only that ACK-thinning issue, but also the issue that the current SACK loss recovery has to wait until pipe drops below ssthresh before the retransmissions are clocked out. And then they would actually be clocked out at the same rate as the incoming ACKs. This is the same rate as when the overload happened (barring any ACK thinning), and as a secondary effect it was observed that this behavior, too, can lead to self-inflicted loss - of the retransmissions.
> > >
> > > If you have the ability to patch your kernel with D18892 and observe how it reacts in your dramatic ACK-thinning scenario, that would be good to know! The assumption of the patch was that - as per TCP RFC requirements - there is one ACK for each received out-of-sequence data segment, and that ACK drops / thinning do not happen on the massive scale you describe.
> > >
> > > Best regards,
> > >
> > > Richard Scheffenegger
> > >
> > > -----Original Message-----
> > > From: owner-freebsd-transport at freebsd.org
> > > <owner-freebsd-transport at freebsd.org> On Behalf Of Liang Tian
> > > Sent: Mittwoch, 9. September 2020 19:16
> > > To: Scheffenegger, Richard <Richard.Scheffenegger at netapp.com>
> > > Cc: FreeBSD Transport <freebsd-transport at freebsd.org>
> > > Subject: Re: Fast recovery ssthresh value
> > >
> > > Hi Richard,
> > >
> > > Thanks for the explanation and sorry for the late reply.
> > > I've been investigating SACK loss recovery and I think I'm seeing an
> > > issue similar to the ABC L value issue that I reported previously
> > > (https://reviews.freebsd.org/D26120), and I do believe there is a deviation from RFC 3517:
> > > The issue happens when a DupAck is received during SACK loss recovery in the presence of ACK thinning, or with LRO enabled on the receiver. In that case the SACK block edges can expand by more than 1 SMSS (we've seen 30*SMSS), i.e. a single DupAck can decrement `pipe` by more than 1 SMSS.
> > > RFC 3517 says:
> > >     (C) If cwnd - pipe >= 1 SMSS, the sender SHOULD transmit one or more segments...
> > >         (C.5) If cwnd - pipe >= 1 SMSS, return to (C.1)
> > > So based on the RFC, the sender should be able to send more segments when such a DupAck is received, because of the big change to `pipe`.
> > >
> > > In the current implementation, the cwin variable, which controls the amount of data that can be transmitted based on the new information, is dictated by snd_cwnd, and snd_cwnd is incremented by only 1 SMSS for each DupAck received. I believe this effectively limits the retransmission triggered by each DupAck to 1 SMSS - the deviation.
> > >  307         cwin =
> > >  308             imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0);
> > >
> > > As a result, SACK is not doing enough recovery in this scenario and loss has to be recovered by RTO.
> > > Again, I'd appreciate feedback from the community.
> > >
> > > Regards,
> > > Liang Tian
> > >
> > >
> > >
> > >
> > > On Sun, Aug 23, 2020 at 3:56 PM Scheffenegger, Richard <Richard.Scheffenegger at netapp.com> wrote:
> > > >
> > > > Hi Liang,
> > > >
> > > > In SACK loss recovery, you can recover up to ssthresh (prior cwnd/2 [or 70% in case of cubic]) lost bytes - at least in theory.
> > > >
> > > > In comparison, (New)Reno can only recover one lost packet per window, and then keeps on transmitting new segments (ack + cwnd), even before the receipt of the retransmitted packet is acked.
> > > >
> > > > For historic reasons, the semantic of the variable cwnd is overloaded during loss recovery, and it doesn't "really" indicate cwnd, but rather indicates if/when retransmissions can happen.
> > > >
> > > >
> > > > In both cases (also the simple one, with only one packet loss), cwnd should be equal (or near equal) to ssthresh by the time loss recovery is finished - but NOT before! While it may appear like slow-start, the value of the cwnd variable really increases by acked_bytes only per ACK (not acked_bytes + SMSS), since the left edge (snd_una) doesn't move right - unlike during slow-start. But numerically, these different phases (slow-start / sack loss-recovery) may appear very similar.
> > > >
> > > > You could check this using the (loadable) SIFTR module, which captures t_flags (indicating if cong/loss recovery is active), ssthresh, cwnd, and other parameters.
> > > >
> > > > That is at least how things are supposed to work; or have you investigated the timing and behavior of SACK loss recovery and found a deviation from RFC 3517? Note that FreeBSD currently has not fully implemented RFC 6675 support (which deviates slightly from 3517 under specific circumstances). I have a patch pending to implement the 6675 rescue retransmission, but haven't tweaked the other aspects of 6675 vs. 3517.
> > > >
> > > > BTW: While freebsd-net is not the wrong list per se, TCP-, UDP-, and SCTP-specific questions can also be posted to freebsd-transport, which is more narrowly focused.
> > > >
> > > > Best regards,
> > > >
> > > > Richard Scheffenegger
> > > >
> > > > -----Original Message-----
> > > > From: owner-freebsd-net at freebsd.org
> > > > <owner-freebsd-net at freebsd.org> On Behalf Of Liang Tian
> > > > Sent: Sonntag, 23. August 2020 00:14
> > > > To: freebsd-net <freebsd-net at freebsd.org>
> > > > Subject: Fast recovery ssthresh value
> > > >
> > > > Hi all,
> > > >
> > > > When 3 DupAcks are received and TCP enters fast recovery, if SACK is in use, cwnd is set to maxseg:
> > > >
> > > > 2593                     if (tp->t_flags & TF_SACK_PERMIT) {
> > > > 2594                         TCPSTAT_INC(
> > > > 2595                             tcps_sack_recovery_episode);
> > > > 2596                         tp->snd_recover = tp->snd_nxt;
> > > > 2597                         tp->snd_cwnd = maxseg;
> > > > 2598                         (void) tp->t_fb->tfb_tcp_output(tp);
> > > > 2599                         goto drop;
> > > > 2600                     }
> > > >
> > > > Otherwise (SACK is not in use), cwnd is set to maxseg before
> > > > tcp_output() and then set back to snd_ssthresh plus the inflation:
> > > > 2601                     tp->snd_nxt = th->th_ack;
> > > > 2602                     tp->snd_cwnd = maxseg;
> > > > 2603                     (void) tp->t_fb->tfb_tcp_output(tp);
> > > > 2604                     KASSERT(tp->snd_limited <= 2,
> > > > 2605                         ("%s: tp->snd_limited too big",
> > > > 2606                         __func__));
> > > > 2607                     tp->snd_cwnd = tp->snd_ssthresh +
> > > > 2608                          maxseg *
> > > > 2609                          (tp->t_dupacks - tp->snd_limited);
> > > > 2610                     if (SEQ_GT(onxt, tp->snd_nxt))
> > > > 2611                         tp->snd_nxt = onxt;
> > > > 2612                     goto drop;
> > > >
> > > > I'm wondering whether, in the SACK case, cwnd should be set back to ssthresh (which has already been reduced in cc_cong_signal() a few lines above) before line 2599, as in the non-SACK case, instead of slow-starting from maxseg.
> > > > I read RFC 6675 and a few others, and it looks like that's the case. I'd appreciate your opinion, again.
> > > >
> > > > Thanks,
> > > > Liang
> > > > _______________________________________________
> > > > freebsd-net at freebsd.org mailing list
> > > > https://lists.freebsd.org/mailman/listinfo/freebsd-net
> > > > To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
> > > _______________________________________________
> > > freebsd-transport at freebsd.org mailing list
> > > https://lists.freebsd.org/mailman/listinfo/freebsd-transport
> > > To unsubscribe, send any mail to "freebsd-transport-unsubscribe at freebsd.org"