Some questions about the new TCP congestion control code

Tue Jan 15 20:27:54 UTC 2013

On Tuesday, January 15, 2013 3:29:51 am Lawrence Stewart wrote:
> Hi John,
> 
> On 01/15/13 08:04, John Baldwin wrote:
> > I was looking at TCP congestion control at work recently and noticed a few 
> 
> Poor you ;)
> 
> > "odd" things in the current code.  First, there is this chunk of code in 
> > cc_ack_received() in tcp_input.c:
> > 
> > static void inline
> > cc_ack_received(struct tcpcb *tp, struct tcphdr *th, uint16_t type)
> > {
> > 	INP_WLOCK_ASSERT(tp->t_inpcb);
> > 
> > 	tp->ccv->bytes_this_ack = BYTES_THIS_ACK(tp, th);
> > 	if (tp->snd_cwnd == min(tp->snd_cwnd, tp->snd_wnd))
> > 		tp->ccv->flags |= CCF_CWND_LIMITED;
> > 	else
> > 		tp->ccv->flags &= ~CCF_CWND_LIMITED;
> > 
> > 
> > Due to hysterical raisins, snd_cwnd and snd_wnd are u_long values, not 
> > integers, so the call to min() results in truncation on 64-bit hosts.
> 
> Good catch, but I don't think it matters in practice as neither snd_cwnd
> or snd_wnd will grow past the 32-bit boundary.

I have a psyhcotic case using cc_cubic where it seems to grow without bound,
though that is a bug in and of itself (and this change did not fix that
issue).  I ended up not using cc_cubic (more below) and haven't been able
to track down the root cause of the delay.  I can probably provide a test case
to reproduce this if you are interested.

> > It should probably be ulmin() instead.  However, this line seems to be a really 
> > obfuscated way to just write:
> > 
> > 	if (tp->snd_cwnd <= tp->snd_wnd)
> 
> You are correct, though I'd argue the meaning of the existing code as
> written is clearer compared to your suggested change.
> 
> > If that is correct, I would vote for changing this to use the much simpler 
> > logic.
> 
> Agreed. While I find the existing code slightly clearer in meaning, it's
> not significant enough to warrant keeping it as is when your suggested
> change is simpler, fixes a bug and achieves the same thing. Happy for
> you to change it or I can do it if you prefer.

I'll leave that to you, thanks.

> > Secondly, in the particular case I was investigating at work (restart of an 
> > idle connnection), the newreno congestion control code in 8.x and later uses a 
> > different algorithm than in 7.  Specifically, in 7 TCP would reuse the same 
> > logic used for an initial cwnd (honoring ss_fltsz).  In 8 this no longer 
> > happens (instead, 2 is hardcoded).  A guess at a possible fix might look 
> > something like this:
> > 
> > Index: cc_newreno.c
> > ===================================================================
> > --- cc_newreno.c	(revision 243660)
> > +++ cc_newreno.c	(working copy)
> > @@ -169,8 +169,21 @@ newreno_after_idle(struct cc_var *ccv)
> >  	if (V_tcp_do_rfc3390)
> >  		rw = min(4 * CCV(ccv, t_maxseg),
> >  		    max(2 * CCV(ccv, t_maxseg), 4380));
> > +#if 1
> >  	else
> >  		rw = CCV(ccv, t_maxseg) * 2;
> > +#else
> > +	/* XXX: This is missing a lot of stuff that used to be in 7. */
> > +#ifdef INET6
> > +	else if ((isipv6 ? in6_localaddr(&CCV(ccv, t_inpcb->in6p_faddr)) :
> > +		in_localaddr(CCV(ccv, t_inpcb->inp_faddr))))
> > +#else
> > +	else if (in_localaddr(CCV(ccv, t_inpcb->inp_faddr)))
> > +#endif
> > +		rw = V_ss_fltsz_local * CCV(ccv, t_maxseg);
> > +	else
> > +		rw = V_ss_fltsz * CCV(ccv, t_maxseg);
> > +#endif
> >  
> >  	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
> >  }
> > 
> > (But using the #else clause instead of the current #if 1 code).  Was this 
> > change in 8 intentional?
> 
> It was. Unlike connection initialisation which still honours ss_fltsz in
> cc_conn_init(), restarting an idle connection based on ss_fltsz seemed
> particularly dubious and as such was omitted from the refactored code.
> 
> The ultimate goal was to remove the ss_fltsz hack completely and
> implement a smarter mechanism, but that hasn't quite happened yet. The
> removal of ss_fltsz from 10.x without providing a replacement mechanism
> is not ideal and should probably be addressed.
> 
> I'm guessing you're not using rfc3390 because you want to override the
> initial window based on specific local knowledge of the path between
> sender and receiver?

Correct, in 7.x we had cranked ss_fltsz up to a really high number to prevent
the congestion window from collapsing when the connection was idle.  We have
a bit of a unique workload in that we are using TCP to reliably forward a
latency-sensitive datagram stream across a WAN connection with high bandwidth
and high RTT.  Most of congestion control seems tuned to bulk transfers rather
than this sort of use case.  The solution we have settled on here is to add a
new TCP socket option to disable idle handling so that when an idle connection
restarts it keeps its prior congestion window.

One other thing I noticed which is may or may not be odd during this, is that
if you have a connection with TCP_NODELAY enabled and you fill your cwnd and
then you get an ACK back for an earlier small segment (less than MSS), TCP
will not send out a "short" segment for the amount of window space released.
Instead, it will wait until a full MSS of space is available before sending
a packet.  I'm not sure if that is the correct behavior with TCP_NODELAY or
if we should send "short" segments in that case.

Anyway, I'll post a patch of the new socket option I described to see if it
is something we want to add.  It is an opt-in thing of course and not enabled
by default.

-- 
John Baldwin