The tale of a TCP bug

John Baldwin jhb at freebsd.org
Mon Mar 28 18:23:57 UTC 2011


On Monday, March 28, 2011 10:21:00 am John Baldwin wrote:
> On Saturday, March 26, 2011 10:02:12 am Stefan `Sec` Zehl wrote:
> > Hi again,
> > 
> > On Fri, Mar 25, 2011 at 16:40 -0400, John Baldwin wrote:
> > > Reading some more.  I'm trying to understand the breakage in your case.
> > > 
> > > You are saying that FreeBSD is the sender, who has data to send, yet is 
not 
> > > sending any window probes because it never starts the persist timer when 
the 
> > > initial window is zero?  Is that correct?
> > 
> > Yes. The receiver never sends a window update on its own, but when
> > probed will "admit" to a bigger window.
> > 
> > > And the problem is that the code that uses 'adv' to determine if it
> > > sound send a window update to the remote end is falsely succeeding due
> > > to the overflow causing tcp_output() to 'goto send' but that it then
> > > fails to send any data because it thinks the remote window is full?
> > 
> > Yes, as far as I remember (I did that part of debugging 2 Months ago,
> > when I submitted the PR %-) that's what happens.
> > 
> > > So one thing I don't quite follow is how you are having rcv_nxt >
> > > rcv_adv.  I saw this when the other side would send a window probe,
> > > and then the receiving side would take the -1 remaining window and
> > > explode it into the maximum window size when it ACKd.
> > 
> > No, it's not rcv_nxt > rcv_adv. It's
> > 
> > (rcv_adv - rcv_nxt) > min(recwin, (long)TCP_MAXWIN << tp->rcv_scale)
> > 
> > My sample case has (rcv_adv - rcv_nxt) = 65536, but 
> > (TCP_MAXWIN << tp->rcv_scale) = 65535 (as there is no window scaling in
> > effect)
> 
> Ahhhh.
> 
> > > Are you seeing the other end of the connection send a window probe, but 
> > > FreeBSD is not setting the persist timer so that it will send its own 
window 
> > > probes?
> > 
> > No, the dump looks like this:
> > 
> > | 10.42.0.25.44852 > 10.42.0.2.1516: Flags [S], 
> > |    seq 3339144437, win 65535, options [...], length 0
> > 
> > FreeBSD sending the first SYN.
> > [rcv_adv=0, rcv_nxt=0]
> > 
> > | 10.42.0.2.1516 > 10.42.0.25.44852: Flags [S.], 
> > |    seq 42, ack 3339144438, win 0, length 0
> > 
> > The other end SYN|ACKing with a window size of 0.
> > 
> > | 10.42.0.25.44852 > 10.42.0.2.1516: Flags [.], 
> > |    seq 1, ack 1, win 65535, length 0
> > 
> > FreeBSD ACKing, and (correctly) sending no data.
> > [rcv_adv=67779, rcv_nxt=43], thus resulting in adv=-1/0xffffffff
> 
> Ahh, and this is the real bug.  And this goes back to the calculation of
> 'rcv_wnd' in tcp_input().
> 
> How about this:
> 
> Index: tcp_input.c
> ===================================================================
> --- tcp_input.c	(revision 220098)
> +++ tcp_input.c	(working copy)
> @@ -1694,6 +1694,8 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th,
>  	win = sbspace(&so->so_rcv);
>  	if (win < 0)
>  		win = 0;
> +	if (win > TCP_MAXWIN << tp->rcv_scale)
> +		win = TCP_MAXWIN << tp->rcv_scale;
>  	tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));
>  
>  	/* Reset receive buffer auto scaling when not in bulk receive mode. */
> 
> This is basically the same as your patch except that it ensures that
> 'rcv_wnd' is accurate for any other uses.

No, this is not really right.  Your patch from your blog is the best fix 
actually.  The reason we want to let 'win' be larger than TCP_MAXWIN is that 
if the remote end sends more data than we've advertised but we have room in 
the socket buffer, we want to go ahead and accept the data as valid and ACK it 
rather than dropping the data that is beyond rcv_adv.  My change above to 
rcv_wnd would break this.  Also, for the TCPS_SYN_SENT case we don't know what 
'rcv_scale' is until just before we update 'rcv_adv'.  This should be the same 
as your patch:

Index: tcp_input.c
===================================================================
--- tcp_input.c	(revision 220098)
+++ tcp_input.c	(working copy)
@@ -1756,7 +1756,8 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th,
 				(TF_RCVD_SCALE|TF_REQ_SCALE)) {
 				tp->rcv_scale = tp->request_r_scale;
 			}
-			tp->rcv_adv += tp->rcv_wnd;
+			tp->rcv_adv += imin(tp->rcv_wnd,
+			    TCP_MAXWIN << tp->rcv_scale);
 			tp->snd_una++;		/* SYN is acked */
 			/*
 			 * If there's data, delay ACK; if there's also a FIN

-- 
John Baldwin


More information about the freebsd-net mailing list