TCP connections stuck in persist state

John Baldwin jhb at freebsd.org
Tue Feb 8 18:26:24 UTC 2011


I ran into a problem recently where a TCP socket seemed to never exit persist 
mode.  What would happen is that the sender was blasting data faster than the 
receiver could receive it.  When the receiver read some data, the sender would 
start sending again and everything would resume.  However, after 3-4 instances 
of this, the sender would decide to not resume sending data when the receiver 
opened the window.  Instead, it would slowly send a byte every few seconds via 
the persist timer even though the receiver was advertising a 64k window when 
it ACKed each of the window probes that the sender sent.

I dug around in kgdb and found that both snd_cwnd and snd_ssthresh were set to 
0 on the sender side.  I think this means that the send window is effectively 
stuck at zero permanently as a result.  (The tcpcb is also IN_FASTRECOVERY() 
on the sender side, probably from the storm of duplicate ACKs from the 
receiver when it sends a bunch of window updates; see my earlier e-mail to 
net@ for the source of the duplicate ACKs.)  Anyway, I think that this code 
in tcp_input() is what keeps the window at zero:

                /*
                 * If the congestion window was inflated to account
                 * for the other side's cached packets, retract it.
                 */
                if (tcp_do_newreno || (tp->t_flags & TF_SACK_PERMIT)) {
                        if (IN_FASTRECOVERY(tp)) {
                                if (SEQ_LT(th->th_ack, tp->snd_recover)) {
                                        if (tp->t_flags & TF_SACK_PERMIT)
                                                tcp_sack_partialack(tp, th);
                                        else
                                                tcp_newreno_partial_ack(tp, th);
                                } else {
                                        /*
                                         * Out of fast recovery.
                                         * Window inflation should have left us
                                         * with approximately snd_ssthresh
                                         * outstanding data.
                                         * But in case we would be inclined to
                                         * send a burst, better to do it via
                                         * the slow start mechanism.
                                         */
                                        KASSERT(tp->snd_ssthresh != 0,
                                            ("using bogus snd_ssthresh"));
                                        if (SEQ_GT(th->th_ack +
                                                        tp->snd_ssthresh,
                                                   tp->snd_max))
                                                tp->snd_cwnd = tp->snd_max -
                                                                th->th_ack +
                                                                tp->t_maxseg;
                                        else
                                                tp->snd_cwnd = tp->snd_ssthresh;
                                }
                        }

Specifically, since snd_recover and snd_una seem to keep advancing in
lock-step with each window update, I think it ends up falling through to the
last statement each time, where snd_cwnd = snd_ssthresh, thus keeping
snd_cwnd at 0.  This then causes a zero send window in tcp_output():

        sendwin = min(tp->snd_wnd, tp->snd_cwnd);
        sendwin = min(sendwin, tp->snd_bwnd);

Now, looking at the code, I can see no way that snd_ssthresh should ever be 
zero.  It seems to always be calculated from some number of segments times
t_maxseg.  The one exception to this rule is when it is restored from 
snd_ssthresh_prev due to a bad retransmit (this example is from tcp_input()):

                /*
                 * If we just performed our first retransmit, and the ACK
                 * arrives within our recovery window, then it was a mistake
                 * to do the retransmit in the first place.  Recover our
                 * original cwnd and ssthresh, and proceed to transmit where
                 * we left off.
                 */
                if (tp->t_rxtshift == 1 && (int)(ticks - tp->t_badrxtwin) < 0) {
                        ++tcpstat.tcps_sndrexmitbad;
                        tp->snd_cwnd = tp->snd_cwnd_prev;
                        tp->snd_ssthresh = tp->snd_ssthresh_prev;
                        tp->snd_recover = tp->snd_recover_prev;
                        if (tp->t_flags & TF_WASFRECOVERY)
                                ENTER_FASTRECOVERY(tp);
                        tp->snd_nxt = tp->snd_max;
                        tp->t_badrxtwin = 0;    /* XXX probably not required */
                }

So then my working theory is that somehow snd_ssthresh_prev is being used
when it hasn't been initialized.  I then checked 'ticks' on my host and found 
that it had wrapped to a negative integer value.  That means that when 
t_badrxtwin is zero (which is its default value), the second half of that
expression is always true.  This is harmless, however, as long as t_rxtshift 
stays 0 whenever snd_ssthresh_prev and friends aren't initialized.  That 
brings me to tcp_setpersist():

void
tcp_setpersist(struct tcpcb *tp)
{
        int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
        int tt;

        if (tcp_timer_active(tp, TT_REXMT))
                panic("tcp_setpersist: retransmit pending");
        /*
         * Start/restart persistance timer.
         */
        TCPT_RANGESET(tt, t * tcp_backoff[tp->t_rxtshift],
                      TCPTV_PERSMIN, TCPTV_PERSMAX);
        tcp_timer_activate(tp, TT_PERSIST, tt);
        if (tp->t_rxtshift < TCP_MAXRXTSHIFT)
                tp->t_rxtshift++;
}

Suppose that a connection has never had to retransmit a packet, so 
snd_ssthresh_prev and friends are all set to their initial values of zero.  
Suppose that the receiver's socket then fills up eventually triggering the 
persist timer.  Now when tcp_setpersist() is called, it will set t_rxtshift to 
1 with all those *_prev fields invalid.  The next ACK that arrives in 
tcp_input() while 'ticks' is negative will then trash snd_cwnd and friends 
with the bogus *_prev values.

I believe this was exposed relatively recently when I fixed problems with 
'ticks' rollover a while back; the tests used to be
'if (ticks < tp->t_badrxtwin)', which has a different set of bugs, but that
would explain why this hasn't been seen before.

I was never able to come up with a small test case for this.  The only way I
could reproduce it and test the fix was using mysql (unfortunately), though we
believe we have seen this in other production workloads as well.

After some whiteboard discussion with gnn@ I came up with a firmer timeline of
exactly how this state was triggered on the sender:

1) The receiver is "busy" for a while and does not service the socket, causing
   the receive buffer to fill.  Once the receiver does start reading the
   socket, it drains the entire socket, resulting in a small storm of ACKs all
   for the same sequence number.  The window grows during these ACKs; however,
   due to a bug in our window update code and large scale values (we have a
   scale factor of 8k because we increase the maximum socket buffer size),
   several of the ACKs end up being duplicates (e.g. 3 ACKs with a window of
   8k, then 3 ACKs with a window of 16k, then 3 with 24k, etc.).  I think
   these ACKs trigger fast recovery, so that IN_FASTRECOVERY() is true for
   the sender.

2) This cycle happens a second time on the receiver (getting busy, then a
   flood of ACKs).  This time, the receiver is busy long enough to cause the
   sender to enter the persist state, which sets t_rxtshift to one.

3) At this point an ACK arrives after the receiver started reading again.
   Because ticks is negative, t_badrxtwin is zero, and t_rxtshift is one, the
   code thinks this ACK should trigger 'bad retransmit' recovery.  However,
   snd_ssthresh_prev and snd_cwnd_prev are both zero, so this sets both
   snd_ssthresh and snd_cwnd to zero.

4) Because IN_FASTRECOVERY() is true, each arriving ACK triggers either the
   NewReno or SACK partial ACK handling code, which clamps snd_cwnd to be no
   larger than snd_ssthresh, so snd_cwnd stays at zero.  Once the socket is
   "stuck", each ACK of a window probe is treated as a partial ACK,
   perpetuating the zero congestion window forever.

The solution I came up with was to add a new flag to indicate when the *_prev
fields are valid and test that before triggering bad retransmit recovery.
The patch:

Index: netinet/tcp_input.c
===================================================================
--- netinet/tcp_input.c	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_input.c	(.../stable/7/sys)	(revision 219412)
@@ -1030,6 +1031,7 @@
 				 * "bad retransmit" recovery.
 				 */
 				if (tp->t_rxtshift == 1 &&
+				    tp->t_flags & TF_PREVVALID &&
 				    (int)(ticks - tp->t_badrxtwin) < 0) {
 					++tcpstat.tcps_sndrexmitbad;
 					tp->snd_cwnd = tp->snd_cwnd_prev;
@@ -1039,6 +1041,7 @@
 					if (tp->t_flags & TF_WASFRECOVERY)
 					    ENTER_FASTRECOVERY(tp);
 					tp->snd_nxt = tp->snd_max;
+					tp->t_flags &= ~TF_PREVVALID;
 					tp->t_badrxtwin = 0;
 				}
 
@@ -1961,7 +1968,8 @@
 		 * original cwnd and ssthresh, and proceed to transmit where
 		 * we left off.
 		 */
-		if (tp->t_rxtshift == 1 && (int)(ticks - tp->t_badrxtwin) < 0) {
+		if (tp->t_rxtshift == 1 && tp->t_flags & TF_PREVVALID &&
+		    (int)(ticks - tp->t_badrxtwin) < 0) {
 			++tcpstat.tcps_sndrexmitbad;
 			tp->snd_cwnd = tp->snd_cwnd_prev;
 			tp->snd_ssthresh = tp->snd_ssthresh_prev;
@@ -1969,6 +1977,7 @@
 			if (tp->t_flags & TF_WASFRECOVERY)
 				ENTER_FASTRECOVERY(tp);
 			tp->snd_nxt = tp->snd_max;
+			tp->t_flags &= ~TF_PREVVALID;
 			tp->t_badrxtwin = 0;	/* XXX probably not required */
 		}
 
Index: netinet/tcp_timer.c
===================================================================
--- netinet/tcp_timer.c	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_timer.c	(.../stable/7/sys)	(revision 219412)
@@ -476,7 +476,9 @@
 		else
 		  tp->t_flags &= ~TF_WASFRECOVERY;
 		tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1));
-	}
+		tp->t_flags |= TF_PREVVALID;
+	} else
+		tp->t_flags &= ~TF_PREVVALID;
 	tcpstat.tcps_rexmttimeo++;
 	if (tp->t_state == TCPS_SYN_SENT)
 		rexmt = TCP_REXMTVAL(tp) * tcp_syn_backoff[tp->t_rxtshift];
Index: netinet/tcp_var.h
===================================================================
--- netinet/tcp_var.h	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_var.h	(.../stable/7/sys)	(revision 219412)
@@ -201,6 +221,7 @@
 #define	TF_NEEDSYN	0x000400	/* send SYN (implicit state) */
 #define	TF_NEEDFIN	0x000800	/* send FIN (implicit state) */
 #define	TF_NOPUSH	0x001000	/* don't push */
+#define	TF_PREVVALID	0x002000	/* saved values for bad rxmit valid */
 #define	TF_MORETOCOME	0x010000	/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x020000	/* listen queue overflow */
 #define	TF_LASTIDLE	0x040000	/* connection was previously idle */
Index: netinet/tcp_output.c
===================================================================
--- netinet/tcp_output.c	(.../mirror/FreeBSD/stable/7/sys)	(revision 219412)
+++ netinet/tcp_output.c	(.../stable/7/sys)	(revision 219412)
@@ -1258,6 +1269,7 @@
 	int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
 	int tt;
 
+	tp->t_flags &= ~TF_PREVVALID;
 	if (tcp_timer_active(tp, TT_REXMT))
 		panic("tcp_setpersist: retransmit pending");
 	/*


-- 
John Baldwin
