svn commit: r205391 - head/sys/netinet

Robert Watson rwatson at FreeBSD.org
Sun Mar 21 10:15:59 UTC 2010


On Sat, 20 Mar 2010, Kip Macy wrote:

>  - spread tcp timer callout load evenly across cpus if net.inet.tcp.per_cpu_timers is set to 1
>  - don't default to acquiring tcbinfo lock exclusively in rexmt
>
>  MFC after:	7 days

In the future, it would be helpful if you could make independent changes such 
as these as separate commits.  Doing so makes post-commit review easier, and also 
means that elements of the change can be more easily backed out or merged 
separately.

My experience with TCP timer and locking changes is that 7 days is a very 
short merge time.  I would generally suggest a minimum of a month -- while 
bugs crop up more slowly in HEAD testing, the bug reports are better and 
you're more likely to get people with expertise looking at them there (without 
the risk of taking out large numbers of systems :-).  This is especially true 
currently, when several folks actively have their hands in TCP.

For more complex changes, I've generally been going with a three month MFC 
timeout.

> +static int	per_cpu_timers = 0;
> +SYSCTL_INT(_net_inet_tcp, OID_AUTO, per_cpu_timers, CTLFLAG_RW,
> +    &per_cpu_timers , 0, "run tcp timers on all cpus");
> +
> +#define	INP_CPU(inp)	(per_cpu_timers ? (!CPU_ABSENT(((inp)->inp_flowid % (mp_maxid+1))) ? \
> +		((inp)->inp_flowid % (mp_maxid+1)) : curcpu) : 0)

The 'curcpu' case here violates an invariant we have been trying hard to 
maintain: that callouts for a single inpcb/tcpcb execute on only one CPU at a 
time.  While I don't have any specific bugs in mind (well, perhaps other than 
the well-known TCP timer race, which it turns out does occur with moderate 
frequency), it is fairly certain that increasing single connection parallelism 
in callouts would significantly increase the chances of hitting those 
bugs/features.  Long and hard experience suggests that changing assumptions in 
the TCP timer code can have subtle but ultimately catastrophic consequences 
(both Andre and I have run into this in the past decade).

Since the 'curcpu' case above is somewhat unlikely in the hardware you're 
using, can I suggest changing it to fall back to CPU 0 in that case as well? 
This would maintain the parallelism you're trying to accomplish but avoid that 
edge case that could have hard-to-track-down consequences.  This would 
increase my comfort level with an MFC before 8.1.

> @@ -478,11 +485,22 @@ tcp_timer_rexmt(void * xtp)
> 	if (++tp->t_rxtshift > TCP_MAXRXTSHIFT) {
> 		tp->t_rxtshift = TCP_MAXRXTSHIFT;
> 		TCPSTAT_INC(tcps_timeoutdrop);
> +		in_pcbref(inp);
> + 		INP_INFO_RUNLOCK(&V_tcbinfo);
> + 		INP_WUNLOCK(inp);
> + 		INP_INFO_WLOCK(&V_tcbinfo);
> + 		INP_WLOCK(inp);
> + 		if (in_pcbrele(inp)) {
> + 			INP_INFO_WUNLOCK(&V_tcbinfo);
> + 			CURVNET_RESTORE();
> + 			return;
> + 		}
> 		tp = tcp_drop(tp, tp->t_softerror ?
> 			      tp->t_softerror : ETIMEDOUT);
> +		headlocked = 1;
> 		goto out;
> 	}
> -	INP_INFO_WUNLOCK(&V_tcbinfo);
> +	INP_INFO_RUNLOCK(&V_tcbinfo);
> 	headlocked = 0;
> 	if (tp->t_rxtshift == 1) {
> 		/*

Recent survey results for tcp_timer_race leave me a bit worried about changes 
that open up greater potential races in the TCP timer code, and this one does 
worry me.  When tcp_timer_race occurs, holding tcbinfo continuously across the 
timer is what helps mitigate the race.  With dozens of sites reporting 
significantly non-zero instances of the bug, I'm worried that this change could 
allow a conversion from "silently mitigated with a counter bump" to "the 
system panics or similar".  I need to think a bit more about the exact nature 
of the bug, but it could be that MFC'ing this part of the change before 
tcp_timer_race is fixed could have unfortunate stability consequences.

(Since you're working only with long-lived connections with relatively little 
turnover, you may not see this in testing -- however, you can check 
net.inet.tcp.timer_race to see if production systems see it).  Interestingly, 
it wasn't just 8-core systems that appeared in the reports; there were also 
uniprocessor systems.

On an unrelated note: I think it would be useful to rename 'headlocked' in 
this function to 'headwlocked', since it no longer tracks whether tcbinfo is 
locked at all, it just tracks whether it is write-locked.

Robert N M Watson
Computer Laboratory
University of Cambridge
