cvs commit: src/sys/netinet in_pcb.c tcp_subr.c tcp_timer.c tcp_var.h

Wed Sep 6 07:32:15 PDT 2006

  Mike,

On Wed, Sep 06, 2006 at 09:16:03AM -0500, Mike Silbersack wrote:
M> > Modified files:
M> >   sys/netinet          in_pcb.c tcp_subr.c tcp_timer.c tcp_var.h
M> > Log:
M> > o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
M> >   bad under high load. For example with 40k sockets and 25k tcptw
M> >   entries, connect() syscall can run for seconds. Debugging showed
M> >   that it iterates the cycle millions times and purges thousands of
M> >   tcptw entries at a time.
M> >   Besides practical unusability this change is architecturally
M> >   wrong. First, in_pcblookup_local() is used in connect() and bind()
M> >   syscalls. No stale entries purging shouldn't be done here. Second,
M> >   it is a layering violation.
M> 
M> So you're returning to the behavior where the system chokes and stops all 
M> outbound TCP connections because everything is in the timewait state? 
M> There has to be a way to fix the problem without removing this heuristic 
M> entirely.
M> 
M> How did you run your tests?

Since we upgraded our web frontends from RELENG_4 to RELENG_6 half a
year ago, we have been noticing a small packet loss rate that was clearly
a function of the server load. If we removed half of the frontends
from the farm, thus doubling the load, the lags could be measured in
seconds, with idle CPU time between the lags. And our reference
RELENG_4 box handles that load easily.

First we suspected a driver or lower network stack issue, so we
tried different hardware and different network settings: polling, no
polling, direct ISR dispatch.

Then we found the CPU hog in in_pcblookup_local(). I added
counters and gathered stats via ktr(4). When a lag occurred, the
following data was gathered:

112350 return 0x0, iterations 0, expired 0
112349 return 0xc5154888, iterations 19998, expired 745
112348 return 0xc5154930, iterations 569, expired 20
112347 return 0xc51549d8, iterations 2084890, expired 9836
112346 return 0xc5154a80, iterations 9382, expired 524
112345 return 0xc5154bd0, iterations 64984631, expired 5501

The "iterations" counter counts the number of iterations of this
loop:

			LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist)

The "expired" counter counts the number of tcp_twclose() calls.
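
To make the two counters concrete, here is a minimal userland sketch of
an instrumented list walk. It is not the kernel code: struct pcb,
struct lookup_stats and lookup_local() are hypothetical stand-ins, the
unlink step only stands in for tcp_twclose(), and for simplicity the
walk always covers the whole list.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry on a port's PCB list. */
struct pcb {
	struct pcb *next;
	int         is_timewait;  /* entry is a tcptw-style record */
	long        expires_at;   /* tick at which a timewait entry is stale */
};

/* The two counters reported in the trace. */
struct lookup_stats {
	unsigned long iterations; /* list nodes visited */
	unsigned long expired;    /* stale timewait entries purged */
};

/*
 * Walk the whole list, unlinking stale timewait entries on the way.
 * Returns the first entry that is still live, or NULL if none is left.
 */
static struct pcb *
lookup_local(struct pcb **head, long now, struct lookup_stats *st)
{
	struct pcb **pp = head;
	struct pcb *match = NULL;

	st->iterations = 0;
	st->expired = 0;
	while (*pp != NULL) {
		struct pcb *p = *pp;

		st->iterations++;
		if (p->is_timewait && p->expires_at <= now) {
			*pp = p->next;  /* unlink; stands in for tcp_twclose() */
			p->next = NULL;
			st->expired++;
			continue;
		}
		if (match == NULL)
			match = p;
		pp = &p->next;
	}
	return match;
}
```

The point of the sketch is that the cost of one lookup is proportional
to the length of the list it walks, and the purging is paid for by
whichever caller happens to do the lookup.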

So, for one connect() syscall in_pcblookup_local() was called
5 times, each time doing an enormous amount of "work" inside. On the
sixth call it succeeded in finding an unused port.
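
As a quick check of the total cost of that single connect(), the five
expensive records from the trace can be added up. The sketch below only
sums the numbers quoted above; struct lookup_record and sum_records()
are names made up for the illustration.

```c
#include <stddef.h>

struct lookup_record {
	unsigned long iterations;
	unsigned long expired;
};

/* Records 112345..112349 from the trace, oldest first. */
static const struct lookup_record trace[] = {
	{ 64984631UL, 5501UL },
	{     9382UL,  524UL },
	{  2084890UL, 9836UL },
	{      569UL,   20UL },
	{    19998UL,  745UL },
};

/* Add up both counters over n records. */
static struct lookup_record
sum_records(const struct lookup_record *r, size_t n)
{
	struct lookup_record total = { 0UL, 0UL };
	size_t i;

	for (i = 0; i < n; i++) {
		total.iterations += r[i].iterations;
		total.expired += r[i].expired;
	}
	return total;
}
```

Summing the five records gives 67099470 loop iterations and 16626
tcp_twclose() calls for one connect().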

M> > o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
M> >   that was removed in rev. 1.78 by rwatson. The commit log of this
M> >   revision tells nothing about the reason cycle was removed. Now
M> >   we need this cycle, since major cleaner of stale tcptw structures
M> >   is removed.
M> 
M> Looks good, this is probably the reason for the code in in_pcb behaving so 
M> poorly.  Did you test just this change alone to see if it solved the 
M> problem that you were seeing?

1.78 hasn't been merged to RELENG_6 yet, and we faced the problem on
RELENG_6 boxes, where the periodic purging cycle is present. So the
problem is not caused by 1.78 of tcp_timer.c. We have a lot of tcptw
entries because we have a very high connection rate, not because they
are leaked or not purged.
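
A rough way to see this: in steady state the number of tcptw entries is
about the rate at which connections enter TIME_WAIT multiplied by the
time each entry is kept. The sketch below does that arithmetic. The
60 second hold time is an assumption (2*MSL with a 30 second MSL), not
a value measured on our boxes, and both function names are made up for
the illustration.

```c
/* Entries in steady state = arrival rate * time each entry is held. */
static unsigned long
steady_state_entries(unsigned long conns_per_sec, unsigned long hold_secs)
{
	return conns_per_sec * hold_secs;
}

/* Inverse: the close rate needed to sustain a given population. */
static unsigned long
implied_rate(unsigned long entries, unsigned long hold_secs)
{
	if (hold_secs == 0)
		return 0;               /* avoid division by zero */
	return entries / hold_secs;     /* integer division, rounds down */
}
```

Under that assumption, the 25k tcptw entries mentioned in the commit
log correspond to roughly 416 connections entering TIME_WAIT per
second, with nothing leaked.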

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE