Question on TCP reassembly counter

Fri Oct 1 10:26:22 UTC 2010

Hi,

Below is an observation made while testing our XLR/XLS network
driver with 16 concurrent instances of netperf on FreeBSD-CURRENT.
It raises a question that I hope someone here can help me
understand.

When running 16 concurrent netperf instances (each for about 20
seconds), we found that after some number of runs performance
degraded badly (by almost a factor of 5) and stayed degraded for all
subsequent runs. We started debugging from the TCP side, since other
driver tests ran fine for comparably long durations on the same
board and software.

netstat indicated the following:

$ netstat -s -f inet -p tcp | grep discarded
                0 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
                7318 discarded due to memory problems

We then traced "discarded due to memory problems" to the following counter:

$ sysctl -a net.inet.tcp.reass
net.inet.tcp.reass.overflows: 7318
net.inet.tcp.reass.maxqlen: 48
net.inet.tcp.reass.cursegments: 1594    <--- corresponds to the V_tcp_reass_qsize variable
net.inet.tcp.reass.maxsegments: 1600

Our guess for why reassembly was needed at all (in this low-packet-loss
test setup) was the lack of per-flow classification in the driver, which
causes it to spray incoming packets across the 16 h/w cpus instead of
sending all packets of a flow to the same cpu. While we work on
addressing this driver limitation, we debugged further to see how and
why V_tcp_reass_qsize grew (assuming that the number of out-of-order
segments should have dropped to zero by the end of each run). The
counter had in fact been growing since the initial runs, but the
performance degradation appeared only once it got close to
maxsegments. We then looked at vmstat as well to see how many
reassembly segments had been lost, but none had. We could not
reconcile "no lost segments" with "growth of this counter across test
runs".
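
To make "per-flow classification" concrete, here is a minimal sketch in
C of the kind of mapping meant above: every packet of a flow hashes to
the same cpu, so segments of one connection are not reordered by being
processed on different cpus. The struct, the function names and the
choice of FNV-1a are assumptions made for illustration only; this is
not the XLR/XLS driver code, nor the hash the hardware would use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-tuple identifying one TCP flow (illustrative only). */
struct flow_key {
	uint32_t src_ip;
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
};

/* FNV-1a over the tuple fields; any reasonable hash would do here. */
static uint32_t
flow_hash(const struct flow_key *k)
{
	uint32_t words[3];
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */
	int i, b;

	words[0] = k->src_ip;
	words[1] = k->dst_ip;
	words[2] = ((uint32_t)k->src_port << 16) | k->dst_port;

	for (i = 0; i < 3; i++) {
		for (b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xffu;
			h *= 16777619u;	/* FNV 32-bit prime */
		}
	}
	return (h);
}

/* Map a flow to a cpu index in [0, ncpus). Same flow, same cpu. */
static unsigned
flow_to_cpu(const struct flow_key *k, unsigned ncpus)
{
	assert(ncpus > 0);
	return (flow_hash(k) % ncpus);
}
```

With ncpus = 16, two packets carrying the same 4-tuple always land on
the same cpu, while different flows are spread across the 16 h/w
threads.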

$ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
net.inet.tcp.reass.overflows: 0
net.inet.tcp.reass.maxqlen: 48
net.inet.tcp.reass.cursegments: 147
net.inet.tcp.reass.maxsegments: 1600
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256,      0,    4096,    3200, 5653833,   0,   0
mbuf:                   256,      0,       1,    2048, 4766910,   0,   0
mbuf_cluster:          2048,  25600,    7296,       6,    7297,   0,   0
mbuf_jumbo_page:       4096,  12800,       0,       0,       0,   0,   0
mbuf_jumbo_9k:         9216,   6400,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,   3200,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0
tcpreass:                20,   1690,       0,     845, 1757074,   0,   0

In view of these observations, my question is: is it possible for the
V_tcp_reass_qsize variable to be updated unsafely on SMP? (The
particular flavor of XLS used in the test has 4 cores with 4
h/w threads per core.) I see that the tcp_reass function assumes some
lock is held, but I am not sure whether it is the per-socket lock or
the global TCP lock.

Any inputs on what I missed are most welcome.

Thanks,
Sriram Gorti
Netlogic Microsystems