Lock profiling results on TCP and an 8.x project

Robert Watson rwatson at FreeBSD.org
Sat Oct 20 12:14:30 PDT 2007


Dear all:

This is just an FYI e-mail since I have some early measurement results I 
thought I'd share.  I've started to look at decomposing tcbinfo, and 
preparatory to that, I ran some lock profiling on simple TCP workloads to 
explore where and how contention is arising.  I used a three-machine 
configuration in the netperf cluster -- two client boxes (tiger-1, tiger-3) 
linked to a 4-core amd64 server (cheetah) by two dedicated gig-e ethernet links. 
In one test, I ran netserver on cheetah, and in the other, netrate's httpd. 
I ran the respective clients on tiger-1 and tiger-3 in both tests.  One 
important property of this configuration is that because there are independent 
network links, the ithreads for the two devices can be scheduled 
independently, and run the network stack to completion via direct dispatch, 
offering the opportunity for full parallelism in the TCP input path (subject 
to limits in our locking model).  Each sample was gathered for approximately 
10 seconds during the run (40 seconds of CPU time over the 4 cores).

In the netperf test, I used two independent TCP streams, one per interface 
with the TCP stream benchmark in the steady state.  This should essentially 
consist of cheetah receiving large data packets and sending back small ACKs; 
in principle the two workloads are entirely independent, although in practice 
TCP locking doesn't allow that, and you get potential interactions due to the 
memory allocator, scheduler, etc.

In the HTTP test, I configured 32 workers each on tiger-1 and tiger-3, and 
serviced them with a 128-worker httpd on cheetah.  The file transferred was 1k, 
and it was the same 1k file repeatedly sent via sendfile.  Unlike the netperf 
test, this resulted in very little steady-state TCP traffic--the entire 
request fits in one segment, and the file fits in a second segment.  Also, 
workers are presumably available to move back and forth between work sources, 
and there's a single shared listen socket.  I.e., opportunities for completely 
independent operation are significantly reduced, and there are lots of 
globally visible TCP state changes.

Netperf test top wait_total locks:

Seconds		Instance
5.75s		tcp_usrreq.c:729 (inp)
2.18s		tcp_input.c:479 (inp)
1.67s		tcp_input.c:400 (tcp)
0.32s		uipc_socket.c:1424 (so_rcv)
0.28s		tcp_input.c:1191 (so_rcv)
0.20s		kern_timeout.c:419 (callout)
0.09s		route.c:147 (radix node head)
...

In this test, the top four locking points are responsible for consuming 25% of 
available CPU*.  We can reasonably assume that the contention on 'inp' and to 
a lesser degree 'so_rcv' is between the ithread and netserver processes for 
each network interface and that they duke it out significantly due to 
generating ACKs, moving data in and out of socket buffers, etc.  Only the 
'tcp' lock reflects interference between the two otherwise independent 
sessions operating over the independent links.
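The distinction between per-connection 'inp' contention and contention on the single global lock can be modeled in userland with pthreads.  This is only a toy sketch, not kernel code -- the flow structure, counters, and iteration count are all illustrative -- but it shows the shape of the problem: with per-flow locks the two ithread/netserver pairs can proceed independently, while a shared global lock serializes them even though the flows never touch each other's data.

```c
#include <pthread.h>
#include <stddef.h>

#define FLOWS 2        /* two independent TCP streams, one per interface */
#define OPS   100000   /* packets processed per flow */

/* Models the global 'tcp' (tcbinfo) lock: shared by all flows. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

struct flow {
    pthread_mutex_t inp_lock;  /* models the per-connection 'inp' lock */
    long packets;
    int use_global;
};

static void *
flow_input(void *arg)
{
    struct flow *f = arg;

    for (int i = 0; i < OPS; i++) {
        pthread_mutex_t *l = f->use_global ? &global_lock : &f->inp_lock;

        pthread_mutex_lock(l);
        f->packets++;          /* stand-in for per-packet input processing */
        pthread_mutex_unlock(l);
    }
    return (NULL);
}

/*
 * Run FLOWS threads to completion; returns total packets processed.
 * With use_global set, every packet contends on one mutex; without it,
 * each flow spins only on its own lock and the flows run in parallel.
 */
long
run_flows(int use_global)
{
    struct flow f[FLOWS];
    pthread_t t[FLOWS];
    long total = 0;

    for (int i = 0; i < FLOWS; i++) {
        pthread_mutex_init(&f[i].inp_lock, NULL);
        f[i].packets = 0;
        f[i].use_global = use_global;
        pthread_create(&t[i], NULL, flow_input, &f[i]);
    }
    for (int i = 0; i < FLOWS; i++) {
        pthread_join(t[i], NULL);
        total += f[i].packets;
        pthread_mutex_destroy(&f[i].inp_lock);
    }
    return (total);
}
```

Both variants compute the same result; the difference shows up purely as lock wait time, which is exactly what the wait_total profile above measures.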

HTTP test top wait_total locks:

Seconds		Instance
8.50s		tcp_input.c:400 (tcp)
2.21s		tcp_usrreq.c:568 (tcp)
1.96s		tcp_usrreq.c:955 (tcp)
0.78s		subr_turnstile.c:546 (chain)
0.52s		tcp_usrreq.c:606 (inp)
0.16s		subr_turnstile.c:536 (chain)
0.13s		tcp_input.c:2867 (so_rcv)
0.12s		kern_timeout.c:419 (callout)
0.08s		route.c:147 (radix node head)
...

In this test, the top four locking points are responsible for consuming 34% of 
available CPU*.  Here, it is clear that the global 'tcp' lock is responsible 
for most of the suffering as a result of the lock getting held across the 
input path for most packets.  This is in contrast to the steady state flow, in 
which most packets require only brief tcbinfo lookups and not extended holding 
time required for packets that may lead to state changes (SYN, FIN, etc.). 
Also, this is the send path, which is directly dispatched from the user code 
all the way to the interface queue or device driver, so there's no heavy 
contention in the handoff between the two (and hence inp hammering) in this 
direction.  Jeff and Attilio tell me that the turnstile contention is simply a 
symptom of heavy contention on the other mutexes in the workload, and not a 
first-order effect.
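For the record, the percentages quoted for both tests follow from summing the top four wait_total figures against the 40 seconds of CPU time per sample.  A quick check, with the figures hard-coded from the tables above:

```c
#include <stddef.h>

/* wait_total for the top four locking points, from the two tables above */
static const double netperf_waits[4] = { 5.75, 2.18, 1.67, 0.32 };
static const double http_waits[4]    = { 8.50, 2.21, 1.96, 0.78 };

/* Percentage of available CPU spent waiting at these locking points. */
static double
wait_pct(const double *waits, size_t n, double cpu_seconds)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += waits[i];
    return (100.0 * sum / cpu_seconds);
}

double
netperf_pct(void)
{
    return (wait_pct(netperf_waits, 4, 40.0));  /* 9.92s / 40s = ~24.8% */
}

double
http_pct(void)
{
    return (wait_pct(http_waits, 4, 40.0));     /* 13.45s / 40s = ~33.6% */
}
```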

These results appear to confirm that we need to look at breaking out the 
tcbinfo lock as a key goal for 8.0, with serious thoughts about an MFC once 
stabilized and if ABI-safe--these two workloads represent the extremes of TCP 
workloads, and real world configurations will fall in between, but as the 
number of cores goes up and the desire to spread work over CPUs goes up, so 
will contention on a single global lock.  An important part of this will be 
establishing models for distributing the work over CPUs in such a way as to 
avoid contention while still allowing load balancing.
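One obvious shape for the decomposition is the classic hashed-lock approach: replace the single global lock over the connection table with per-bucket locks, so that lookups for unrelated flows contend only when they happen to hash to the same bucket.  The following is a userland sketch with pthreads -- the structure names, field layout, and hash function are illustrative, not the actual tcbinfo/inpcb data structures:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NBUCKETS 64

struct conn {
    uint32_t faddr;            /* foreign address */
    uint16_t fport, lport;     /* foreign and local ports */
    struct conn *next;
};

struct bucket {
    pthread_mutex_t lock;      /* replaces one global table lock */
    struct conn *head;
};

static struct bucket table[NBUCKETS];

/* Toy hash of the flow tuple; real code would want a stronger hash. */
static unsigned
conn_hash(uint32_t faddr, uint16_t fport, uint16_t lport)
{
    return ((faddr ^ ((uint32_t)fport << 16) ^ lport) % NBUCKETS);
}

void
table_init(void)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].head = NULL;
    }
}

void
conn_insert(struct conn *c)
{
    struct bucket *b = &table[conn_hash(c->faddr, c->fport, c->lport)];

    pthread_mutex_lock(&b->lock);      /* only this bucket is serialized */
    c->next = b->head;
    b->head = c;
    pthread_mutex_unlock(&b->lock);
}

struct conn *
conn_lookup(uint32_t faddr, uint16_t fport, uint16_t lport)
{
    struct bucket *b = &table[conn_hash(faddr, fport, lport)];
    struct conn *c;

    pthread_mutex_lock(&b->lock);
    for (c = b->head; c != NULL; c = c->next)
        if (c->faddr == faddr && c->fport == fport && c->lport == lport)
            break;
    /*
     * NB: dropping the bucket lock before returning, as here, would race
     * with connection teardown in a real stack; the lookup would have to
     * hand back the connection with its own lock held.
     */
    pthread_mutex_unlock(&b->lock);
    return (c);
}
```

The sketch deliberately elides the hard part: establishing a safe lock order between the bucket lock and the per-connection lock during lookup and teardown, which is exactly where the tcbinfo decomposition gets interesting.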

Anyhow, just an early report as I continue my investigation of this issue...

* When talking about percentage of available CPUs, I make the assumption that
   due to a sufficient quantity of CPUs, in most cases lock acquisition will
   occur as a result of adaptive spinning rather than sleeping.  In the netperf
   case, this is not true, since the number of potential workers exceeds the
   number of CPUs, hence the turnstile contention.  However, as sleeping on
   locks is itself very expensive, it's reasonable to assume we would recover
   a lot of CPU nonetheless.

Robert N M Watson
Computer Laboratory
University of Cambridge


More information about the freebsd-arch mailing list