Lock profiling results on TCP and an 8.x project
Robert Watson
rwatson at FreeBSD.org
Sat Oct 20 12:14:30 PDT 2007
Dear all:
This is just an FYI e-mail since I have some early measurement results I
thought I'd share. I've started to look at decomposing tcbinfo, and
preparatory to that, I ran some lock profiling on simple TCP workloads to
explore where and how contention is arising. I used a three-machine
configuration in the netperf cluster -- two client boxes (tiger-1, tiger-3)
linked to a 4-core amd64 server (cheetah) by two dedicated gig-e ethernet links.
In one test, I ran netserver on cheetah, and in the other, netrate's httpd.
I ran the respective clients on tiger-1 and tiger-3 in both tests. One
important property of this configuration is that because there are independent
network links, the ithreads for the two devices can be scheduled
independently, and run the network stack to completion via direct dispatch,
offering the opportunity for full parallelism in the TCP input path (subject
to limits in our locking model). Each sample was gathered for approximately
10 seconds during the run (about 40 seconds of CPU time across the 4 cores).
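For reference, samples like these come from the kernel lock profiling machinery. On a kernel built with "options LOCK_PROFILING", something along these lines gathers one (the sysctl names here are from memory and may differ between branches):

```shell
# Hedged sketch: enable lock profiling, let the workload run ~10s in
# steady state, then stop and dump.  Requires "options LOCK_PROFILING".
sysctl debug.lock.prof.reset=1       # clear any earlier counters
sysctl debug.lock.prof.enable=1      # start collecting
sleep 10                             # benchmark runs in steady state
sysctl debug.lock.prof.enable=0      # stop collecting
sysctl debug.lock.prof.stats         # per-acquisition-point statistics
```

The stats output can then be sorted on the wait_total column to produce tables like the ones below.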
In the netperf test, I used two independent TCP streams, one per interface
with the TCP stream benchmark in the steady state. This should essentially
consist of cheetah receiving large data packets and sending back small ACKs;
in principle the two workloads are entirely independent, although in practice
TCP locking doesn't allow that, and you get potential interactions due to the
memory allocator, scheduler, etc.
In the http test, I configured 32 workers each on tiger-1 and tiger-3, and
serviced them with a 128-worker httpd on cheetah. The file transferred was 1k,
and it was the same 1k file repeatedly sent via sendfile. Unlike the netperf
test, this resulted in very little steady state TCP traffic--the entire
request fits in one segment, and the file fits in a second segment. Also,
workers are presumably available to move back and forth between work sources,
and there's a single shared listen socket. I.e., opportunities for completely
independent operation are significantly reduced, and there are lots of
globally visible TCP state changes.
Netperf test top wait_total locks:
Seconds Instance
5.75s tcp_usrreq.c:729 (inp)
2.18s tcp_input.c:479 (inp)
1.67s tcp_input.c:400 (tcp)
0.32s uipc_socket.c:1424 (so_rcv)
0.28s tcp_input.c:1191 (so_rcv)
0.20s kern_timeout.c:419 (callout)
0.09s route.c:147 (radix node head)
...
In this test, the top four locking points are responsible for consuming 25% of
available CPU*. We can reasonably assume that the contention on 'inp' and to
a lesser degree 'so_rcv' is between the ithread and netserver processes for
each network interface and that they duke it out significantly due to
generating ACKs, moving data in and out of socket buffers, etc. Only the
'tcp' lock reflects interference between the two otherwise independent
sessions operating over the independent links.
Http test top wait_total locks:
Seconds Instance
8.50s tcp_input.c:400 (tcp)
2.21s tcp_usrreq.c:568 (tcp)
1.96s tcp_usrreq.c:955 (tcp)
0.78s subr_turnstile.c:546 (chain)
0.52s tcp_usrreq.c:606 (inp)
0.16s subr_turnstile.c:536 (chain)
0.13s tcp_input.c:2867 (so_rcv)
0.12s kern_timeout.c:419 (callout)
0.08s route.c:147 (radix node head)
...
In this test, the top four locking points are responsible for consuming 34% of
available CPU*. Here, it is clear that the global 'tcp' lock is responsible
for most of the suffering as a result of the lock getting held across the
input path for most packets. This is in contrast to the steady state flow, in
which most packets require only brief tcbinfo lookups and not extended holding
time required for packets that may lead to state changes (syn, fin, etc).
Also, this is the send path, which is directly dispatched from the user code
all the way to the interface queue or device driver, so there's no heavy
contention in the handoff between the two (and hence inp hammering) in this
direction. Jeff and Attilio tell me that the turnstile contention is simply a
symptom of heavy contention on the other mutexes in the workload, and not a
first-order effect.
These results appear to confirm that we need to look at breaking out the
tcbinfo lock as a key goal for 8.0, with serious thoughts about an MFC once
stabilized and if ABI-safe--these two workloads represent the extremes of TCP
workloads, and real world configurations will fall in between, but as the
number of cores goes up and the desire to spread work over CPUs goes up, so
will contention on a single global lock. An important part of this will be
establishing models for distributing the work over CPUs in such a way as to
avoid contention while still allowing load balancing.
Anyhow, just an early report as I continue my investigation of this issue...
* When talking about percentage of available CPUs, I make the assumption that
due to a sufficient quantity of CPUs, in most cases lock acquisition will
occur as a result of adaptive spinning rather than sleeping. In the netperf
case, this is not true, since the number of potential workers exceeds the
number of CPUs, hence the turnstile contention. However, as sleeping on
locks itself is very expensive, it's reasonable to assume we would recover
a lot of CPU nonetheless.
Robert N M Watson
Computer Laboratory
University of Cambridge