kern/183659: TCP stack lock contention with short-lived connections
Julien
jcharbon at verisign.com
Mon Nov 4 13:30:02 UTC 2013
>Number: 183659
>Category: kern
>Synopsis: TCP stack lock contention with short-lived connections
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Mon Nov 04 13:30:02 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator: Julien
>Release: FreeBSD 9.2
>Organization:
Verisign
>Environment:
FreeBSD host 9.2-RELEASE FreeBSD 9.2-RELEASE #0 r+e612cc1: Thu Oct 17 15:41:43 UTC 2013 root at host:/usr/obj/freebsd-vrsn/sys/GENERIC amd64
>Description:
Our goal is to achieve the highest number of TCP queries per second (QPS)
on given hardware. A TCP query is defined here as: establishing a TCP
connection, sending a small request, reading the small response, and closing
the connection.
By configuring a single NIC receive queue bound to a single CPU core, TCP
performance results on FreeBSD are great: we got ~52k QPS before becoming CPU
bound, and we achieved the same result on Linux.
However, with 4 NIC receive queues each bound to a different core of
the same CPU, results are lower than expected: we got only ~56k QPS, whereas we
reached ~200k QPS on Linux on the same hardware.
We investigated the cause of this performance scaling issue: PMC profiling
showed that more than half of CPU time was spent in _rw_rlock() and _rw_wlock_hard(),
and lock profiling then showed contention on the ipi_lock of the TCP pcbinfo
structure (ipi_lock being acquired through the INP_INFO_*LOCK macros).
Below is our lock profiling result, ordered by "total accumulated lock wait
time" (lock profiling done on vanilla FreeBSD 9.2):
# sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -20
debug.lock.prof.stats:
max wait_max total wait_total count avg wait_avg cnt_hold cnt_lock name
265 39477 4669994 57027602 840000 5 67 0 780171 sys/netinet/tcp_usrreq.c:728 (rw:tcp)
248 39225 9498849 39991390 2044168 4 19 0 1919503 sys/netinet/tcp_input.c:775 (rw:tcp)
234 39474 589181 39241879 840000 0 46 0 702845 sys/netinet/tcp_usrreq.c:984 (rw:tcp)
234 43262 807708 22694780 840000 0 27 0 814240 sys/netinet/tcp_usrreq.c:635 (rw:tcp)
821 39218 8592541 22346613 1106252 7 20 0 1068157 sys/netinet/tcp_input.c:1019 (rw:tcp)
995 37316 1210480 6822269 343692 3 19 0 324585 sys/netinet/tcp_input.c:962 (rw:tcp)
The top 6 lock profiling entries all relate to the same INP_INFO
(rw:tcp) lock; more details below:
#1 In tcp_usr_shutdown()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L728
#2 In tcp_input() for SYN/FIN/RST TCP packets
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L775
#3 In tcp_usr_close()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L984
#4 In tcp_usr_accept()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L635
#5 In tcp_input() for incoming TCP packets when the corresponding connection
is not in the ESTABLISHED state, typically the client's final ACK of the TCP
three-way handshake that completes the connection.
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L1019
#6 In tcp_input() for incoming TCP packets when the corresponding connection is
in the TIME_WAIT state
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L962
Our explanation for this lock contention is straightforward: our typical workload
entails the following packet sequence:
Received TCP packets: Sent TCP packets:
#1 SYN ->
<- SYN-ACK
#2 ACK ->
#3 query data ->
<- ACK + response data
<- FIN
#4 ACK ->
#5 FIN ->
<- ACK
For received packets #1, #2, #4 and #5 the write lock on INP_INFO is required in
tcp_input(); only #3 does not require this lock. This means that only 1/5th of
all received packets can be processed in parallel across the entire TCP stack.
Moreover, the lock is also required in all major TCP syscalls: tcp_usr_shutdown(),
tcp_usr_close() and tcp_usr_accept().
We are aware that achieving a rate of 200k TCP connections per second is a
specific goal, but better TCP connection setup/teardown scalability could
benefit other TCP network services as well.
>How-To-Repeat:
Below are details to reproduce this performance contention issue using open
source software:
o Software used:
- TCP client: ab version 2.4
- TCP server: nginx version 1.4.2
o Software configurations:
- server: See attached nginx.conf
Core binding on our 12 cores server:
- The 4 NIC receive queues are bound to cores 0, 1, 2 and 3.
- The 4 nginx workers are bound to cores 4, 5, 6 and 7.
- client: launch:
$ for i in $(seq 0 11); do \
taskset -c $i ab -c 2000 -n 1000000 http://server/test.html & done
Note: We use the same Linux load driver against both Linux and FreeBSD;
we did not try to launch ab from a FreeBSD box, sorry.
- The 'test.html' HTML page is simply:
<html><head><title>Title</title></head><body><p>Body</p></body></html>
You should get:
- TCP request size: 92 bytes
- TCP response size: 206 bytes
o Tunables/sysctls parameters:
- Main tunables to tune:
# We want 4 receive queues
hw.ixgbe.num_queues=4
# Other tunables
kern.ipc.maxsockets
kern.ipc.nmbclusters
kern.maxfiles
kern.maxfilesperproc
net.inet.tcp.hostcache.hashsize
net.inet.tcp.hostcache.cachelimit
net.inet.tcp.hostcache.bucketlimit
net.inet.tcp.tcbhashsize
net.inet.tcp.syncache.hashsize
net.inet.tcp.syncache.bucketlimit
- sysctl to tune:
# Values to increase
kern.ipc.maxsockbuf
kern.ipc.somaxconn
net.inet.tcp.maxtcptw
o Monitoring tools:
- We use i7z (from the sysutils/i7z port) to check when the server is CPU
bound; it should show 100% in C0 state ("Processor running without halting")
on the cores bound to NIC receive queues:
----
True Frequency (without accounting Turbo) 3325 MHz
Socket [0] - [physical cores=6, logical cores=6, max online cores ever=6]
CPU Multiplier 25x || Bus clock frequency (BCLK) 133.00 MHz
TURBO ENABLED on 6 Cores, Hyper Threading OFF
Max Frequency without considering Turbo 3458.00 MHz (133.00 x [26])
Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is 27x/27x/26x/26x/26x/26x
Real Current Frequency 3373.71 MHz (Max of below)
Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % Temp
Core 1 [0]: 3370.76 (25.34x) 103 0 0 0 43
Core 2 [1]: 3361.13 (25.27x) 103 0 0 0 42
Core 3 [2]: 3373.71 (25.37x) 105 0 0 0 43
Core 4 [3]: 3339.75 (25.11x) 106 0 0 0 42
Core 5 [4]: 3323.90 (24.99x) 65.9 34.1 0 0 42
Core 6 [5]: 3323.90 (24.99x) 65.9 34.1 0 0 41
Socket [1] - [physical cores=6, logical cores=6, max online cores ever=6]
CPU Multiplier 25x || Bus clock frequency (BCLK) 133.00 MHz
TURBO ENABLED on 6 Cores, Hyper Threading OFF
Max Frequency without considering Turbo 3458.00 MHz (133.00 x [26])
Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is 27x/27x/26x/26x/26x/26x
Real Current Frequency 3309.13 MHz (Max of below)
Core [core-id] :Actual Freq (Mult.) C0% Halt(C1)% C3 % C6 % Temp
Core 1 [6]: 3309.13 (24.88x) 47.5 52.8 0 0 43
Core 2 [7]: 3308.36 (24.87x) 48 52.3 0 0 42
Core 3 [8]: 3266.36 (24.56x) 1 99.6 0 0 34
Core 4 [9]: 3244.74 (24.40x) 1 99.6 0 0 33
Core 5 [10]: 3274.51 (24.62x) 1 99.4 0 0 38
Core 6 [11]: 3244.08 (24.39x) 1 99.5 0 0 36
C0 = Processor running without halting
C1 = Processor running with halts (States >C0 are power saver)
C3 = Cores running with PLL turned off and core cache turned off
C6 = Everything in C3 + core state saved to last level cache
----
o PMC profiling:
The flat profile of 'unhalted-cycles' on core 1 should look like:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
55.6 198867.00 198867.00 291942 681.19 681.43 _rw_wlock_hard [7]
2.6 208068.00 9201.00 8961 1026.78 4849.93 tcp_do_segment [14]
2.4 216592.00 8524.00 86 99116.28 102597.15 sched_idletd [26]
2.3 224825.00 8233.00 8233 1000.00 1000.00 _rw_rlock [27]
1.9 231498.00 6673.00 12106 551.21 27396.73 ixgbe_rxeof [2]
1.4 236638.00 5140.00 310457 16.56 1004.01 tcp_input [6]
1.2 241074.00 4436.00 5051 878.24 1000.00 in_pcblookup_hash_locked [36]
1.2 245317.00 4243.00 4243 1000.00 1000.00 bcopy [39]
1.1 249392.00 4075.00 2290 1779.48 3295.95 knote [30]
1.0 252956.00 3564.00 366 9737.70 18562.04 ixgbe_mq_start [31]
0.9 256274.00 3318.00 7047 470.84 3348.41 _syncache_add [16]
0.8 259312.00 3038.00 3038 1000.00 1000.00 bzero [51]
0.8 262269.00 2957.00 6253 472.89 2900.54 ip_output [18]
0.8 264978.00 2709.00 3804 712.15 1009.00 callout_lock [42]
0.6 267185.00 2207.00 2207 1000.00 1000.00 memcpy [64]
0.6 269273.00 2088.00 365 5720.55 7524.50 ixgbe_xmit [56]
0.6 271321.00 2048.00 2048 1000.00 1000.00 bcmp [67]
0.6 273291.00 1970.00 1970 1000.00 1000.73 _rw_runlock [68]
0.5 275188.00 1897.00 1897 1000.00 1000.00 syncache_lookup [71]
And the 'unhalted-cycles' call graph profile of _rw_wlock_hard on core 1:
0.68 0.00 1/291942 tcp_slowtimo [331]
10.22 0.00 15/291942 syncache_expand [17]
35.42 0.01 52/291942 in_pcbdrop [77]
126.70 0.04 186/291942 tcp_usr_detach [65]
208.44 0.07 306/291942 tcp_usr_attach [34]
2094.65 0.73 3075/291942 in_pcblookup_hash [22]
196390.89 68.73 288307/291942 tcp_input [6]
[7] 55.6 198867.00 69.60 291942 _rw_wlock_hard [7]
24.96 14.43 39/50 turnstile_trywait [216]
7.20 5.71 12/15 turnstile_cancel [258]
4.00 6.90 3/3 turnstile_wait [275]
3.02 0.00 3/1277 critical_enter [87]
2.13 0.25 2/2061 spinlock_exit <cycle 1> [94]
0.00 1.00 1/1 lockstat_nsecs [320]
>Fix:
This lock contention is tricky to fix: the main pain point is that this global TCP lock does not only protect globally shared data, it also creates a critical section for the whole TCP stack. Restructuring the TCP stack locking in one shot could therefore lead to complex race conditions and make tests and reviews impractical.
Our current strategy to lower risk is to break down this lock contention mitigation task:
1. Remove INP_INFO lock from locations it is not actually required
2. Replace INP_INFO lock by more specific locks where appropriate
3. Change lock order from "INP_INFO lock (before) INP" to "INP lock (before) INP_INFO"
4. Then push INP_INFO lock deeper in the stack where appropriate
5. Introduce an INP_HASH_BUCKET lock replacing INP_INFO where appropriate
Note: By "where appropriate" we mean TCP stack parts where INP_INFO is a proven major contention point _and_ where the side effects of the change are clear enough to be reviewed. The main goal is to ease testing and review of each step.
Patch attached with submission follows:
worker_processes 4;
events {
use kqueue;
worker_connections 1048576;
multi_accept off;
}
timer_resolution 1000ms;
worker_cpu_affinity 000000100000 000000010000 000001000000 000010000000;
http {
types {}
default_type text/html;
sendfile on;
tcp_nopush on;
keepalive_timeout 0;
#keepalive_timeout 65;
lingering_close always;
lingering_time 2s;
lingering_timeout 2s;
etag off;
add_header Last-Modified "";
max_ranges 0;
server_tokens off;
open_file_cache max=10 inactive=120s;
server {
listen 80;
server_name localhost;
location / {
root /usr/local/www/nginx;
index index.html index.htm;
}
error_page 500 502 503 504 /50x.html;
location = /50x.html {
root /usr/local/www/nginx-dist;
}
}
}
>Release-Note:
>Audit-Trail:
>Unformatted: