panics in soabort with so_count != 0, one possible solution to one cause.
Steve Read
steve.read at netasq.com
Wed Jan 9 15:22:19 UTC 2013
Context for this message:
http://www.freebsd.org/cgi/query-pr.cgi?pr=145825&cat=kern
kern/145825: [panic] panic: soabort: so_count
AND
http://www.freebsd.org/cgi/query-pr.cgi?pr=159621
kern/159621: [tcp] [panic] panic: soabort: so_count
The two PRs are essentially reporting the same thing, and I have seen
evidence of people reporting this panic against kernels as old as 6.2.
== Scenario ==
The basic scenario is:
1. There is a local listening TCP socket. A userland thread is waiting
on a kqueue, and will eventually call accept() on this socket.
2. A new TCP connection arrives that matches this listening socket. The
syncache holds the connection until the three-way handshake is complete
(i.e. the ACK arrives).
3. At this point, syncache_socket() calls sonewconn() and passes
SS_ISCONNECTED. As a result, sonewconn() hands the new socket off to the
accept queue and wakes up the userland thread (marks the listening
socket "readable", sends a kqueue notification, etc.).
4. Something goes wrong during the rest of syncache_socket(), as a
result of which it calls soabort().
== Consequence ==
On a single-CPU machine, the netisr thread that called syncache_socket()
blocks out the userland thread until it has finished, so so_count of the
new connected socket is still zero when syncache_socket() calls
soabort(). (It's not absolutely guaranteed, as there are calls to
locking functions along the way, but it usually happens.)
On a multi-CPU machine of any sort, the userland thread resumes as soon
as it is woken up, and it is possible (but not guaranteed) for it to grab
the socket and increment its so_count before syncache_socket() calls
soabort().
I have a core which shows the netisr thread hitting the panic in
soabort(), while the expected userland thread (on a different CPU) is
still in the kernel, churning through the post-pickup part of accept().
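To make the interleaving concrete, here is a small userland model of it.
It is only a sketch: struct model_socket and every model_* function are
hypothetical stand-ins invented for illustration, not kernel code. The
only things taken from the description above are the order of events and
the so_count == 0 condition that soabort() asserts.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few struct socket fields that matter here. */
struct model_socket {
	int  so_count;        /* references held; accept() takes one */
	bool on_accept_queue; /* visible to the listening thread */
};

/* Models sonewconn(lso, SS_ISCONNECTED): acceptable as soon as it exists. */
static void
model_sonewconn_connected(struct model_socket *so)
{
	so->so_count = 0;
	so->on_accept_queue = true;
}

/* Models the userland thread picking the socket up in accept(). */
static void
model_accept(struct model_socket *so)
{
	if (so->on_accept_queue) {
		so->on_accept_queue = false;
		so->so_count++;
	}
}

/* Models the assertion in soabort(): true means "no panic". */
static bool
model_soabort_ok(const struct model_socket *so)
{
	return (so->so_count == 0);
}

/*
 * Replays the failing path of syncache_socket().  accept_runs_first
 * selects the multi-CPU interleaving in which accept() wins the race.
 */
static bool
model_failing_path(bool accept_runs_first)
{
	struct model_socket so;

	model_sonewconn_connected(&so);
	if (accept_runs_first)
		model_accept(&so);      /* other CPU grabs the socket */
	return (model_soabort_ok(&so)); /* error path reaches soabort() */
}
```

In this model, model_failing_path(false) corresponds to the usual
single-CPU outcome and returns true, while model_failing_path(true)
corresponds to the core described above and returns false.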
== Proposed solution ==
My proposed solution to this issue is:
1. Replace SS_ISCONNECTED with 0 in the call to sonewconn() to prevent
it from waking up the listening thread.
2. At the "end" of syncache_socket(), call soisconnected(), passing the
new socket. This will issue the wakeup after syncache_socket() has
finished preparing itself, and in particular after the last possible
call to soabort().
I'm concerned, of course, that this may cause some unobvious fallout
somewhere, but I can't see for the moment what it would be. Any advice
would be welcome.
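The effect of the reordering can be sketched with a userland model. As
before, this is an illustration under stated assumptions rather than
kernel code: struct model_socket and the model_* functions are
hypothetical, and the model assumes that a socket created with
sonewconn(lso, 0) cannot be picked up by accept() until soisconnected()
has been called on it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few struct socket fields that matter here. */
struct model_socket {
	int  so_count;        /* references held; accept() takes one */
	bool on_accept_queue; /* visible to the listening thread */
};

/* Models sonewconn(lso, 0): created, but not yet acceptable (assumption). */
static void
model_sonewconn_pending(struct model_socket *so)
{
	so->so_count = 0;
	so->on_accept_queue = false;
}

/* Models soisconnected(so): only now can the listener see the socket. */
static void
model_soisconnected(struct model_socket *so)
{
	so->on_accept_queue = true;
}

/* Models the userland thread picking the socket up in accept(). */
static void
model_accept(struct model_socket *so)
{
	if (so->on_accept_queue) {
		so->on_accept_queue = false;
		so->so_count++;
	}
}

/*
 * Replays syncache_socket() with the proposed ordering.  Returns true if
 * the outcome is sane: either soabort() sees so_count == 0, or the
 * connection completes and accept() ends up holding exactly one reference.
 */
static bool
model_patched_path(bool setup_fails, bool accept_runs_early)
{
	struct model_socket so;

	model_sonewconn_pending(&so);
	if (accept_runs_early)
		model_accept(&so);           /* finds nothing to accept */
	if (setup_fails)
		return (so.so_count == 0);   /* soabort() assertion holds */
	model_soisconnected(&so);            /* deferred wakeup */
	model_accept(&so);
	return (so.so_count == 1);
}
```

Under these assumptions all four combinations of setup_fails and
accept_runs_early give a sane result, because the wakeup cannot precede
the last possible call to soabort().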
== Patch that applies the proposed solution ==
A patch that would apply to kernel 8.3 (the basic scenario appears to
still be feasible with HEAD, and the code is very similar):
======
--- netinet/tcp_syncache.c.orig 2013-01-09 13:18:05.000000000 +0000
+++ netinet/tcp_syncache.c 2013-01-09 14:03:54.000000000 +0000
@@ -638,7 +638,7 @@
 	 * connection when the SYN arrived. If we can't create
 	 * the connection, abort it.
 	 */
-	so = sonewconn(lso, SS_ISCONNECTED);
+	so = sonewconn(lso, 0);
 	if (so == NULL) {
 		/*
 		 * Drop the connection; we will either send a RST or
@@ -831,6 +831,8 @@
 	INP_WUNLOCK(inp);
+	soisconnected(so);
+
 	TCPSTAT_INC(tcps_accepts);
 	return (so);
======
-- Steve Read
steve.read at netasq.org