panics in soabort with so_count != 0, one possible solution to one cause.

Wed Jan 9 17:12:56 UTC 2013

On Wed, Jan 09, 2013 at 04:22:15PM +0100, Steve Read wrote:
S> Context for this message:
S> http://www.freebsd.org/cgi/query-pr.cgi?pr=145825&cat=kern 
S> kern/145825: [panic] panic: soabort: so_count
S> 
S> AND
S> 
S> http://www.freebsd.org/cgi/query-pr.cgi?pr=159621
S> kern/159621: [tcp] [panic] panic: soabort: so_count
S> 
S> The two PRs are essentially reporting the same thing, and I have seen 
S> evidence of people reporting this panic against kernels as old as 6.2.
S> 
S> == Scenario ==
S> The basic scenario is:
S> 1. There is a local listening TCP socket.  A userland thread is waiting 
S> on a kqueue, and will eventually call accept() on this socket.
S> 2. A new TCP connection arrives that matches this TCP socket.  Syncache 
S> hangs on to the connection until the three-way handshake is complete 
S> (i.e. the ACK arrives).
S> 3. At this point, syncache_socket() calls sonewconn() and passes 
S> SS_ISCONNECTED.  sonewconn() as a result hands the new socket off to the 
S> accept queue and wakes up the userland thread (marks the listening 
S> socket "readable", sends a kqueue notification, etc.).
S> 4. Something goes wrong during the rest of syncache_socket(), as a 
S> result of which it calls soabort().
S> 
S> == Consequence ==
S> On a single-CPU machine, the netisr thread that called syncache_socket() 
S> blocks out the userland thread until it has finished, so so_count of the 
S> new connected socket is still zero when syncache_socket() calls 
S> soabort().  (It's not absolutely guaranteed, as there are calls to 
S> locking functions along the way, but it usually happens.)
S> 
S> On a multi-CPU machine of any sort, the userland thread resumes as soon 
S> as it is woken up, and it is possible (but not guaranteed) for it to 
S> grab the socket and increment its so_count before syncache_socket() 
S> calls soabort().
S> 
S> I have a core which shows the netisr thread hitting the panic in 
S> soabort(), while the expected userland thread (on a different CPU) is 
S> still in the kernel, churning through the post-pickup part of accept().
S> 
S> == Proposed solution ==
S> My proposed solution to this issue is:
S> 1. Replace SS_ISCONNECTED with 0 in the call to sonewconn() to prevent 
S> it from waking up the listening thread.
S> 2. At the "end" of syncache_socket(), call soisconnected(), passing the 
S> new socket.  This will issue the wakeup after syncache_socket() has 
S> finished preparing itself, and in particular after the last possible 
S> call to soabort().
S> 
S> I'm concerned, of course, that this may cause some non-obvious fallout 
S> somewhere, but I can't see for the moment what it would be.  Any advice 
S> would be welcome.
S> 
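
To make the ordering concrete, here is a minimal user-space sketch of the
scenario and of the proposed reordering.  It is a toy model, not kernel
code: every toy_* function, TOY_ISCONNECTED and the on_acceptq flag are
hypothetical stand-ins for sonewconn(), soisconnected(), accept() and
soabort(), and the "other CPU" is simulated by calling toy_accept() at the
earliest point where it could possibly run.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical and are not kernel APIs. */
#define TOY_ISCONNECTED 0x1

struct toy_socket {
    int so_count;    /* references taken by the accepting thread */
    int on_acceptq;  /* nonzero once accept() is able to find the socket */
};

/* Stand-in for sonewconn(): publish the socket at once only if asked to. */
static void toy_sonewconn(struct toy_socket *so, int state)
{
    so->so_count = 0;
    so->on_acceptq = (state & TOY_ISCONNECTED) != 0;
}

/* Stand-in for soisconnected(): publish the socket and wake the acceptor. */
static void toy_soisconnected(struct toy_socket *so)
{
    so->on_acceptq = 1;
}

/* Stand-in for the accepting thread running on another CPU. */
static void toy_accept(struct toy_socket *so)
{
    assert(so != NULL);
    if (so->on_acceptq)
        so->so_count++;
}

/* Stand-in for soabort(): returns 1 where the kernel would panic. */
static int toy_soabort(const struct toy_socket *so)
{
    return so->so_count != 0;
}

/*
 * Replay the failing path with the worst-case interleaving: the acceptor
 * runs as early as it can.  Returns 1 if the abort would panic.
 */
static int toy_failing_handshake(int state)
{
    struct toy_socket so;

    toy_sonewconn(&so, state);
    toy_accept(&so);           /* other CPU races ahead */
    return toy_soabort(&so);   /* step 4: something went wrong */
}

/* Success path under the proposed ordering; returns the final so_count. */
static int toy_successful_handshake(void)
{
    struct toy_socket so;

    toy_sonewconn(&so, 0);
    toy_accept(&so);           /* too early: socket is not yet visible */
    toy_soisconnected(&so);    /* last step, after any possible abort */
    toy_accept(&so);
    return so.so_count;
}
```

In this model toy_failing_handshake(TOY_ISCONNECTED) returns 1 (the abort
finds so_count != 0), while toy_failing_handshake(0) returns 0, because the
socket does not become visible to the acceptor until toy_soisconnected()
runs.  The real kernel paths involve locking and queue handling that the
sketch deliberately leaves out.
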
S> == Patch that applies the proposed solution ==
S> A patch that would apply to kernel 8.3 (the basic scenario appears to 
S> still be feasible with HEAD, and the code is very similar):
S> 
S> ======
S> --- netinet/tcp_syncache.c.orig    2013-01-09 13:18:05.000000000 +0000
S> +++ netinet/tcp_syncache.c    2013-01-09 14:03:54.000000000 +0000
S> @@ -638,7 +638,7 @@
S>        * connection when the SYN arrived.  If we can't create
S>        * the connection, abort it.
S>        */
S> -    so = sonewconn(lso, SS_ISCONNECTED);
S> +    so = sonewconn(lso, 0);
S>       if (so == NULL) {
S>           /*
S>            * Drop the connection; we will either send a RST or
S> @@ -831,6 +831,8 @@
S> 
S>       INP_WUNLOCK(inp);
S> 
S> +    soisconnected(so);
S> +
S>       TCPSTAT_INC(tcps_accepts);
S>       return (so);

AFAIU, in HEAD this race was fixed by r243627. Can Vijay and Andre comment?

-- 
Totus tuus, Glebius.