[Bug 212920] High-loaded web server catches race condition in _close() from /lib/libc.so.7 with accf_http

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Fri Sep 23 10:10:05 UTC 2016


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212920

            Bug ID: 212920
           Summary: High-loaded web server catches race condition in
                    _close() from /lib/libc.so.7 with accf_http
           Product: Base System
           Version: 10.3-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs at FreeBSD.org
          Reporter: fbsd98816551 at avksrv.org
                CC: freebsd-amd64 at FreeBSD.org

Hello!

Recently we upgraded our high-loaded web server to FreeBSD 10.3-STABLE r305091
and ran into a problem with nginx (nginx-1.10.1_2,2, compiled from the latest
ports with mostly default settings). After some time one worker stops answering
requests, and top shows it in the "soclos" state:
 1072 nobody           1  22    0  1698M 65680K soclos  5   0:13   0.00% nginx

After a short while the next worker stops in the same state, and so on until
all workers are in "soclos" and the web server stops serving requests (it still
accepts connections, but they die on timeout after the client sends a request).
Increasing the worker count only postpones the problem by about half an hour.

Restarting nginx fixes it only for a short time. The server is fairly heavily
loaded, at 1000-2000 requests/sec; it is a frontend proxy with proxy_cache
functionality. We tried two different physical servers with different NICs and
CPUs. When we rolled back the kernel (only the kernel and modules in
/boot/kernel, not world) to r302223, the problem went away.

We tried upgrading to yesterday's r306194; the problem is still there.
Something changed in the kernel code between the end of June and the end of
August that triggers it.

Backtrace from an nginx worker while it is in "soclos":

#0  0x0000000801a17d28 in _close () from /lib/libc.so.7
#1  0x000000080098a925 in pthread_suspend_all_np () from /lib/libthr.so.3
#2  0x00000000004329b9 in ngx_close_connection (c=0x869c1de70) at
src/core/ngx_connection.c:1169
#3  0x0000000000486370 in ngx_http_close_connection (c=0x869c1de70) at
src/http/ngx_http_request.c:3543
#4  0x0000000000488e86 in ngx_http_close_request (r=0x80244c050, rc=408) at
src/http/ngx_http_request.c:3406
#5  0x000000000048d9ed in ngx_http_process_request_headers (rev=0x807810b70) at
src/http/ngx_http_request.c:1202
#6  0x000000000044fdbd in ngx_event_expire_timers () at
src/event/ngx_event_timer.c:94
#7  0x000000000044e60f in ngx_process_events_and_timers (cycle=0x802488050) at
src/event/ngx_event.c:256
#8  0x000000000045f406 in ngx_worker_process_cycle (cycle=0x802488050,
data=0xa) at src/os/unix/ngx_process_cycle.c:753
#9  0x000000000045ae7c in ngx_spawn_process (cycle=0x802488050, proc=0x45f2f0
<ngx_worker_process_cycle>, data=0xa, name=0x53ecea "worker process",
respawn=-3) at src/os/unix/ngx_process.c:198
#10 0x000000000045cc89 in ngx_start_worker_processes (cycle=0x802488050, n=16,
type=-3) at src/os/unix/ngx_process_cycle.c:358
#11 0x000000000045c486 in ngx_master_process_cycle (cycle=0x802488050) at
src/os/unix/ngx_process_cycle.c:130
#12 0x0000000000413288 in main (argc=1, argv=0x7fffffffead0) at
src/core/nginx.c:367

(gdb) list src/core/ngx_connection.c:1169
1164   
1165        if (c->shared) {
1166            return;
1167        }
1168   
1169        if (ngx_close_socket(fd) == -1) { <<<<<<<<
1170   
1171            err = ngx_socket_errno;
1172   
1173            if (err == NGX_ECONNRESET || err == NGX_ENOTCONN) {

ngx_close_socket is actually a plain close(fd):
#define ngx_close_socket    close

All TCP sessions opened by the worker freeze in their current state.

If we do not load accf_http and do not use it in the nginx config, the problem
does not reproduce with any of the three tested kernels.

The kernel is GENERIC, with only the accf_http, ipmi, smbus, mfip, ums, zfs
and opensolaris modules loaded in addition.

Since accf_http does our server some good, we cannot simply disable the module
in the production environment.
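For reference, this is roughly how the filter is wired up; the fragment is an
illustrative sketch (the server block is hypothetical, not our actual config),
using the standard loader.conf knob and the nginx accept_filter listen
parameter:

```nginx
# /boot/loader.conf (or at runtime: kldload accf_http):
#   accf_http_load="YES"

# nginx.conf -- hypothetical fragment:
server {
    listen 80 accept_filter=httpready;  # accept only after a full HTTP request
    server_name example.org;            # placeholder name
}
```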

I will keep debugging, but since I am not a good C programmer it will take
some time. If someone knows what changed in the related functions, maybe it
would be faster to check from that side.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

