Unable to kill a non-zombie process with -9

Wed Oct 15 09:41:43 UTC 2014

Hi!

Today the problem reoccurred.
I've now debugged the problem a little furter.

I'm starting snort (as root).

<<<lots of startup logs for pid 22646>>>
Oct 15 08:46:59 snort[22646]: Initializing daemon mode
Oct 15 08:46:59 snort[22648]: Daemon initialized, signaled parent pid: 
22646
Oct 15 08:46:59 snort[22648]: Reload thread starting...
Oct 15 08:46:59 snort[22648]: Reload thread started, thread 0x8146e8800 
(22648)
End of log.

Error! Nothing more happens with the snort process!
Normally it should continue and log these lines as well:

snort[nnn]: Decoding Ethernet
snort[nnn]: Checking PID path...
snort[nnn]: PID path stat checked out ok, PID path set to /var/run/
snort[nnn]: Writing PID "7627" to file "/var/run//snort_mon0.pid"
snort[nnn]: Chroot directory = /usr/foobar/log
snort[nnn]: Set gid to 100
snort[nnn]: Set uid to 100
snort[nnn]:
snort[nnn]:         --== Initialization Complete ==--
snort[nnn]: Commencing packet processing (pid=nnn)

When looking at this half-started snort process with 'ps', it looks like 
this:

ps faxulwwj 22648
USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND 
UID PPID CPU PRI NI MWCHAN  PGID   SID JOBC
root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 
/usr/local/bin/s   0    1   0  88  0 -      22648 22648    0

The process is still owned by root, so just as the missing log lines are 
saying, it has not yet performed any change of uid/gid.

So there seem to be two questions.

Q1)
What happens between "Reload thread started, thread 0x8146e8800 (22648)" 
and "Decoding Ethernet"?
Apparently something goes wrong here on FreeBSD 10.0.
(this problem does not always occur, sometimes snort start just fine)

Q2)
When the process has frozen in this half-started state, it can't be killed 
even with a -9. Why?

John-Mark asked me for some debugging info. Here it is:

I now run 'kill 22648' on the above semi-started process:

     USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND 
UID PPID CPU PRI NI MWCHAN  PGID   SID JOBC
old root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 
/usr/local/bin/s   0    1   0  88  0 -      22648 22648    0
new root 22648 52.3  1.1 488552 179344  -  Rs    8:46AM 53:36.48 
/usr/local/bin/s   0    1   0  52  0 -      22648 22648    0

No change.

kill -9 22648

     USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND 
UID PPID CPU PRI NI MWCHAN  PGID   SID JOBC
old root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 
/usr/local/bin/s   0    1   0  88  0 -      22648 22648    0
new root 22648 37.7  1.1 488552 179344  -  Ts    8:46AM 53:50.87 
/usr/local/bin/s   0    1   0  52  0 -      22648 22648    0

Less CPU-usage and STAT changed to "Ts".

kill -CONT 22648

     USER   PID %CPU %MEM    VSZ    RSS TT  STAT STARTED     TIME COMMAND 
UID PPID CPU PRI NI MWCHAN  PGID   SID JOBC
old root 22648 51.8  1.1 488552 179344  -  Rs    8:46AM 53:06.52 
/usr/local/bin/s   0    1   0  88  0 -      22648 22648    0
new root 22648  0.0  1.1 488552 179344  -  Ts    8:46AM 53:50.88 
/usr/local/bin/s   0    1   0  52  0 -      22648 22648    0

No change except cpu is down to 0.

I now start 'kgdb'
info threads
I found two threads for snort, doing a bt for both of them:
   372 Thread 100602 (PID=22648: snort)  sched_switch 
(td=0xfffff802c061f490, newtd=<value optimized out>, flags=<value 
optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
   371 Thread 100598 (PID=22648: snort)  sched_switch 
(td=0xfffff80221857000, newtd=<value optimized out>, flags=<value 
optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
thread 372
[Switching to thread 372 (Thread 100602)]#0  sched_switch 
(td=0xfffff802c061f490, newtd=<value optimized out>, flags=<value 
optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
1962    in /usr/src/sys/kern/sched_ule.c
bt
#0  sched_switch (td=0xfffff802c061f490, newtd=<value optimized out>, 
flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
#1  0xffffffff808b8c1e in mi_switch (flags=266, newtd=0x0) at 
/usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff808c04b0 in thread_suspend_switch (td=0xfffff802c061f490) at 
/usr/src/sys/kern/kern_thread.c:883
#3  0xffffffff808c0276 in thread_single (mode=1) at 
/usr/src/sys/kern/kern_thread.c:713
#4  0xffffffff8087c1bb in exit1 (td=0xfffff802c061f490, rv=9) at 
/usr/src/sys/kern/kern_exit.c:180
#5  0xffffffff808b2faf in sigexit (td=<value optimized out>, sig=<value 
optimized out>) at /usr/src/sys/kern/kern_sig.c:2935
#6  0xffffffff808b3669 in postsig (sig=<value optimized out>) at 
/usr/src/sys/kern/kern_sig.c:2822
#7  0xffffffff808f6f57 in ast (framep=<value optimized out>) at 
/usr/src/sys/kern/subr_trap.c:271
#8  0xffffffff80c75870 in Xfast_syscall () at 
/usr/src/sys/amd64/amd64/exception.S:416
#9  0x0000000801d6f19a in ?? ()
Previous frame inner to this frame (corrupt stack?)

thread 371
[Switching to thread 371 (Thread 100598)]#0  sched_switch 
(td=0xfffff80221857000, newtd=<value optimized out>, flags=<value 
optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
1962    in /usr/src/sys/kern/sched_ule.c
bt
#0  sched_switch (td=0xfffff80221857000, newtd=<value optimized out>, 
flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1962
#1  0xffffffff808b8c1e in mi_switch (flags=260, newtd=0x0) at 
/usr/src/sys/kern/kern_synch.c:494
#2  0xffffffff808f2e3a in sleepq_wait (wchan=0x0, pri=0) at 
/usr/src/sys/kern/subr_sleepqueue.c:620
#3  0xffffffff80864aad in _cv_wait (cvp=0xffffffff8147a500, 
lock=0xffffffff8147a480) at /usr/src/sys/kern/kern_condvar.c:139
#4  0xffffffff808fb05f in vmem_xalloc (vm=0xffffffff8147a480, size0=<value 
optimized out>, align=<value optimized out>, phase=0, nocross=<value 
optimized out>, minaddr=0, maxaddr=18446735286768857088, flags=8194, 
addrp=<value optimized out>) at /usr/src/sys/kern/subr_vmem.c:1196
#5  0xffffffff808fae6b in vmem_alloc (vm=0x0, size=0, flags=<value 
optimized out>, addrp=0xfffffe0466e1d6e8) at 
/usr/src/sys/kern/subr_vmem.c:1082
#6  0xffffffff80b0fa58 in kmem_malloc (vmem=0xffffffff8147a480, 
size=2139729920, flags=2) at /usr/src/sys/vm/vm_kern.c:314
#7  0xffffffff80b08dfb in uma_large_malloc (size=<value optimized out>, 
wait=2) at /usr/src/sys/vm/uma_core.c:1006
#8  0xffffffff80898cf3 in malloc (size=2139729920, mtp=0xffffffff813a0450, 
flags=0) at /usr/src/sys/kern/kern_malloc.c:520
#9  0xffffffff8096307b in bpf_buffer_ioctl_sblen (d=0xfffff80159ea9000, 
i=<value optimized out>) at /usr/src/sys/net/bpf_buffer.c:183
#10 0xffffffff80960a3c in bpfioctl (dev=0x0, cmd=<value optimized out>, 
addr=0xfffff801fbd06b40 "", flags=0, td=0xfffff80221857000) at 
/usr/src/sys/net/bpf.c:408
#11 0xffffffff807ac1df in devfs_ioctl_f (fp=0xfffff8002b3d9d20, 
com=3221504614, data=0xfffff801fbd06b40, cred=<value optimized out>, 
td=0xfffff80221857000) at /usr/src/sys/fs/devfs/devfs_vnops.c:757
#12 0xffffffff808fdfae in kern_ioctl (td=0xfffff80221857000, fd=<value 
optimized out>, com=0) at file.h:319
#13 0xffffffff808fdd2f in sys_ioctl (td=0xfffff80221857000, 
uap=0xfffffe0466e1da40) at /usr/src/sys/kern/sys_generic.c:702
#14 0xffffffff80c8f117 in amd64_syscall (td=0xfffff80221857000, traced=0) 
at subr_syscall.c:134
#15 0xffffffff80c7580b in Xfast_syscall () at 
/usr/src/sys/amd64/amd64/exception.S:391
#16 0x0000000801d8f08a in ?? ()
Previous frame inner to this frame (corrupt stack?)

Let me know if I can debug this any further.

/Elof

On Thu, 9 Oct 2014, John-Mark Gurney wrote:

> elof2 at sentor.se wrote this message on Wed, Oct 08, 2014 at 13:30 +0200:
>>
>> I guess this is a bug report for FreeBSD 10.0.
>>
>>
>>
>> Sometimes I can't kill my snort process on FreeBSD 10.0.
>> It won't die, even with kill -9.
>>
>> I'm not talking about a zombie process. Snort is a process that should
>> die normally.
>> I've run snort on over 100 nodes since FreeBSD v6.x and I've never seen
>> this behavior until now in FreeBSD 10.0.
>>
>>
>> Example:
>>
>> #ps faxuw
>> USER      PID  %CPU %MEM    VSZ    RSS TT  STAT STARTED        TIME
>> COMMAND
>> root    49222  53.4  2.2 492648 183012  -  Rs   11:46AM     7:05.59
>> /usr/local/bin/snort -q -D -c snort.conf
>> root    47937   0.0  2.2 488552 182864  -  Ts   10:56AM    29:35.98
>> /usr/local/bin/snort -q -D -c snort.conf
>
> What is the MWCHAN?  add l to the ps command...
>
>> The pid 47937 has been killed (repeatedly) with -9.
>> Its status is "Ts" meaning it is Stopped.
>
> have you tried to kill -CONT <pid> to resume it?
>
>> But it won't actually die and disappear. The only way to get rid of it
>> seem to be to reboot the machine. :-(
>>
>> (pid 49222 is the new process that was started after 47937 was killed)
>>
>>
>> The problem doesn't happen all the time and I haven't found any patterns
>> as to when it does. :-(
>> If I restart snort once every day, it fails to die approximately 2-4 times
>> per month.
>> Even though the problem doesn't happen on every kill, it is a definately a
>> recurring event.
>
> Can you run kgdb on the machine? (yes, it works on a live machine), use
> info threads to find the thread id, and then use thread <threadid> to
> switch to it, and run bt to get a back trace...
>
>> I began to see it on a heavily loaded 10GE sensor, so I thought it could
>> have something to do with the ix driver, or the heavy load.
>> But now another FreeBSD 10.0-sensor had the exact same problem, and this
>> sensor don't have any 10GE NICs. In fact, this sensor has been running
>> just fine with both FreeBSD 9.1 and 9.3 for the past years. Snort has
>> always terminated correctly! After I reinstalled this machine with FreeBSD
>> 10.0 last friday, snort has then terminated correctly every day until
>> today, when it failed with the above pid 47937. (this sensor use the 'em'
>> driver, not 'ixgbe')
>>
>> I'm running snort with the same configuration, settings, version, daq,
>> libs, etc on 10.0 as I do on 9.3.
>> None of the 9.3 sensors have this problem, so it has to be something new
>> in FreeBSD 10.0.
>
> --
>  John-Mark Gurney				Voice: +1 415 225 5579
>
>     "All that I will do, has been done, All that I have, has not."
>