10.2 - Process stuck in unkillable sleep

Wed Feb 24 13:37:40 UTC 2016

On 2016-02-24 08:18, Konstantin Belousov wrote:
> On Wed, Feb 24, 2016 at 02:26:19PM +1000, Paul Koch wrote:
>> 
>> Occasionally we see a process get stuck in an unkillable state and
>> the only solution is a hard reboot.
>> 
>> Occasionally == once every two weeks across 60+ servers, which are
>> spread across the globe in customer sites.  We have no remote access
>> to these boxes.
>> 
>> The process that most often gets stuck (though not the only one) is a
>> large-scale Ping/SNMP poller.  It is a fairly simplistic C program that
>> just fires out lots of ping (raw ICMP socket) and SNMP (UDP socket)
>> requests asynchronously.
>> 
>> We've managed to trap the problem a few times on a test server running
>> in VirtualBox, but it also occurs at customer sites that run VMware,
>> Hyper-V, QEMU and on bare metal.
>> 
>> 
>> We raised this PR
>>  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204081
>> 
>> but suspect it is a similar/same issue as
>>  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
>> 
>> This is the info we've gathered from the most recent time it has occurred:
>> 
>> 
>> # uname -a
>> FreeBSD shed153.akips.com 10.2-RELEASE-p12 FreeBSD 10.2-RELEASE-p12 #0 r295070: Sat Jan 30 20:03:44 UTC 2016  root@shed21.akips.com:/usr/obj/usr/src/sys/GENERIC amd64
> 
>> # ps auxww | grep nm-poller
>> akips    1014   0.0  2.6 871820 106540  -  Ds   10Feb16  1078:59.06 nm-poller
>> 
>> 
>> # procstat -k 1014
>>   PID    TID COMM       TDNAME   KSTACK
>>  1014 100365 nm-poller  -        mi_switch sleepq_timedwait_sig _cv_timedwait_sig_sbt seltdwait kern_select sys_select amd64_syscall Xfast_syscall
>> 
> 
> Yes, on HEAD it was reported that https://reviews.freebsd.org/D5221
> fixed the problem.  It is still not reviewed.
> 
> I did a back-port to stable/10; the patch below is probably not
> applicable to 10.2, you would need 10.3 for it.  Some revisions are
> missing from stable/10, but I think that the issue worked around in the
> patch is at the core of the troubles many people reported.
> 
> Index: sys/kern/kern_timeout.c
> ===================================================================
> --- sys/kern/kern_timeout.c	(revision 295966)
> +++ sys/kern/kern_timeout.c	(working copy)
> @@ -1127,7 +1127,7 @@ _callout_stop_safe(c, safe)
>  	 * Some old subsystems don't hold Giant while running a callout_stop(),
>  	 * so just discard this check for the moment.
>  	 */
> -	if (!safe && c->c_lock != NULL) {
> +	if ((safe & CS_DRAIN) == 0 && c->c_lock != NULL) {
>  		if (c->c_lock == &Giant.lock_object)
>  			use_lock = mtx_owned(&Giant);
>  		else {
> @@ -1207,7 +1207,7 @@ again:
>  			return (0);
>  		}
> 
> -		if (safe) {
> +		if ((safe & CS_DRAIN) != 0) {
>  			/*
>  			 * The current callout is running (or just
>  			 * about to run) and blocking is allowed, so
> @@ -1319,7 +1319,7 @@ again:
>  			CTR3(KTR_CALLOUT, "postponing stop %p func %p arg %p",
>  			    c, c->c_func, c->c_arg);
>  			CC_UNLOCK(cc);
> -			return (0);
> +			return ((safe & CS_MIGRBLOCK) != 0);
>  		}
>  		CTR3(KTR_CALLOUT, "failed to stop %p func %p arg %p",
>  		    c, c->c_func, c->c_arg);
> Index: sys/kern/subr_sleepqueue.c
> ===================================================================
> --- sys/kern/subr_sleepqueue.c	(revision 295966)
> +++ sys/kern/subr_sleepqueue.c	(working copy)
> @@ -572,7 +572,8 @@ sleepq_check_timeout(void)
>  	 * another CPU, so synchronize with it to avoid having it
>  	 * accidentally wake up a subsequent sleep.
>  	 */
> -	else if (callout_stop(&td->td_slpcallout) == 0) {
> +	else if (_callout_stop_safe(&td->td_slpcallout, CS_MIGRBLOCK)
> +	    == 0) {
>  		td->td_flags |= TDF_TIMEOUT;
>  		TD_SET_SLEEPING(td);
>  		mi_switch(SW_INVOL | SWT_SLEEPQTIMO, NULL);
> Index: sys/sys/callout.h
> ===================================================================
> --- sys/sys/callout.h	(revision 295966)
> +++ sys/sys/callout.h	(working copy)
> @@ -62,6 +62,9 @@ struct callout_handle {
>  	struct callout *callout;
>  };
> 
> +#define	CS_DRAIN		0x0001
> +#define	CS_MIGRBLOCK		0x0002
> +
>  #ifdef _KERNEL
>  /*
>   * Note the flags field is actually *two* fields. The c_flags
> @@ -81,7 +84,7 @@ struct callout_handle {
>   */
>  #define	callout_active(c)	((c)->c_flags & CALLOUT_ACTIVE)
>  #define	callout_deactivate(c)	((c)->c_flags &= ~CALLOUT_ACTIVE)
> -#define	callout_drain(c)	_callout_stop_safe(c, 1)
> +#define	callout_drain(c)	_callout_stop_safe(c, CS_DRAIN)
>  void	callout_init(struct callout *, int);
>  void	_callout_init_lock(struct callout *, struct lock_object *, int);
>  #define	callout_init_mtx(c, mtx, flags)					\
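
A minimal userland sketch (not kernel code) of why the patch above replaces the truthiness tests on `safe` with mask tests. The two flag values are copied from the diff; the helper names `old_wants_drain`, `new_wants_drain`, `postponed_stop_result` and `check_examples` are hypothetical and exist only for this illustration. As the diff reads, once sleepq_check_timeout() passes CS_MIGRBLOCK, a plain `if (safe)` would also fire for that call, so the drain path has to look at the CS_DRAIN bit alone.

```c
#include <assert.h>

/* Flag values copied from the patch to sys/sys/callout.h. */
#define CS_DRAIN     0x0001
#define CS_MIGRBLOCK 0x0002

/* Hypothetical helper: the old test, which treated `safe` as a boolean. */
int
old_wants_drain(int safe)
{
	return (safe != 0);
}

/* Hypothetical helper: the new test, which looks only at the CS_DRAIN bit. */
int
new_wants_drain(int safe)
{
	return ((safe & CS_DRAIN) != 0);
}

/*
 * Hypothetical helper: the value the patched "postponing stop" path
 * returns.  It used to be a constant 0.
 */
int
postponed_stop_result(int safe)
{
	return ((safe & CS_MIGRBLOCK) != 0);
}

/* Walk through the three argument values that appear in the diff. */
void
check_examples(void)
{
	/* callout_stop(): no flags, both tests agree. */
	assert(old_wants_drain(0) == 0);
	assert(new_wants_drain(0) == 0);

	/* callout_drain(): CS_DRAIN, both tests agree. */
	assert(old_wants_drain(CS_DRAIN) == 1);
	assert(new_wants_drain(CS_DRAIN) == 1);

	/* sleepq_check_timeout(): CS_MIGRBLOCK, the tests disagree. */
	assert(old_wants_drain(CS_MIGRBLOCK) == 1);
	assert(new_wants_drain(CS_MIGRBLOCK) == 0);

	/* Only a CS_MIGRBLOCK caller sees a nonzero postponed-stop result. */
	assert(postponed_stop_result(CS_MIGRBLOCK) == 1);
	assert(postponed_stop_result(CS_DRAIN) == 0);
}
```

Calling check_examples() from a small main() runs the assertions. The only case where the old and new drain tests differ is CS_MIGRBLOCK, which is the value the patched sleepq_check_timeout() passes.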

I'm not sure if I have the same or a different issue. According to top, my
process is stuck in the "STOP" state.

FreeBSD firewall.mikej.com 10.2-STABLE FreeBSD 10.2-STABLE #22 r289078M: Wed Dec  9 17:13:31 EST 2015     mikej@firewall.mikej.com:/usr/obj/usr/src/sys/GENERIC  amd64

42152 emby          2  20  -20   869M  1424K STOP    4 166:22   0.00% mono-sgen

root@firewall:/usr/ports/devel # procstat -kk 42152
   PID    TID COMM             TDNAME           KSTACK
42152 101501 mono-sgen        -                mi_switch+0xe1 thread_suspend_switch+0x170 thread_single+0x4e5 exit1+0xbe sigexit+0x925 postsig+0x286 ast+0x427 doreti_ast+0x1f
42152 101511 mono-sgen        -                mi_switch+0xe1 sleepq_timedwait_sig+0x8b _sleep+0x238 umtxq_sleep+0x125 do_wait+0x387 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x35d Xfast_syscall+0xfb
root@firewall:/usr/ports/devel #

kill -9 42152 has no effect.

I tried to stop the process with /usr/local/etc/rc.d/emby-server stop

emby-server-3.0.5821
mono-4.2.2.10

If this is a different issue, please let me know and I will open a separate
PR.

Thank you.

--mikej