[Bug 200493] Killing pid 11 (idle) wedges the disk IO

Fri May 29 14:32:41 UTC 2015

On Fri, 29 May 2015 bugzilla-noreply at freebsd.org wrote:

> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200493
>
> --- Comment #5 from Konstantin Belousov <kib at FreeBSD.org> ---
> (In reply to Edward Tomasz Napierala from comment #4)
> Great, you nailed it down.
>
> Look at the tdsigwakeup(), which is called from the main code to send a signal
> to a thread tdsendsignal():
>
>    /*
>     * Bring the priority of a thread up if we want it to get
>     * killed in this lifetime.
>     */
>    if (action == SIG_DFL && (prop & SA_KILL) && td->td_priority > PUSER)
>        sched_prio(td, PUSER);

That's a funny bug.

But why would the priority even be looked at for CPU idle threads?
idprio 31 idle threads have the same priority as CPU idle threads at
their unclobbered priorities.  The CPU idle threads should never run
in preference to others.

Bumping the priority to PUSER after a signal seems wrong for all sorts
of idle threads.  For non-timesharing threads, it is hard to get back
to the original priority.

I fixed priority initialization for device polling and noticed some
vaguely related bugs.  The idlepoll thread couldn't even get forward
to its correct priority.  (I don't believe in device polling, but it
once gave be lower network latency and I tried to recover that.)

This patch is for FreeBSD-9.

X diff -c2 kern_poll.c~ kern_poll.c
X *** kern_poll.c~	Wed Aug  6 20:13:13 2014
X --- kern_poll.c	Fri May  8 01:45:11 2015
X ***************
X *** 30,33 ****
X --- 30,34 ----
X 
X   #include "opt_device_polling.h"
X + #include "opt_sched.h"
X 
X   #include <sys/param.h>
X ***************
X *** 38,45 ****
X --- 39,48 ----
X   #include <sys/eventhandler.h>
X   #include <sys/resourcevar.h>
X + #include <sys/sched.h>
X   #include <sys/socket.h>			/* needed by net/if.h		*/
X   #include <sys/sockio.h>
X   #include <sys/sysctl.h>
X   #include <sys/syslog.h>
X + #include <sys/unistd.h>
X 
X   #include <net/if.h>			/* for IFF_* flags		*/
X ***************
X *** 534,555 ****
X 
X   static void
X ! poll_idle(void)
X   {
X - 	struct thread *td = curthread;
X - 	struct rtprio rtp;
X - 
X - 	rtp.prio = RTP_PRIO_MAX;	/* lowest priority */
X - 	rtp.type = RTP_PRIO_IDLE;
X - 	PROC_SLOCK(td->td_proc);
X - 	rtp_to_pri(&rtp, td);
X - 	PROC_SUNLOCK(td->td_proc);

This initialization was was not good originally, and has rotted.  It
depends rtp_to_pri() not only converting the rtp to td_priority, but
also on the side effect of setting the new priority.  This side effect
no longer works here.  rtp_to_prio() and its side effects are used for
syscalls too.  For syscalls, it is fragile but still works (perhaps
delayed).

The locking here has rotted.  rtp_to_pri() is no longer locked by the
spinlock that is aquired above.  It is locked by a non-spinlock
that is not acquired above.

X 
X   	for (;;) {
X   		if (poll_in_idle_loop && poll_handlers > 0) {
X   			idlepoll_sleeping = 0;
X   			ether_poll(poll_each_burst);
X ! 			thread_lock(td);
X   			mi_switch(SW_VOL, NULL);
X ! 			thread_unlock(td);
X   		} else {
X   			idlepoll_sleeping = 1;
X --- 537,559 ----
X 
X   static void
X ! poll_idle(void __unused *arg)
X   {
X 
X + #define td curthread
X + 	printf("poll_idle: pri %d upri %d baseupri %d lendupri %d\n",
X + 	    td->td_priority, td->td_user_pri,
X + 	    td->td_base_user_pri, td->td_lend_user_pri);
X + #undef td

Debugging code.  td_lend_user_pri seems to be incorrectly initialized.
I think it should equal the active priority (255 for idprio 31 threads
like this one).  But it is initialized to something like 120 (the
base user priority) somewhere.  This isn't a problem provided it is
never used.  But IIRC, when the above calls sched_prio() to try to
change its priority to 255, the priority sticks at the "lent" priority
of 120 although lending has never occurred.

X   	for (;;) {
X   		if (poll_in_idle_loop && poll_handlers > 0) {
X   			idlepoll_sleeping = 0;
X   			ether_poll(poll_each_burst);
X ! #ifdef PREEMPTION
X ! 			cpu_spinwait();
X ! #else
X ! 			thread_lock(curthread);
X   			mi_switch(SW_VOL, NULL);
X ! 			thread_unlock(curthread);
X ! #endif
X   		} else {
X   			idlepoll_sleeping = 1;

Spinloops should sleep a bit if possible to avoid wasting power.
With PREEMPTION, there is no need to yield, so we can sleep instead.
Without PREEMPTION, I think sleeping a bit wouldn't be useful since
most of the time is spent yielding.

X ***************
X *** 559,568 ****
X   }
X 
X ! static struct proc *idlepoll;
X ! static struct kproc_desc idlepoll_kp = {
X ! 	 "idlepoll",
X ! 	 poll_idle,
X ! 	 &idlepoll
X ! };
X ! SYSINIT(idlepoll, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, kproc_start,
X !     &idlepoll_kp);
X --- 563,602 ----
X   }
X 
X ! static void
X ! poll_idle_start(void __unused *arg)
X ! {
X ! 	struct proc *p;
X ! 	struct thread *td;
X ! 	int error;
X ! 
X ! 	error = kproc_create(poll_idle, NULL, &p, RFSTOPPED, 0, "idlepoll");

The fix is to copy this and the following code (except the debugging
parts) from the vm idlezero thread.  That thread is currently useless,
but it always did idle priority initialization better.

Idle priority thread initialization requires too much duplictation.
kproc_start (used in the buggy version) is not useful for this,
since sched_prio() at the start of the thread doesn't work and
TDF_NOLOAD inside a thread is much further from working.

X ! 	if (error)
X ! 		panic("poll_idle_start: error %d\n", error);
X ! 	td = FIRST_THREAD_IN_PROC(p);
X ! 	thread_lock(td);
X ! 	printf("poll_idle_start 0: pri %d upri %d baseupri %d lendupri %d\n",
X ! 	    td->td_priority, td->td_user_pri,
X ! 	    td->td_base_user_pri, td->td_lend_user_pri);

More debugging code for td->td_lend_user_pri.  I had this in
rtp_to_pri() and haven't looked at its output here.  In rtp_to_pri(),
it is called for syscalls.  The interaction of the 4 priorities
printed is confusing.

X ! 	/*
X ! 	 * XXX TDF_NOLOAD should be set automatically for kernel idle
X ! 	 * threads, and probably also for user idle threads, and possibly
X ! 	 * for many other kernel and user non-timesharing threads, since
X ! 	 * the load average is basically a SCHED_4BSD implementation
X ! 	 * detail for scheduling only timesharing threads.  Having to
X ! 	 * set TDF_NOLOAD for ourself is especially inconvenient since
X ! 	 * setting it in a running thread just breaks decrementing the
X ! 	 * load count when then thread stops running.
X ! 	 */
X ! 	td->td_flags |= TDF_NOLOAD;

This thread didn't set TDF_NOLOAD, so the load average was distorted.

X ! 	sched_class(td, PRI_IDLE);
X ! 	sched_prio(td, PRI_MAX_IDLE);
X ! 	printf("poll_idle_start 1: pri %d upri %d baseupri %d lendupri %d\n",
X ! 	    td->td_priority, td->td_user_pri,
X ! 	    td->td_base_user_pri, td->td_lend_user_pri);
X ! 	sched_add(td, SRQ_BORING);
X ! 	printf("poll_idle_start 2: pri %d upri %d baseupri %d lendupri %d\n",
X ! 	    td->td_priority, td->td_user_pri,
X ! 	    td->td_base_user_pri, td->td_lend_user_pri);
X ! 	thread_unlock(td);
X ! }
X ! SYSINIT(poll_idle, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, poll_idle_start, NULL);

Bruce