Found the issue! - SCHED_ULE+PREEMPTION is the problem
pmc at citylink.dinoex.sub.org
Tue Apr 10 19:13:22 UTC 2018
1. The tdq_ridx pointer
The perceived slow advance (of the tdq_ridx pointer into the circular
array) is correct behaviour. McKusick writes:
>The pointer is advanced once per system tick, although it may not
>advance on a tick until the currently selected queue is empty. Since
>each thread is given a maximum time slice and no threads may be added
>to the current position, the queue will drain in a bounded amount of
>time.
Therefore, it is also normal that the process (the piglet in this case)
runs until its time slice (aka quantum) is used up.
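To illustrate the quoted rule, here is a minimal toy model in Python. It is not the kernel code: the class name, the array size of 64 and the one-thread-per-tick loop are assumptions made only for this sketch. It shows the removal pointer staying on a slot until that slot has drained, then advancing one slot per tick.

```python
from collections import deque

NQUEUES = 64  # assumed array size, chosen only for illustration


class CircularRunq:
    """Toy model of the circular array; ridx plays the role of tdq_ridx."""

    def __init__(self):
        self.queues = [deque() for _ in range(NQUEUES)]
        self.ridx = 0

    def tick(self):
        # Advance at most one slot per tick, and only past an empty slot.
        if not self.queues[self.ridx]:
            self.ridx = (self.ridx + 1) % NQUEUES

    def run_one(self):
        # Take the next thread from the current slot, if there is one.
        q = self.queues[self.ridx]
        return q.popleft() if q else None


def demo():
    rq = CircularRunq()
    rq.queues[0].extend(["a", "b"])  # two threads waiting in slot 0
    positions = []
    for _ in range(4):
        rq.tick()
        positions.append(rq.ridx)
        rq.run_one()
    return positions
```

In this model demo() records the pointer at slot 0 for the first two ticks, while "a" and "b" are still queued there, and only then lets it move on.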
2. The influence of preempt_thresh
This can be found in tdq_runq_add(). A simplified description of the
logic there is as follows:
td_priority < 152 ? -> add to realtime-queue
td_priority <= 223 ? -> add to timeshare-queue
    if preempted
        circular_index = tdq_ridx
    else
        circular_index = tdq_idx + td_priority
else -> add to idle-queue
If the thread had been preempted, it is reinserted at the current
working position of the circular array, otherwise the position is
calculated from thread priority.
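The same decision can be written as a small Python function. This is a sketch of the simplified description only, not a transcription of tdq_runq_add(): the function name, the array size of 64 and the way the priority is scaled into the array range are assumptions for illustration. The two priority boundaries are the ones given above.

```python
PRI_MIN_TIMESHARE = 152  # boundaries as given in the description above
PRI_MAX_TIMESHARE = 223
NQUEUES = 64             # assumed size of the circular array


def choose_queue(td_priority, preempted, tdq_idx, tdq_ridx):
    """Return (queue_name, slot); slot is None outside the timeshare queue."""
    if td_priority < PRI_MIN_TIMESHARE:
        return ("realtime", None)
    if td_priority <= PRI_MAX_TIMESHARE:
        if preempted:
            # Preempted thread: back to the current working position.
            slot = tdq_ridx
        else:
            # Otherwise the slot depends on the priority (assumed scaling).
            span = PRI_MAX_TIMESHARE - PRI_MIN_TIMESHARE + 1
            offset = NQUEUES * (td_priority - PRI_MIN_TIMESHARE) // span
            slot = (tdq_idx + offset) % NQUEUES
        return ("timeshare", slot)
    return ("idle", None)
```

With tdq_idx = 5 and tdq_ridx = 3, a thread at priority 200 that was preempted lands in slot 3, that is, right where the scheduler is currently working, while the same thread added normally lands far ahead of that position.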
3. The quantum
Most of the task switches come from device interrupts. Those are running
at priority intr:8 or intr:12. So, as soon as preempt_thresh is 12 or
bigger, the piglet is almost always reinserted in the runqueue due to
preemption.
And, as we see, in that case we do not have a scheduling, we have a
Real scheduling happens only after the quantum is exhausted. Therefore,
reducing the quantum helps.
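A toy simulation in Python makes the effect of the time slice (quantum) visible. It is a sketch under loud assumptions, not a model of the real scheduler: one interrupt preempts the running thread on every tick, a preempted thread goes back to the front of the current slot, and only a thread whose slice is used up goes to the back. The function and thread names are hypothetical.

```python
from collections import deque


def ticks_until_other_runs(quantum):
    """Count ticks the hog keeps the CPU before the waiting thread runs."""
    runq = deque(["worker"])  # another thread waiting in the current slot
    running, used = "hog", 0
    ticks = 0
    while running == "hog":
        ticks += 1
        used += 1
        if used >= quantum:
            runq.append(running)      # slice used up: queue behind the others
        else:
            runq.appendleft(running)  # preempted: reinserted at the front
        running = runq.popleft()      # pick the next thread to run
    return ticks
```

Under these assumptions the waiting thread gets the CPU after exactly as many ticks as the slice allows, so a shorter slice directly shortens the wait.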
This behaviour was deliberately introduced in r171713.
In r220198 it was fixed, with a focus on CPU-hogs and single-CPU.
In r239157 the fix was undone due to performance considerations, with
the focus on rescheduling only at end of the time-slice.
The current defaults do not seem well suited to certain CPU-intensive
tasks. Possible solutions are any one of:
* not use SCHED_ULE
* not use preemption
* change kern.sched.quantum to a minimal value.