Found the issue! - SCHED_ULE+PREEMPTION is the problem

Tue Apr 10 19:13:22 UTC 2018

Results:

1. The tdq_ridx pointer

The seemingly slow advance of the tdq_ridx pointer through the circular
array is correct behaviour. McKusick writes:

>The pointer is advanced once per system tick, although it may not
>advance on a tick until the currently selected queue is empty. Since
>each thread is given a maximum time slice and no threads may be added
>to the current position, the queue will drain in a bounded amount of
>time.

Therefore, it is also normal that the process (the piglet in this case)
runs until its time slice (aka quantum) is used up.
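
As a rough illustration of the rule McKusick describes, here is a minimal
sketch in C. It models only the quoted sentence (advance at most one slot
per tick, and only when the current slot is empty); it is not the kernel's
code, and the names ring, ring_tick and RING_NQS as well as the slot count
of 64 are assumptions made up for this example.

```c
/* Number of slots in the circular array (assumed value for this sketch). */
#define RING_NQS 64

/* Hypothetical stand-in for the per-CPU timeshare run queue. */
struct ring {
    int ridx;             /* removal index, plays the role of tdq_ridx */
    int count[RING_NQS];  /* runnable threads waiting in each slot */
};

/*
 * Called once per system tick. The removal index moves on by one slot,
 * but only if the currently selected slot has drained; otherwise it
 * stays where it is until a later tick.
 */
static void
ring_tick(struct ring *r)
{
    if (r->count[r->ridx] == 0)
        r->ridx = (r->ridx + 1) % RING_NQS;
}
```

With a thread still queued in the current slot, repeated calls to
ring_tick() leave ridx unchanged, which is the "slow advance" observed.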

2. The influence of preempt_thresh

This can be found in tdq_runq_add(). A simplified description of the 
logic there is as follows:

td_priority <  152 ?    -> add to realtime-queue
td_priority <= 223 ?    -> add to timeshare-queue
    if preempted
        circular_index = tdq_ridx
    else
        circular_index = tdq_idx + td_priority
else                    -> add to idle-queue

If the thread had been preempted, it is reinserted at the current 
working position of the circular array, otherwise the position is 
calculated from thread priority.
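
The same decision can be written as a small self-contained C function. This
is a sketch under stated assumptions, not the body of tdq_runq_add(): the
sketch_* names and the slot count of 64 are invented for the example, the
thresholds 152 and 223 are the ones quoted above, and the way the priority
is scaled onto the circular array is an assumption of this sketch.

```c
#include <stdbool.h>

/* Assumed values for this sketch; the thresholds come from the text above. */
#define SKETCH_NQS        64   /* slots in the circular array (assumption) */
#define SKETCH_MIN_BATCH  152  /* lowest timeshare priority */
#define SKETCH_MAX_BATCH  223  /* highest timeshare priority */
#define SKETCH_RANGE      (SKETCH_MAX_BATCH - SKETCH_MIN_BATCH + 1)

enum sketch_queue { Q_REALTIME, Q_TIMESHARE, Q_IDLE };

/* Hypothetical stand-in for the two indices kept per CPU. */
struct sketch_tdq {
    int idx;   /* insertion index, plays the role of tdq_idx */
    int ridx;  /* removal index, plays the role of tdq_ridx */
};

struct sketch_slot {
    enum sketch_queue queue;
    int index;  /* slot in the circular array, -1 outside timeshare */
};

static struct sketch_slot
sketch_runq_add(const struct sketch_tdq *tdq, int pri, bool preempted)
{
    struct sketch_slot s = { Q_IDLE, -1 };

    if (pri < SKETCH_MIN_BATCH) {
        s.queue = Q_REALTIME;
    } else if (pri <= SKETCH_MAX_BATCH) {
        s.queue = Q_TIMESHARE;
        if (preempted) {
            /* Preempted: back to the slot currently being served. */
            s.index = tdq->ridx;
        } else {
            /* Otherwise: offset from the insertion index by priority. */
            int off = SKETCH_NQS * (pri - SKETCH_MIN_BATCH) / SKETCH_RANGE;
            s.index = (tdq->idx + off) % SKETCH_NQS;
        }
    }
    return s;
}
```

For the same priority, the result differs only in the slot: a preempted
thread lands at ridx and is picked again right away, while a thread that
was not preempted is queued further along the array.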

3. The quantum

Most of the task switches come from device interrupts. These run at
priority intr:8 or intr:12. So, as soon as preempt_thresh is 12 or
bigger, the piglet is almost always reinserted in the runqueue due to
preemption.
And, as we have seen, in that case no real scheduling takes place; the
thread is simply resumed!

A real scheduling decision happens only after the quantum is exhausted.
Therefore, reducing the quantum helps.

4. History

This behaviour was deliberately introduced in r171713.

In r220198 it was fixed, with a focus on CPU-hogs and single-CPU.

In r239157 the fix was undone due to performance considerations, with 
the focus on rescheduling only at end of the time-slice.

5. Conclusion

The current defaults do not seem well suited for certain CPU-intensive
tasks. Possible solutions are one of:
  * do not use SCHED_ULE
  * do not use preemption
  * set kern.sched.quantum to its minimal value.

P.