more data: SCHED_ULE+PREEMPTION is the problem (was: kern.sched.quantum: Creepy, sadistic scheduler

Sat Apr 7 15:13:43 UTC 2018

Hi all,
  in the meantime I did some tests and found the following:

A. The Problem:
---------------
On a single CPU, there are -exactly- two processes runnable:
One is doing mostly compute without I/O - this can be a compressing job 
or similar; in the tests I used simply an endless-loop. Lets call this 
the "piglet".
The other is doing frequent file reads, but also some compute interim -
this can be a backup job traversing the FS, or a postgres VACUUM, or 
some fast compressor like lz4. Lets call this the "worker".

It then happens that the piglet gets 99% CPU, while the worker gets only 
0.5% CPU and makes nearly no progress at all.

Investigations shows that the worker makes precisely one I/O per 
timeslice (timeslice as defined in kern.sched.quantum) - or two I/O on a 
mirrored ZFS.

B. Findings:
------------
1. Filesystem

I could never reproduce this when reading from plain UFS. Only when 
reading from ZFS (direct or via l2arc).

2. Machine

The problem originally appeared on a pentium3 at 1GHz. I was able to 
reproduce it on an i5-3570T, given the following measures:
  * config in BIOS to use only one CPU
  * reduce speed: "dev.cpu.0.freq=200"
I did see the problem also when running full speed (which means it 
happens there also), but could not reproduce it well.

3. kern.sched.preempt_thresh

I could make the problem disappear by changing kern.sched.preempt_thresh 
  from the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This 
seems to correspond to the disk interrupt threads, which run at intr:12 
(i5-3570T) or intr:8 (p3).

4. dynamic behaviour

Here the piglet is already running as PID=2119. Then we can watch the 
dynamic behaviour as follows (on i5-3570T at 200MHz):

a. with kern.sched.preempt_thresh=80

$ lz4 DATABASE_TEST_FILE /dev/null & while true; 

   do ps -o pid,pri,"%cpu",command -p 2119,$! 

   sleep 3 

done 

[1] 6073 

  PID PRI %CPU COMMAND 

6073  20  0.0 lz4 DATABASE_TEST_FILE /dev/null 

2119 100 91.0 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  76 15.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  95 74.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 19.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  94 71.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 16.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  95 76.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 14.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  96 80.0 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52 12.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  96 82.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  74 10.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  98 86.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52  8.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  98 89.0 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52  7.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  98 90.5 -bash (bash) 

  PID PRI %CPU COMMAND 

6073  52  6.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  99 91.5 -bash (bash)

b. with kern.sched.preempt_thresh=11

  PID PRI %CPU COMMAND 

4920  21  0.0 lz4 DATABASE_TEST_FILE /dev/null 

2119 101 93.5 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  78 20.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  94 70.5 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  82 34.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  88 54.0 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 42.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  86 45.0 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 43.5 lz4 DATABASE_TEST_FILE /dev/null 

2119  86 44.5 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 43.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  85 45.0 -bash (bash) 

  PID PRI %CPU COMMAND 

4920  85 43.0 lz4 DATABASE_TEST_FILE /dev/null 

2119  85 45.5 -bash (bash)

 From this we can see that in case b. both processes balance out nicely 
and meet at equal CPU shares.
Whereas in case a., after about 10 Seconds (the first 3 records) they 
move to opposite ends of the scale and stay there.

 From this I might suppose that here is some kind of mis-calculation or 
mis-adjustment of the task priorities happening.

P.