Heavy I/O blocks FreeBSD box for several seconds

Mon Aug 1 12:53:35 UTC 2011

I've looked at this issue some more and it appears to me that sched_pickcpu() is
never called for CPU-bound threads that use up their slices completely and never
do any voluntary switches.  A result of this is that the threads stay on their
initial CPU until re-balance mechanism kicks in.

There also seem to be some peculiarities that adversely affect the rebalance
algorithm and potentially other CPU selection/affinity code.

1.
For single-socket SMP systems our current sched topology code builds a topology
with a redundant layer.  Here is a real example for a single-socket dual-core
Core2 Duo system:
<groups>
 <group level="1" cache-level="0">
  <cpu count="2" mask="0x3">0, 1</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="2" mask="0x3">0, 1</cpu>
   </group>
  </children>
 </group>
</groups>
As we can see both layers contain exactly the same set of CPUs.
I will try to demonstrate below that having such a topology is not only redundant,
but also harmful.

2.
The algorithm for searching for CPUs with highest/lowest load within a CPU group
is biased towards CPUs with lower IDs.  E.g. if a CPU group contains CPUs with IDs
0..3 and all the CPUs have equal load, then CPU #0 is always selected.  This
happens as an obvious result of iterating from lower CPU IDs towards higher ones
and selecting lowest loaded CPU by "this CPU load" < "lowest load so far" condition.

It's also important to note that the algorithm above also takes into account a
structure of a group being searched.  E.g. when searching for a lowest loaded CPU
in a group that has two subgroups, then we actually search for a lowest loaded CPU
in a lowest loaded sub-group which may not be a CPU that has the lowest load overall.
Say, we have 4 CPUs in a group and the group is sub-divided into two sub-groups
with two CPUs in each.  Let CPU loads be:
5 1 | 2 3
where a number corresponds to a load of a CPU and its position corresponds to CPU
ID and the pipe symbol corresponds to a boundary between the sub-groups.
Then the algorithm would select CPU 2 with its load of 2 as the lowest loaded,
because its subgroup has a combined load of 2+3=5 which less than combined load of
the other group (5+1=6), which contains the CPU with the lowest overall load of 1.

Now let's assume the following configuration: 1 package x 2 cores x 2 HTT/SMT.
Let CPU IDs be 0 .. 3. And let there be 5 CPU-bound threads.
Our x86 topology code will create the following scheduling topology:
level 1, cachelevel 0 ("shared nothing"), CPUs 0 .. 3
 level 2, cache level 2 ("L2"), CPUs 0 .. 3
  level 3, cache level 1 ("L1"), CPUs 0, 1
  level 3, cache level 1 ("L1"), CPUs 2, 3

Let's assume initial load distribution of
1 1 | 1 2
where a number corresponds to a load of a CPU and its position corresponds to CPU
ID from 0 to 3 and the pipe symbol corresponds to a boundary between the L1
sharing groups.

Rebalancing algorithm first rebalances a top level group by moving some threads
from a highest loaded CPU to a lowest loaded CPU.  Then the procedure is repeated
for each of the child groups, then their child groups, etc, until leaf groups are
rebalanced.

So, when the rebalancing is done for the top shared nothing group we get:
2 1 | 1 1
this is because CPU 0 is found to have the lowest load (as described above) and
CPU 3 obviously has the highest load.

Then the rebalancing is done in the L2 group which contains all the same CPUs and
we get:
1 1 | 2 1
as described above, the "extra" thread is moved from the higher loaded sub-group
to the lower loaded sub-group.

Then the rebalanacing is done in each of the L1 groups.  In the first group the
load is already equal between CPUs and in the seconds group the load is moved from
one CPU to the other:
1 1 | 1 2

So in the end of the rebalancing we get exactly the same load distribution as at
the start.  The extra load is always on the same CPU.  I believe that this might
describe the problem that Steve observes.  But I am not quite sure of that without
additional objective data.
The thing is that the above reasoning demonstrates the load distribution only from
the point of view of per-CPU run-queues.  But it doesn't demonstrate how much of
CPU time each individual _thread_ gets as a result.

What would we have with the extra shared-nothing level?
Let's start with the same initial distribution:
1 1 | 1 2
After L2 rebalancing:
2 1 | 1 1
After L1 rebalancing:
1 2 | 1 1
What happens at the next rebalancing run?
After L2 rebalancing:
1 1 | 2 1
After L1 rebalancing:
1 1 | 1 2

So in this case the extra load oscillated between CPUs 1 and 3.  Better but still
not perfect.

Perhaps the things could be improved if we "tossed a coin" when comparing equal
loads.  Or maybe some load history could be taken into account.
But definitely we should not have CPU groups that contain exactly the same sets of
CPUs.
-- 
Andriy Gapon