Examining the VM splay tree effectiveness

Thu Sep 30 16:37:20 UTC 2010

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.

The VM system has two major structures:
  1) the VM map which is per process and manages its address space
     by tracking all regions that are allocated and their permissions.
  2) the VM page table which is per VM map and provides the backing
     memory pages to the regions and interfaces with the pmap layer
     (virtual->physical).

Both of these are very frequently accessed through memory allocations
from userspace and page faults from the pmap layer.  Their efficiency
and SMP scalability is crucial for many operations beginning with fork/
exec, going through malloc/mmap and ending with free/exit of a process.

Both the vmmap and page table make use of splay trees to manage the
entries and to speed up lookups compared to long to traverse linked
lists or more memory expensive hash tables.  Some structures though
do have an additional linked list to simplify ordered traversals.

A splay tree is an interesting binary search tree with insertion,
lookup and removal performed in O(log n) *amortized* time.  With
the *amortized* time being the crucial difference to other binary trees.
On every access *including* lookup it rotates the tree to make the
just found element the new root node.  For all gory details see:
  http://en.wikipedia.org/wiki/Splay_tree

This behavior has a few important implications:
  1) the root node and constant rotation works like a cache with
     least recent looked up node at the top and less recently ones
     close to the top;
  2) every lookup that doesn't match the root node, ie. a cache miss,
     causes a rotation of the tree to make the newly found node the new
     root;
  3) during the rotate it has to re-arrange and touch possibly a large
     number of other nodes;
  4) in worst case the tree may resemble a linked list.  A splay tree
     makes no guarantee that it is balanced.

For the kernel and the splay tree usage in the VM map and page table
some more implications come up:
  a) the amortization effect/cost only balance if the majority of
     lookups are root node (cache) hits, otherwise the cost of
     rebalancing destroys any advantage and quickly turns into a
     large overhead.
  b) the overhead becomes even worse due to touching many nodes and
     causing a lot of CPU cache line dirtying.  For processes with
     shared memory or threads across CPU's this overhead becomes
     almost excessive because the CPU's stomp on each others cached
     nodes and get their own caches invalidated.  The penalty is a
     high-latency memory refetch not only for the root node lookup
     but every hop in the following rebalancing operation => thrashing.

To quantify the positive and negative effects of the splay tree I
instrumented the code and added counters to track the behavior of
the splay tree in the vmmap and page table.

The test case is an AMD64 kernel build to get a first overview.
Other system usages may have different results depending on their
fork and memory usage patters.

The results are very interesting:

  The page table shows a *very* poor root node hit rate and an excessive
  amount of rotational node modifications representing almost the
  worst case for splay trees.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

  The vmmap shows a high root node hit rate but still a significant
  amount of rotational node modifications.

http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies

 From these observations I come to the conclusion that a splay tree
is suboptimal for the normal case and quickly horrible even on small
deviations from the normal case in the vmmap.  For the page table it
is always horrible.  Not to mention the locking complications due to
modifying the tree on lookups.

Based on an examination of other binary trees I put forward the
hypothesis that a Red-Black tree is a much better, if not the optimal,
fit for the vmmap and page table.  RB trees are balanced binary trees
and O(log n) in all cases.  The big advantage in this context is that
lookups are pure reads and do not cause CPU cache invalidations on
other CPU's and always only require a read lock without the worst
case properties of the unbalanced splay tree.  The high cache locality
of the vmmap lookups can be used with the RB tree as well by simply
adding a pointer to the least recently found node.  To prevent write 
locking this can be done lazily.  More profiling is required to make
a non-speculative statement on this though.  In addition a few of
the additional linked lists that currently accompany the vmmap and
page structures are no longer necessary as they easily can be done
with standard RB tree accessors.  Our standard userspace jemalloc
also uses RB trees for its internal housekeeping.  RB tree details:
  http://en.wikipedia.org/wiki/Red-black_tree

I say hypothesis because I haven't measured the difference to an
RB tree implementation yet.  I've hacked up a crude and somewhat
mechanical patch to convert the vmmap and page VM structures to
use RB trees, the vmmap part is not stable yet.  The page part
seems to work fine though.

This is what I've hacked together so far:
  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff

The diffs are still in their early stages and do not make use of
any code simplifications becoming possible by using RB trees instead
of the current splay trees.

Comments on the VM issue and splay vs. RB tree hypothesis welcome.

-- 
Andre