Examining the VM splay tree effectiveness

Thu Sep 30 17:41:55 UTC 2010

are you aware of Summer of Code 2008 project by Mayur Shardul?

quoting: http://www.freebsd.org/projects/summerofcode-2008.html

Project: VM Algorithm Improvement
Student: Mayur Shardul
Mentor: Jeff Roberson
Summary:
A new data structure, viz. radix tree, was implemented and used for management of the resident pages. The
objective is efficient use of memory and faster performance. The biggest challenge was to service insert requests
on the data structure without blocking. Because of this constraint the memory allocation failures were not
acceptable, to solve the problem the required memory was allocated at the boot time. Both the data structures were
used in parallel to check the correctness and we also benchmarked the data structures and found that radix trees
gave much better performance over splay trees.

Ready to enter CVS/SVN: We will investigate some more approaches to handle allocation failures before the new data
structure goes in CVS.

On Thu, Sep 30, 2010 at 06:37:38PM +0200, Andre Oppermann wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach if I read the history correctly) in
> the FreeBSD VM system suspecting an interesting journey.
> 
> The VM system has two major structures:
>  1) the VM map which is per process and manages its address space
>     by tracking all regions that are allocated and their permissions.
>  2) the VM page table which is per VM map and provides the backing
>     memory pages to the regions and interfaces with the pmap layer
>     (virtual->physical).
> 
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability is crucial for many operations beginning with fork/
> exec, going through malloc/mmap and ending with free/exit of a process.
> 
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long to traverse linked
> lists or more memory expensive hash tables.  Some structures though
> do have an additional linked list to simplify ordered traversals.
> 
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time.  With
> the *amortized* time being the crucial difference to other binary trees.
> On every access *including* lookup it rotates the tree to make the
> just found element the new root node.  For all gory details see:
>  http://en.wikipedia.org/wiki/Splay_tree
> 
> This behavior has a few important implications:
>  1) the root node and constant rotation works like a cache with
>     least recent looked up node at the top and less recently ones
>     close to the top;
>  2) every lookup that doesn't match the root node, ie. a cache miss,
>     causes a rotation of the tree to make the newly found node the new
>     root;
>  3) during the rotate it has to re-arrange and touch possibly a large
>     number of other nodes;
>  4) in worst case the tree may resemble a linked list.  A splay tree
>     makes no guarantee that it is balanced.
> 
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balance if the majority of
>     lookups are root node (cache) hits, otherwise the cost of
>     rebalancing destroys any advantage and quickly turns into a
>     large overhead.
>  b) the overhead becomes even worse due to touching many nodes and
>     causing a lot of CPU cache line dirtying.  For processes with
>     shared memory or threads across CPU's this overhead becomes
>     almost excessive because the CPU's stomp on each others cached
>     nodes and get their own caches invalidated.  The penalty is a
>     high-latency memory refetch not only for the root node lookup
>     but every hop in the following rebalancing operation => thrashing.
> 
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay tree in the vmmap and page table.
> 
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patters.
> 
> The results are very interesting:
> 
>  The page table shows a *very* poor root node hit rate and an excessive
>  amount of rotational node modifications representing almost the
>  worst case for splay trees.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly horrible even on small
> deviations from the normal case in the vmmap.  For the page table it
> is always horrible.  Not to mention the locking complications due to
> modifying the tree on lookups.
> 
> Based on an examination of other binary trees I put forward the
> hypothesis that a Red-Black tree is a much better, if not the optimal,
> fit for the vmmap and page table.  RB trees are balanced binary trees
> and O(log n) in all cases.  The big advantage in this context is that
> lookups are pure reads and do not cause CPU cache invalidations on
> other CPU's and always only require a read lock without the worst
> case properties of the unbalanced splay tree.  The high cache locality
> of the vmmap lookups can be used with the RB tree as well by simply
> adding a pointer to the least recently found node.  To prevent write 
> locking this can be done lazily.  More profiling is required to make
> a non-speculative statement on this though.  In addition a few of
> the additional linked lists that currently accompany the vmmap and
> page structures are no longer necessary as they easily can be done
> with standard RB tree accessors.  Our standard userspace jemalloc
> also uses RB trees for its internal housekeeping.  RB tree details:
>  http://en.wikipedia.org/wiki/Red-black_tree
> 
> I say hypothesis because I haven't measured the difference to an
> RB tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to
> use RB trees, the vmmap part is not stable yet.  The page part
> seems to work fine though.
> 
> This is what I've hacked together so far:
>  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
> 
> The diffs are still in their early stages and do not make use of
> any code simplifications becoming possible by using RB trees instead
> of the current splay trees.
> 
> Comments on the VM issue and splay vs. RB tree hypothesis welcome.
> 
> -- 
> Andre
> _______________________________________________
> freebsd-current at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe at freebsd.org"