Examining the VM splay tree effectiveness

Andre Oppermann andre at freebsd.org
Fri Oct 1 08:53:49 UTC 2010


On 30.09.2010 19:51, Ivan Voras wrote:
> On 09/30/10 18:37, Andre Oppermann wrote:
>
>> Both the vmmap and page table make use of splay trees to manage the
>> entries and to speed up lookups compared to long-to-traverse linked
>> lists or more memory-expensive hash tables.  Some structures, though,
>> do have an additional linked list to simplify ordered traversals.
>
> The property of splay trees requiring *writes* for nearly every read
> really is a thorn in the side for SMP.  It seems to me that even if the
> immediate benefits of converting to something else are not directly
> observable, it will still be worth doing.

Fully agreed.
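
To make the problem concrete, here is a rough sketch of a bottom-up
splay lookup (not the kernel's SPLAY() macros from <sys/tree.h>, just
a generic illustration with made-up names):

#include <stddef.h>

struct node {
	struct node	*left, *right, *parent;
	int		 key;
};

static void
rotate(struct node **root, struct node *x)
{
	struct node *p = x->parent, *g = p->parent;

	if (p->left == x) {			/* right rotation */
		p->left = x->right;
		if (x->right != NULL)
			x->right->parent = p;
		x->right = p;
	} else {				/* left rotation */
		p->right = x->left;
		if (x->left != NULL)
			x->left->parent = p;
		x->left = p;
	}
	p->parent = x;
	x->parent = g;
	if (g == NULL)
		*root = x;
	else if (g->left == p)
		g->left = x;
	else
		g->right = x;
}

static void
splay(struct node **root, struct node *x)
{
	while (x->parent != NULL) {
		struct node *p = x->parent, *g = p->parent;

		if (g == NULL)					/* zig */
			rotate(root, x);
		else if ((g->left == p) == (p->left == x)) {	/* zig-zig */
			rotate(root, p);
			rotate(root, x);
		} else {					/* zig-zag */
			rotate(root, x);
			rotate(root, x);
		}
	}
}

/*
 * Even a pure lookup mutates the tree: the node that was found is
 * splayed to the root (a real implementation splays on a miss too).
 */
static struct node *
splay_lookup(struct node **root, int key)
{
	struct node *n = *root;

	while (n != NULL && n->key != key)
		n = (key < n->key) ? n->left : n->right;
	if (n != NULL)
		splay(root, n);
	return (n);
}

Every hit ends in splay(), i.e. a chain of pointer writes all the way
up to the root, so even "read-only" consumers have to take the lock
exclusively and keep bouncing the affected cache lines around.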

> It's a shame that RCU is still a patent minefield :/
>
> http://mirror.leaseweb.com/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf

I'm not convinced that RCU is the only scalable way of sharing a
data structure across a possibly large number of CPUs.

The term "lockless" is often used and frequently misunderstood.
Having a lockless data structure *doesn't* mean that it is
performant, scalable, or both.  That heavily depends on a number
of additional factors.  Many times "lockless" just replaces a
simple lock/unlock cycle with a number of atomic operations on
the data structure.  This can easily backfire because an atomic
operation merely hides the computational complexity and also
dirties the CPU's cache lines.  On cache-coherent architectures
almost all of an atomic operation is done in hardware, with bus
lock cycles, bus snooping and whatnot.  While it looks simple from
the programmer's point of view, the overhead and latency are still
there.
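
As an (entirely made-up) example, a "lockless" LIFO push written
with the atomic(9) primitives instead of a mutex still ends up doing
an atomic read-modify-write on the shared list head for every single
insert (memory-ordering refinements, i.e. the _acq/_rel variants,
are left out here):

#include <sys/types.h>
#include <machine/atomic.h>

struct elem {
	struct elem	*next;
};

static struct elem *lifo_head;

static void
lifo_push(struct elem *e)
{
	struct elem *old;

	do {
		old = lifo_head;
		e->next = old;
		/*
		 * The cmpset is where the bus locking and cache line
		 * dirtying that the mutex used to do now happens.
		 */
	} while (atomic_cmpset_ptr((volatile uintptr_t *)&lifo_head,
	    (uintptr_t)old, (uintptr_t)e) == 0);
}

The mutex is gone, but under contention the head's cache line still
bounces from CPU to CPU exactly as before.
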
Needless to say, on architectures with more relaxed memory ordering
(SPARC, Alpha, ...) the overhead is higher.  Also important in the
overall picture are the memory barrier semantics of locks.  Some
newer CPUs have started to provide hardware-implemented lock
managers and to combine them with SMT features, so that access to
an already held lock causes an immediate HW thread switch, and the
unlock a switch back.

We also have rm_locks (read-mostly locks) that do not require
synchronization on read but have a more expensive write lock.  In
UMA we use a mix of global pools of elements with per-CPU caching
of free elements.
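
For reference, the rmlock(9) usage pattern looks roughly like this
(the foo_* names are made up):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

static struct rmlock foo_rm;

static void
foo_init(void)
{

	rm_init(&foo_rm, "foo");
}

/*
 * Readers only record themselves in per-CPU state; in the common
 * case no shared cache line is written.
 */
static void
foo_lookup(void)
{
	struct rm_priotracker tracker;

	rm_rlock(&foo_rm, &tracker);
	/* ... read-only traversal ... */
	rm_runlock(&foo_rm, &tracker);
}

/*
 * Writers pay the price: they have to synchronize with the readers
 * on all CPUs before they may proceed.
 */
static void
foo_modify(void)
{

	rm_wlock(&foo_rm);
	/* ... modify ... */
	rm_wunlock(&foo_rm);
}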

As always, the best approach depends on the dominant access pattern
of the structure.  It all comes down to the amortized cost of the
different locking or "lockless" strategies.  It's also important
to make sure that worst-case behavior doesn't bring down the box.

As a rule of thumb I use this:

  a) make sure the lock is held for only a small amount of time
     to avoid lock contention.
  b) do everything you can outside of the lock (see the example
     further below).
  c) if the lock is found to be heavily contended rethink the
     whole approach and check if other data structures can be used.
  d) minimize write accesses to memory in the lock protected
     shared data structure.
  e) PROFILE, DON'T SPECULATE! Measure the access pattern and
     measure the locking/data access strategy's cost in terms
     of CPU cycles consumed.

  f) on lookup-heavy data structures avoid writing to memory and
     thereby dirtying CPU caches.
  g) on modify-heavy data structures avoid touching too many
     elements.
  h) on data structures that are both lookup- and modify-heavy and
     are used across many CPUs all bets are off; a different data
     structure approach should be considered, ideally one that ends
     up in case f).
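
To give a trivial (made-up) example of a), b) and d): allocate and
initialize the new element before taking the lock, so that the
critical section is reduced to the list manipulation itself:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/mutex.h>
#include <sys/queue.h>

static MALLOC_DEFINE(M_FOO, "foo", "foo list elements");

struct foo {
	LIST_ENTRY(foo)	f_link;
	int		f_key;
};

static LIST_HEAD(, foo) foo_list = LIST_HEAD_INITIALIZER(foo_list);
static struct mtx foo_mtx;
MTX_SYSINIT(foo_mtx, &foo_mtx, "foo list", MTX_DEF);

void
foo_insert(int key)
{
	struct foo *f;

	/* b) Allocation and initialization happen outside the lock. */
	f = malloc(sizeof(*f), M_FOO, M_WAITOK | M_ZERO);
	f->f_key = key;

	/* a), d) The lock covers nothing but the list insertion. */
	mtx_lock(&foo_mtx);
	LIST_INSERT_HEAD(&foo_list, f, f_link);
	mtx_unlock(&foo_mtx);
}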

It all starts with the hypothesis that a data structure is not
optimally locked.

-- 
Andre

