New malloc ready, take 42

Fri Dec 23 01:31:21 PST 2005

Jason Evans wrote:

> On 28 November, I announced a new malloc implementation that I've  been 
> working on as a replacement for the current libc malloc.  The  main goal 
> for this rewrite of malloc is to provide better scalability  for 
> multi-threaded programs on SMP systems, without degrading  performance 
> in general.  The benchmarks I've run indicate that the  code now meets 
> this goal.
> 
> ===
> === Resolved issues ===
> ===
> 
> Several people provided valuable feedback regarding architecture, (in) 
> correct function, and performance.  At this point, I think I have  
> addressed all outstanding issues.  Here's a recap:
> 
> * Hiten Pandya and others requested that I used the sys/tree.h  
> implementation of red-black trees rather than my own.  This is done.
> 
> * Jon Dama objected to leaving sbrk() support out of the malloc  
> implementation for 32-bit architectures.  I've added support for sbrk () 
> on i386 and arm.
> 
> * David Xu pointed out that the hand-rolled spinlocks did not do  
> priority inheritance, which could lead to priority inversion  
> deadlocks.  This has been fixed by using the libc spinlocks.
> 
> * Kris Kennaway discovered that sparc64 doesn't have TLS, and Olivier  
> Houchard contacted me about the same issue for arm.  I removed the  use 
> of TLS on those architectures.
> 
> * Claus Guttesen and Devon O'Dell reported issues in kldxref and X on  
> amd64.  The kldxref problem has been fixed.  The X problem is a bit  
> trickier.  I borrowed a Turion-based laptop from Rob Braun and indeed  
> found that X wouldn't start.  Eric Anholt suggested trying the xorg- 
> server-snap port rather than xorg-server, and the problem went away.   
> Eric explained that xorg-server uses custom ELF *.a module loading  (the 
> idea being that the same exact module could be used on different  
> operating systems), whereas xorg-server-snap uses dlopen(3) and  
> friends.  Apparently, we are on the cusp of the transition to using  the 
> latter, so for now the workaround on amd64 is to use xorg-server- snap, 
> with the expectation that soon the standard xorg-server port  will be 
> upgraded.
> 
> ===
> === Benchmarking ad nauseam ===
> ===
> 
> ==> Lever & Boreham benchmarks (multi-threaded)
> 
> Chuck Lever and David Boreham published "malloc() Performance in a  
> Multithreaded Linux Environment" in the proceedings of the FREENIX  
> track of the 2000 USENIX Annual Technical Conference.  By some  
> incredible stroke of luck, I found the source code for their  
> benchmarks, and Kris Kennaway ran the first of them on a 14-CPU  sparc64 
> system as well as a dual-dual opteron system.  (The other two  
> benchmarks aren't very enlightening.)  It's important to keep in mind  
> that this is a micro-benchmark that demonstrates a worst case  scenario 
> for non-SMP-scalable malloc implementations.  That said,  with 5 
> threads, jemalloc outperformed phkmalloc by 15X on sparc64 and  80X on 
> amd64.
> 
> ==> super-smack (multi-threaded)
> 
> The SMP scalability tests were very encouraging, but Kris also ran  some 
> super-smack benchmarks on a dual-dual opteron system, and  jemalloc was 
> actually a bit slower than phkmalloc in that case  (~2.7-3.7% slower, 
> depending on jemalloc configuration).  I've been  obsessing over that 
> for over a week now, and one of the results is  that jemalloc is now 
> more tunable -- delayed coalescence cache size,  number of allocation 
> arenas, and allocation quantum size are all  tunable via MALLOC_OPTIONS 
> now.
> 
> However, changing the tuning parameters wasn't enough to make up the  
> performance difference for the super-smack benchmarks.  So, I spent  
> some time looking at memory fragmentation.  Taking a cue from Poul- 
> Henning Kamp's malloc work, I finally wrote a program that converts  
> ktrace output to memory usage graphs.  Here's a series of of graphs  
> from various stages of the fragmentation avoidance changes I made.   The 
> test program was 'kldxref /boot/kernel' on amd64.
> 
>     http://people.freebsd.org/~jasone/jemalloc/plots/kldxref.png
>     http://people.freebsd.org/~jasone/jemalloc/plots/kldxref13.gif
>     http://people.freebsd.org/~jasone/jemalloc/plots/kldxref15.gif
> 
> The interesting thing to note is that in the first two plots, the  
> highly fragmented portion of the graph continuously grows, wheareas  in 
> the last plot, the fragmented portion stops growing about 1/3 of  the 
> way through the program run.  (The first plot only shows the  bottom 
> half of what the other two plots show.)  This is all really  cool, but 
> alas, it didn't help the super-smack results.  I was able  to use the 
> graphing program to determine that cache line sharing was  potentially 
> causing even more issues than suspected, but we already  knew that cache 
> line sharing was an issue as a result of some of the  runs that Kris did.
> 
> Finally, last night I discovered that overzealous use of __inline is  
> likely to have been the culprit.  I removed the __inline keyword from  
> all of the functions that aren't part of the critical path for small  
> memory allocation/deallocation, and measured a substantial  performance 
> improvement.
> 
> I don't have access to the machine that Kris ran the super-smack  
> benchmarks on, so I set up a couple of VMware guest systems using  
> VMware 5.5 on a 3.2 GHz P4 system (HTT enabled for the host, but not  
> the guests).  The two guests were configured very similarly, so that  
> the only pertinent difference was that one was using phkmalloc, and  the 
> other was using jemalloc.  Before the __inline changes, jemalloc  was 
> ~2.5% slower than phkmalloc.  Here are the final results, which  show 
> jemalloc to be ~4.1% faster than phkmalloc for the benchmark:
> 
> x ss_phk
> + ss_je
> +----------------------------------------------------------------------- 
> ---+
> |xxx  x       x     x  x x       x  +      x       +  ++        ++ + + 
> +   +|
> |  |_____________A__M__________|                 | 
> ___________A___M______|  |
> +----------------------------------------------------------------------- 
> ---+
>     N           Min           Max        Median           Avg         
> Stddev
> x  10       1594.94       1658.64       1623.73      1619.431      
> 21.833342
> +  10       1649.06       1706.75       1693.23      1686.275      
> 17.410025
> Difference at 95.0% confidence
>         66.844 +/- 18.5532
>         4.12762% +/- 1.14566%
>         (Student's t, pooled s = 19.7459)
> 
> The numbers reported are queries/second, so bigger is better.
> 
> I used mysql 5.0.16, super-smack 1.3, libthr, and  MALLOC_OPTIONS='aj', 
> with 10 replicates of the following command for  each VMware guest:
> 
>     super-smack -d mysql select-key.smack 4 10000
> 
> It's worth noting that, as near as I can tell, almost all memory  
> allocation happens in a single mysqld thread.  Presumably, a master  
> thread is handing buffers to slaves, which process the buffers, then  
> free them.  As such, this test doesn't benefit from jemalloc's  multiple 
> arenas, so it is not a very good test of SMP scalability, at  least with 
> regard to malloc.
> 
> ==> cfrac (single-threaded)
> 
> cfrac is one of many benchmark programs that is included in the Hoard  
> memory allocator source distribution (see http://www.cs.umass.edu/ 
> ~emery/hoard/).  I ran cfrac as follows:
> 
>     cfrac 47582602774358115722167492755475367767
> 
> (Same VMware setup as for super-smack benchmarks.)
> 
>                 Wall time (seconds)
> phkmalloc 'aj': 18.75, 18.68, 18.69
> jemalloc 'aj':  17.57, 17.65, 17.56
> 
> I haven't looked at allocation traces of cfrac in quite a while, but  
> IIRC, it does a lot of small allocations of various sizes.
> 
> ==> sh6bench (single-threaded)
> 
> sh6bench (see http://www.microquill.com/smartheap/shbench/bench.zip)  is 
> a quirky malloc benchmark that has been used in some of the  published 
> malloc literature, so I include it here.  sh6bench does  cycles of 
> allocating groups of objects, where the size of objects in  each group 
> is random.  Sometimes the objects are held for a while  before being 
> freed.   I ran sh6bench with the following interactive  runtime parameters:
> 
> call count: 50000
> min block size: 1
> max block size: 1000
> 
> phkmalloc doesn't fare very well with its default settings since the  
> benchmark's memory usage fluctuates enough to cause phkmalloc to  
> repeatedly allocate and free pages.  Therefore I report numbers for  
> multiple MALLOC_OPTIONS settings.  Since the program prompts for  
> interactive input, I report the sum of user and sys time, rather than  
> wall time.  (Same VMware setup as for super-smack benchmarks.)
> 
>                     user+sys time (seconds)
> phkmalloc 'aj':     25.66, 25.78, 22.50
> phkmalloc 'aj>>>>': 13.72, 13.77, 13.69
> jemalloc 'aj':      17.88, 17.05, 17.10
> 
> If phkmalloc's cache size is increased adequately, it beats  jemalloc.  
> jemalloc simply has to do more work when splitting and  coalescing 
> regions than phkmalloc does, and this benchmark severely  stresses that 
> aspect of jemalloc.  It's perhaps worth noting that  jemalloc's peak 
> memory usage is ~22% lower than phkmalloc's, which  means that it's 
> doing a better job of avoiding fragmentation.  At  least the extra 
> algorithmic overhead gains us something.
> 
> ==> gs (single-threaded)
> 
> Ghostscript is the well known PostScript interpreter.  I took the  
> biggest PDF I have (the PostScript Lanuguage Reference, 3rd Ed.) and  
> converted it to PostScript, then ran AFPL Ghostscript 8.53 as follows:
> 
>     gs -dBATCH -dNODISPLAY PS3.ps
> 
> As with sh6bench, memory usage fluctuated enough to cause excessive  
> allocation/deallocation of pages with phkmalloc, so I report times  for 
> multiple MALLOC_OPTIONS settings.  (Same VMware setup as for  
> super-smack benchmarks.)
> 
>                     Wall time (seconds)
> phkmalloc 'aj':     92.12, 92.03, 92.90
> phkmalloc 'aj>>>>': 61.11, 61.29, 61.73
> jemalloc  'aj':     62.34, 62.43, 62.20
> jemalloc  'ajQQQc': 60.85, 60.48, 60.81
> 
> Okay, the 'ajQQQc' parameterization for jemalloc is a bit silly, but  if 
> phkmalloc gets help, then so should jemalloc. =)
> 
> ===
> === Proposed approach for inclusion in CURRENT ===
> ===
> 
> Here's a current version of all the changes that are necessary for  
> jemalloc to go in to the tree:
> 
>     http://people.freebsd.org/~jasone/jemalloc/patches/ 
> jemalloc_20051222c.diff
> 
> This patch includes (roughly):
> 
> 1) Fix gensnmptree bug (missing variable initialization).  I emailed  
> the maintainer, Hartmut Brandt, about this today, and am awaiting a  reply.
> 
> 2) Fix kldxref bug (assumes allocation is adequately aligned).  This  
> needs to be committed along with (5), since the fix uses  posix_memalign().
> 
> 3) Move calloc() from calloc.c to malloc.c (enables better statistics  
> gathering).
> 
> 4) Add _malloc_{pre,post}fork(), for use by threading libraries.
> 
> 5) Replace malloc(), calloc(), realloc(), and free().  Add  
> posix_memalign().
> 
> I'd like to commit (3) and (4) first, so that we have a version of  
> phkmalloc in the cvs repository that is API-compatible with libc as  it 
> will be after jemalloc is committed.  This will make it easier to  swap 
> the phkmalloc code in, should we wish to do further benchmarking  or 
> regression testing at some point in the future.  Poul-Henning has  
> agreed with this in principle, though I haven't yet supplied him with  
> diffs for only (3) and (4).
> 
> So, how about it?  Is jemalloc ready to go in now?
> 
> Thanks,
> Jason

I think that this is overall a good plan.  Two things need to be
stressed, since they will quickly become FAQs.  First is that phkmalloc
and jemalloc won't be interchangable at runtime, like the threading
libraries are.  Second is that amd64 will be stuck without working X
until the 6.9/7.0 stuff gets in the tree, and people should be well
aware of this.

Another thing that I worry about is complex library scenarios where you
might have different versions of libc getting pulled into the same
process by different library dependencies.  This could turn into a big
headache that is only solvable by telling people to wipe out their
entire ports collection and rebuild from scratch.  Does this warrant a
library version bump now (as much as I really don't want to advocate
this)?

Scott