New malloc ready, take 42

Jason Evans
Fri Dec 23 01:14:16 PST 2005

On 28 November, I announced a new malloc implementation that I've  
been working on as a replacement for the current libc malloc.  The  
main goal for this rewrite of malloc is to provide better scalability  
for multi-threaded programs on SMP systems, without degrading  
performance in general.  The benchmarks I've run indicate that the  
code now meets this goal.

=== Resolved issues ===

Several people provided valuable feedback regarding architecture,
(in)correct function, and performance.  At this point, I think I have
addressed all outstanding issues.  Here's a recap:

* Hiten Pandya and others requested that I use the sys/tree.h
implementation of red-black trees rather than my own.  This is done.
(A sketch of the tree(3) usage appears after this list.)

* Jon Dama objected to leaving sbrk() support out of the malloc
implementation for 32-bit architectures.  I've added support for
sbrk() on i386 and arm.

* David Xu pointed out that the hand-rolled spinlocks did not do  
priority inheritance, which could lead to priority inversion  
deadlocks.  This has been fixed by using the libc spinlocks.

* Kris Kennaway discovered that sparc64 doesn't have TLS, and Olivier  
Houchard contacted me about the same issue for arm.  I removed the  
use of TLS on those architectures.

* Claus Guttesen and Devon O'Dell reported issues in kldxref and X on
amd64.  The kldxref problem has been fixed.  The X problem is a bit
trickier.  I borrowed a Turion-based laptop from Rob Braun and indeed
found that X wouldn't start.  Eric Anholt suggested trying the
xorg-server-snap port rather than xorg-server, and the problem went
away.  Eric explained that xorg-server uses custom ELF *.a module
loading (the idea being that the same exact module could be used on
different operating systems), whereas xorg-server-snap uses dlopen(3)
and friends.  Apparently, we are on the cusp of the transition to
using the latter, so for now the workaround on amd64 is to use
xorg-server-snap, with the expectation that soon the standard
xorg-server port will be upgraded.
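
As an aside, the tree(3) macros get used roughly as follows.  This is
just a sketch with illustrative structure and field names, not the
actual jemalloc declarations:

#include <sys/tree.h>
#include <stdlib.h>

/* A tree of regions keyed by address (names are illustrative). */
struct region {
	RB_ENTRY(region) link;		/* red-black tree linkage */
	void		*addr;
	size_t		 size;
};

static int
region_cmp(struct region *a, struct region *b)
{
	if (a->addr < b->addr)
		return (-1);
	return (a->addr > b->addr);
}

RB_HEAD(region_tree, region);
RB_PROTOTYPE(region_tree, region, link, region_cmp);
RB_GENERATE(region_tree, region, link, region_cmp)

static struct region_tree regions = RB_INITIALIZER(&regions);

static void
region_insert(struct region *reg)
{
	RB_INSERT(region_tree, &regions, reg);
}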

=== Benchmarking ad nauseam ===

==> Lever & Boreham benchmarks (multi-threaded)

Chuck Lever and David Boreham published "malloc() Performance in a  
Multithreaded Linux Environment" in the proceedings of the FREENIX  
track of the 2000 USENIX Annual Technical Conference.  By some  
incredible stroke of luck, I found the source code for their  
benchmarks, and Kris Kennaway ran the first of them on a 14-CPU  
sparc64 system as well as a dual-dual opteron system.  (The other two  
benchmarks aren't very enlightening.)  It's important to keep in mind  
that this is a micro-benchmark that demonstrates a worst case  
scenario for non-SMP-scalable malloc implementations.  That said,  
with 5 threads, jemalloc outperformed phkmalloc by 15X on sparc64 and  
80X on amd64.

==> super-smack (multi-threaded)

The SMP scalability tests were very encouraging, but Kris also ran  
some super-smack benchmarks on a dual-dual opteron system, and  
jemalloc was actually a bit slower than phkmalloc in that case  
(~2.7-3.7% slower, depending on jemalloc configuration).  I've been  
obsessing over that for over a week now, and one of the results is  
that jemalloc is now more tunable -- delayed coalescence cache size,  
number of allocation arenas, and allocation quantum size are all  
tunable via MALLOC_OPTIONS now.
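
(For reference, here is how the options get set; ./myprog is just a
stand-in for whatever is being run, and the /etc/malloc.conf symlink
is the mechanism documented in malloc(3):)

	env MALLOC_OPTIONS='aj' ./myprog
	ln -s 'aj' /etc/malloc.conf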

However, changing the tuning parameters wasn't enough to make up the
performance difference for the super-smack benchmarks.  So, I spent
some time looking at memory fragmentation.  Taking a cue from
Poul-Henning Kamp's malloc work, I finally wrote a program that
converts ktrace output to memory usage graphs.  Here's a series of
graphs from various stages of the fragmentation avoidance changes I
made.  The test program was 'kldxref /boot/kernel' on amd64.

The interesting thing to note is that in the first two plots, the
highly fragmented portion of the graph continuously grows, whereas in
the last plot, the fragmented portion stops growing about 1/3 of the
way through the program run.  (The first plot only shows the bottom
half of what the other two plots show.)  This is all really cool,
but alas, it didn't help the super-smack results.  I was able to use
the graphing program to determine that cache line sharing was
potentially causing even more issues than suspected, but we already
knew that cache line sharing was an issue as a result of some of the
runs that Kris did.
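
(For those unfamiliar, cache line sharing is the classic false
sharing problem.  A generic illustration follows, not jemalloc's
actual data structures; the 64 byte line size is an assumption:)

#define CACHELINE	64	/* assumed cache line size */

/*
 * Bad: two counters written by different threads sit in the same
 * cache line, so every write ping-pongs the line between CPUs.
 */
struct counters_shared {
	unsigned long	a;	/* written by thread 1 */
	unsigned long	b;	/* written by thread 2 */
};

/*
 * Better: pad the first counter out to a full cache line so that
 * each thread's counter lives in its own line.
 */
struct counters_padded {
	unsigned long	a;
	char		pad[CACHELINE - sizeof(unsigned long)];
	unsigned long	b;
};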

Finally, last night I discovered that overzealous use of __inline is  
likely to have been the culprit.  I removed the __inline keyword from  
all of the functions that aren't part of the critical path for small  
memory allocation/deallocation, and measured a substantial  
performance improvement.
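
The change amounts to the following pattern (a sketch with made-up
function names and a made-up size threshold, not the actual jemalloc
code):

#include <stdlib.h>

/* Hot path: called for nearly every request, so keep it __inline. */
static __inline void *
small_alloc(size_t size)
{
	return (malloc(size));	/* stand-in for the real fast path */
}

/*
 * Cold path: taken rarely.  Inlining this everywhere only bloats the
 * instruction cache, so no __inline keyword here.
 */
static void *
large_alloc(size_t size)
{
	return (malloc(size));	/* stand-in for the real slow path */
}

void *
my_alloc(size_t size)
{
	return (size <= 512 ? small_alloc(size) : large_alloc(size));
}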

I don't have access to the machine that Kris ran the super-smack  
benchmarks on, so I set up a couple of VMware guest systems using  
VMware 5.5 on a 3.2 GHz P4 system (HTT enabled for the host, but not  
the guests).  The two guests were configured very similarly, so that  
the only pertinent difference was that one was using phkmalloc, and  
the other was using jemalloc.  Before the __inline changes, jemalloc  
was ~2.5% slower than phkmalloc.  Here are the final results, which  
show jemalloc to be ~4.1% faster than phkmalloc for the benchmark:

x ss_phk
+ ss_je
|xxx  x       x     x  x x       x  +      x       +  ++        ++ + +   +|
|  |_____________A__M__________|                 |___________A___M______|  |
     N           Min           Max        Median           Avg
x   10       1594.94       1658.64       1623.73      1619.431
+   10       1649.06       1706.75       1693.23      1686.275
Difference at 95.0% confidence
         66.844 +/- 18.5532
         4.12762% +/- 1.14566%
         (Student's t, pooled s = 19.7459)

The numbers reported are queries/second, so bigger is better.

I used mysql 5.0.16, super-smack 1.3, libthr, and  
MALLOC_OPTIONS='aj', with 10 replicates of the following command for  
each VMware guest:

	super-smack -d mysql select-key.smack 4 10000

It's worth noting that, as near as I can tell, almost all memory  
allocation happens in a single mysqld thread.  Presumably, a master  
thread is handing buffers to slaves, which process the buffers, then  
free them.  As such, this test doesn't benefit from jemalloc's  
multiple arenas, so it is not a very good test of SMP scalability, at  
least with regard to malloc.
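
That usage pattern looks schematically like this (a minimal pthreads
sketch of the producer/consumer flow described above, not mysqld
code):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t	lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t	cv = PTHREAD_COND_INITIALIZER;
static char		*pending = NULL;

/*
 * Worker: waits for a buffer, processes it, then frees memory that
 * the master thread allocated.  All allocation stays in the master.
 */
static void *
worker(void *arg)
{
	char *buf;

	pthread_mutex_lock(&lock);
	while (pending == NULL)
		pthread_cond_wait(&cv, &lock);
	buf = pending;
	pending = NULL;
	pthread_mutex_unlock(&lock);
	/* ... process buf ... */
	free(buf);
	return (NULL);
}

int
main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);
	pthread_mutex_lock(&lock);
	pending = strdup("query buffer");	/* master does the malloc */
	pthread_cond_signal(&cv);
	pthread_mutex_unlock(&lock);
	pthread_join(tid, NULL);
	return (0);
}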

==> cfrac (single-threaded)

cfrac is one of many benchmark programs included in the Hoard memory
allocator source distribution (see ~emery/hoard/).  I ran cfrac as
follows:

	cfrac 47582602774358115722167492755475367767

(Same VMware setup as for super-smack benchmarks.)

                 Wall time (seconds)
phkmalloc 'aj': 18.75, 18.68, 18.69
jemalloc 'aj':  17.57, 17.65, 17.56

I haven't looked at allocation traces of cfrac in quite a while, but  
IIRC, it does a lot of small allocations of various sizes.

==> sh6bench (single-threaded)

sh6bench is a quirky malloc benchmark that has been used in some of
the published malloc literature, so I include it here.  sh6bench does
cycles of allocating groups of objects, where the size of objects in
each group is random.  Sometimes the objects are held for a while
before being freed.  (A sketch of this pattern appears after the
parameter list below.)  I ran sh6bench with the following interactive
runtime parameters:

call count: 50000
min block size: 1
max block size: 1000
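
Schematically, the benchmark's inner loop does something like the
following (my paraphrase of the pattern, not the sh6bench source; the
group and cycle counts are illustrative):

#include <stdlib.h>

#define NGROUP	100	/* objects per group */
#define NCYCLE	1000	/* allocation cycles */

int
main(void)
{
	void *held[NGROUP];
	size_t sz;
	int i, j;

	for (i = 0; i < NCYCLE; i++) {
		/* Every object in a group has the same random size. */
		sz = 1 + (size_t)(rand() % 1000);
		for (j = 0; j < NGROUP; j++)
			held[j] = malloc(sz);
		/*
		 * sh6bench holds some groups for a while before
		 * freeing them; this sketch frees each group
		 * immediately.
		 */
		for (j = 0; j < NGROUP; j++)
			free(held[j]);
	}
	return (0);
}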

phkmalloc doesn't fare very well with its default settings since the  
benchmark's memory usage fluctuates enough to cause phkmalloc to  
repeatedly allocate and free pages.  Therefore I report numbers for  
multiple MALLOC_OPTIONS settings.  Since the program prompts for  
interactive input, I report the sum of user and sys time, rather than  
wall time.  (Same VMware setup as for super-smack benchmarks.)

                     user+sys time (seconds)
phkmalloc 'aj':     25.66, 25.78, 22.50
phkmalloc 'aj>>>>': 13.72, 13.77, 13.69
jemalloc 'aj':      17.88, 17.05, 17.10

If phkmalloc's cache size is increased adequately, it beats  
jemalloc.  jemalloc simply has to do more work when splitting and  
coalescing regions than phkmalloc does, and this benchmark severely  
stresses that aspect of jemalloc.  It's perhaps worth noting that  
jemalloc's peak memory usage is ~22% lower than phkmalloc's, which  
means that it's doing a better job of avoiding fragmentation.  At  
least the extra algorithmic overhead gains us something.

==> gs (single-threaded)

Ghostscript is the well known PostScript interpreter.  I took the
biggest PDF I have (the PostScript Language Reference, 3rd Ed.),
converted it to PostScript, then ran AFPL Ghostscript 8.53 on the
result.
As with sh6bench, memory usage fluctuated enough to cause excessive  
allocation/deallocation of pages with phkmalloc, so I report times  
for multiple MALLOC_OPTIONS settings.  (Same VMware setup as for  
super-smack benchmarks.)

                     Wall time (seconds)
phkmalloc 'aj':     92.12, 92.03, 92.90
phkmalloc 'aj>>>>': 61.11, 61.29, 61.73
jemalloc  'aj':     62.34, 62.43, 62.20
jemalloc  'ajQQQc': 60.85, 60.48, 60.81

Okay, the 'ajQQQc' parameterization for jemalloc is a bit silly, but  
if phkmalloc gets help, then so should jemalloc. =)

=== Proposed approach for inclusion in CURRENT ===

Here's a current version of all the changes that are necessary for
jemalloc to go into the tree.

This patch includes (roughly):

1) Fix gensnmptree bug (missing variable initialization).  I emailed
the maintainer, Hartmut Brandt, about this today, and am awaiting a
response.

2) Fix kldxref bug (assumes allocation is adequately aligned).  This
needs to be committed along with (5), since the fix uses
posix_memalign().

3) Move calloc() from calloc.c to malloc.c (enables better statistics
gathering).

4) Add _malloc_{pre,post}fork(), for use by threading libraries.  (A
sketch of the intended usage appears after this list.)

5) Replace malloc(), calloc(), realloc(), and free().  Add
posix_memalign().
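
Regarding (4), the intent is for a threading library to hold all of
the allocator's locks across fork(2).  Schematically (a sketch of the
intended usage; the wrapper name is made up):

#include <sys/types.h>
#include <unistd.h>

void	_malloc_prefork(void);
void	_malloc_postfork(void);

/* Hypothetical fork() wrapper, as a threading library might provide. */
pid_t
thr_fork_wrapper(void)
{
	pid_t pid;

	_malloc_prefork();	/* acquire all allocator locks */
	pid = fork();
	_malloc_postfork();	/* release them in parent and child */
	return (pid);
}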

I'd like to commit (3) and (4) first, so that we have a version of  
phkmalloc in the cvs repository that is API-compatible with libc as  
it will be after jemalloc is committed.  This will make it easier to  
swap the phkmalloc code in, should we wish to do further benchmarking  
or regression testing at some point in the future.  Poul-Henning has  
agreed with this in principle, though I haven't yet supplied him with  
diffs for only (3) and (4).

So, how about it?  Is jemalloc ready to go in now?
