New malloc ready, take 42
Jason Evans
jasone at freebsd.org
Fri Dec 23 01:14:16 PST 2005
On 28 November, I announced a new malloc implementation that I've
been working on as a replacement for the current libc malloc. The
main goal for this rewrite of malloc is to provide better scalability
for multi-threaded programs on SMP systems, without degrading
performance in general. The benchmarks I've run indicate that the
code now meets this goal.
===
=== Resolved issues ===
===
Several people provided valuable feedback regarding architecture,
(in)correct function, and performance. At this point, I think I have
addressed all outstanding issues. Here's a recap:
* Hiten Pandya and others requested that I use the sys/tree.h
implementation of red-black trees rather than my own. This is done.
* Jon Dama objected to leaving sbrk() support out of the malloc
implementation for 32-bit architectures. I've added support for
sbrk() on i386 and arm.
* David Xu pointed out that the hand-rolled spinlocks did not do
priority inheritance, which could lead to priority inversion
deadlocks. This has been fixed by using the libc spinlocks.
* Kris Kennaway discovered that sparc64 doesn't have TLS, and Olivier
Houchard contacted me about the same issue for arm. I removed the
use of TLS on those architectures.
* Claus Guttesen and Devon O'Dell reported issues in kldxref and X on
amd64. The kldxref problem has been fixed. The X problem is a bit
trickier. I borrowed a Turion-based laptop from Rob Braun and indeed
found that X wouldn't start. Eric Anholt suggested trying the xorg-
server-snap port rather than xorg-server, and the problem went away.
Eric explained that xorg-server uses custom ELF *.a module loading
(the idea being that the same exact module could be used on different
operating systems), whereas xorg-server-snap uses dlopen(3) and
friends. Apparently, we are on the cusp of the transition to using
the latter, so for now the workaround on amd64 is to use xorg-server-
snap, with the expectation that soon the standard xorg-server port
will be upgraded.
===
=== Benchmarking ad nauseam ===
===
==> Lever & Boreham benchmarks (multi-threaded)
Chuck Lever and David Boreham published "malloc() Performance in a
Multithreaded Linux Environment" in the proceedings of the FREENIX
track of the 2000 USENIX Annual Technical Conference. By some
incredible stroke of luck, I found the source code for their
benchmarks, and Kris Kennaway ran the first of them on a 14-CPU
sparc64 system as well as a dual-dual opteron system. (The other two
benchmarks aren't very enlightening.) It's important to keep in mind
that this is a micro-benchmark that demonstrates a worst case
scenario for non-SMP-scalable malloc implementations. That said,
with 5 threads, jemalloc outperformed phkmalloc by 15X on sparc64 and
80X on amd64.
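For anyone who wants a feel for what the benchmark does, it's
essentially N threads doing nothing but malloc() and free() in a
tight loop. The following is just an illustrative sketch of that
pattern (not the Lever & Boreham code; the thread count, object size,
and iteration count are arbitrary):

/*
 * Illustrative sketch of a Lever/Boreham-style micro-benchmark: each
 * thread does nothing but allocate and free small objects, so a
 * malloc implementation with a single global lock serializes the
 * threads almost completely.  (Not the actual benchmark source.)
 */
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS	5
#define NITERS		1000000

static void *
worker(void *arg)
{
	int i;

	for (i = 0; i < NITERS; i++) {
		void *p = malloc(64);	/* arbitrary small size */

		if (p == NULL)
			abort();
		free(p);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t threads[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return (0);
}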
==> super-smack (multi-threaded)
The SMP scalability tests were very encouraging, but Kris also ran
some super-smack benchmarks on a dual-dual opteron system, and
jemalloc was actually a bit slower than phkmalloc in that case
(~2.7-3.7% slower, depending on jemalloc configuration). I've been
obsessing over that for over a week now, and one result is that
jemalloc is now more tunable -- the delayed coalescence cache size,
the number of allocation arenas, and the allocation quantum size can
all be set via MALLOC_OPTIONS.
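For anyone who wants to experiment with those knobs, the settings
reach the allocator through the MALLOC_OPTIONS environment variable;
the specific flag letters for the new tunables are presumably
documented in the malloc(3) page included with the patch, so I won't
guess at them here. A trivial sketch of driving a program under a
particular setting (equivalent to `env MALLOC_OPTIONS=aj command`;
'aj' is just the string I use for the benchmarks below):

/*
 * Trivial sketch: run a command with a specific MALLOC_OPTIONS value.
 * This is just `env MALLOC_OPTIONS=aj command` written out in C; the
 * flag letters for the new tunables are deliberately left out here.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args ...]\n", argv[0]);
		return (1);
	}
	if (setenv("MALLOC_OPTIONS", "aj", 1) != 0) {
		perror("setenv");
		return (1);
	}
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return (1);
}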
However, changing the tuning parameters wasn't enough to make up the
performance difference for the super-smack benchmarks. So, I spent
some time looking at memory fragmentation. Taking a cue from Poul-
Henning Kamp's malloc work, I finally wrote a program that converts
ktrace output to memory usage graphs. Here's a series of graphs
from various stages of the fragmentation avoidance changes I made.
The test program was 'kldxref /boot/kernel' on amd64.
http://people.freebsd.org/~jasone/jemalloc/plots/kldxref.png
http://people.freebsd.org/~jasone/jemalloc/plots/kldxref13.gif
http://people.freebsd.org/~jasone/jemalloc/plots/kldxref15.gif
The interesting thing to note is that in the first two plots, the
highly fragmented portion of the graph continuously grows, whereas
in the last plot, the fragmented portion stops growing about 1/3 of
the way through the program run. (The first plot only shows the
bottom half of what the other two plots show.) This is all really
cool, but alas, it didn't help the super-smack results. I was able
to use the graphing program to determine that cache line sharing was
potentially causing even more issues than suspected, but we already
knew that cache line sharing was an issue as a result of some of the
runs that Kris did.
Finally, last night I discovered that overzealous use of __inline is
likely to have been the culprit. I removed the __inline keyword from
all of the functions that aren't part of the critical path for small
memory allocation/deallocation, and measured a substantial
performance improvement.
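To make the nature of the change concrete, here is a toy sketch (the
function names and bodies are invented, not jemalloc source): the
small-allocation fast path keeps __inline, and everything off that
path becomes an ordinary static function so it no longer gets
duplicated into every call site.

/*
 * Toy sketch of the __inline change (invented names, not jemalloc
 * source): keep __inline only on the small-allocation fast path and
 * let rarely-taken paths be ordinary functions.
 */
#include <stddef.h>
#include <stdio.h>

/* Hot path: executed for nearly every small malloc()/free(). */
static __inline size_t
size_to_bin(size_t size)
{

	/* Toy rounding to a 16 byte quantum, for illustration only. */
	return ((size + 15) / 16);
}

/* Cold path: was __inline before; now an ordinary static function. */
static size_t
huge_size_to_pages(size_t size)
{

	return ((size + 4095) / 4096);
}

int
main(void)
{

	printf("bin %zu, pages %zu\n", size_to_bin(100),
	    huge_size_to_pages(1000000));
	return (0);
}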
I don't have access to the machine that Kris ran the super-smack
benchmarks on, so I set up a couple of VMware guest systems using
VMware 5.5 on a 3.2 GHz P4 system (HTT enabled for the host, but not
the guests). The two guests were configured very similarly, so that
the only pertinent difference was that one was using phkmalloc, and
the other was using jemalloc. Before the __inline changes, jemalloc
was ~2.5% slower than phkmalloc. Here are the final results, which
show jemalloc to be ~4.1% faster than phkmalloc for the benchmark:
x ss_phk
+ ss_je
[ministat distribution plot]
    N           Min           Max        Median           Avg        Stddev
x  10       1594.94       1658.64       1623.73      1619.431     21.833342
+  10       1649.06       1706.75       1693.23      1686.275     17.410025
Difference at 95.0% confidence
66.844 +/- 18.5532
4.12762% +/- 1.14566%
(Student's t, pooled s = 19.7459)
The numbers reported are queries/second, so bigger is better.
I used mysql 5.0.16, super-smack 1.3, libthr, and
MALLOC_OPTIONS='aj', with 10 replicates of the following command for
each VMware guest:
super-smack -d mysql select-key.smack 4 10000
It's worth noting that, as near as I can tell, almost all memory
allocation happens in a single mysqld thread. Presumably, a master
thread is handing buffers to slaves, which process the buffers, then
free them. As such, this test doesn't benefit from jemalloc's
multiple arenas, so it is not a very good test of SMP scalability, at
least with regard to malloc.
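To contrast with the Lever & Boreham loop above, the pattern I
believe super-smack/mysqld produces looks roughly like the sketch
below (purely illustrative, not mysqld code): one thread does all of
the malloc() calls and other threads only free(), so spreading
allocations across multiple arenas buys nothing.

/*
 * Illustrative sketch (not mysqld code): a single "master" thread
 * performs every malloc(), and worker threads only free().  All of
 * the allocation traffic lands in the master thread's arena, so
 * multiple arenas don't help this pattern.
 */
#include <pthread.h>
#include <stdlib.h>

#define NWORKERS	4
#define NBUFS		100000

static void *bufs[NBUFS];

static void *
worker(void *arg)
{
	long id = (long)arg;
	long i;

	/* Each worker frees its slice of the master's allocations. */
	for (i = id; i < NBUFS; i += NWORKERS)
		free(bufs[i]);
	return (NULL);
}

int
main(void)
{
	pthread_t threads[NWORKERS];
	long i;

	/* The master thread allocates every buffer. */
	for (i = 0; i < NBUFS; i++)
		bufs[i] = malloc(256);

	for (i = 0; i < NWORKERS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	for (i = 0; i < NWORKERS; i++)
		pthread_join(threads[i], NULL);
	return (0);
}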
==> cfrac (single-threaded)
cfrac is one of many benchmark programs that are included in the
Hoard memory allocator source distribution (see
http://www.cs.umass.edu/~emery/hoard/). I ran cfrac as follows:
cfrac 47582602774358115722167492755475367767
(Same VMware setup as for super-smack benchmarks.)
Wall time (seconds)
phkmalloc 'aj': 18.75, 18.68, 18.69
jemalloc 'aj': 17.57, 17.65, 17.56
I haven't looked at allocation traces of cfrac in quite a while, but
IIRC, it does a lot of small allocations of various sizes.
==> sh6bench (single-threaded)
sh6bench (see http://www.microquill.com/smartheap/shbench/bench.zip)
is a quirky malloc benchmark that has been used in some of the
published malloc literature, so I include it here. sh6bench does
cycles of allocating groups of objects, where the size of objects in
each group is random. Sometimes the objects are held for a while
before being freed. I ran sh6bench with the following interactive
runtime parameters:
call count: 50000
min block size: 1
max block size: 1000
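From that description, the allocation pattern is roughly what the
sketch below does (my paraphrase, not the sh6bench source; the group
size and hold policy are made up): allocate a group of randomly sized
blocks each cycle, keep some groups alive for a while, and free the
rest immediately, which makes total memory usage swing up and down.

/*
 * Rough paraphrase of the sh6bench allocation pattern (not the actual
 * sh6bench source): each cycle allocates a group of randomly sized
 * blocks; some groups are held for a while, the rest are freed
 * immediately, so memory usage fluctuates continuously.
 */
#include <stdlib.h>

#define CYCLES		50000
#define GROUP_SIZE	32
#define MIN_BLOCK	1
#define MAX_BLOCK	1000
#define HELD_GROUPS	64	/* number of groups kept alive at once */

static void *held[HELD_GROUPS][GROUP_SIZE];

int
main(void)
{
	int c, i;

	for (c = 0; c < CYCLES; c++) {
		void *group[GROUP_SIZE];

		for (i = 0; i < GROUP_SIZE; i++) {
			size_t size = MIN_BLOCK +
			    (size_t)rand() % (MAX_BLOCK - MIN_BLOCK + 1);

			group[i] = malloc(size);
		}
		if (c % 10 == 0) {
			/* Hold this group; free whatever it displaces. */
			int slot = (c / 10) % HELD_GROUPS;

			for (i = 0; i < GROUP_SIZE; i++) {
				free(held[slot][i]);
				held[slot][i] = group[i];
			}
		} else {
			for (i = 0; i < GROUP_SIZE; i++)
				free(group[i]);
		}
	}
	return (0);
}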
phkmalloc doesn't fare very well with its default settings since the
benchmark's memory usage fluctuates enough to cause phkmalloc to
repeatedly allocate and free pages. Therefore I report numbers for
multiple MALLOC_OPTIONS settings. Since the program prompts for
interactive input, I report the sum of user and sys time, rather than
wall time. (Same VMware setup as for super-smack benchmarks.)
user+sys time (seconds)
phkmalloc 'aj': 25.66, 25.78, 22.50
phkmalloc 'aj>>>>': 13.72, 13.77, 13.69
jemalloc 'aj': 17.88, 17.05, 17.10
If phkmalloc's cache size is increased adequately, it beats
jemalloc. jemalloc simply has to do more work when splitting and
coalescing regions than phkmalloc does, and this benchmark severely
stresses that aspect of jemalloc. It's perhaps worth noting that
jemalloc's peak memory usage is ~22% lower than phkmalloc's, which
means that it's doing a better job of avoiding fragmentation. At
least the extra algorithmic overhead gains us something.
==> gs (single-threaded)
Ghostscript is the well known PostScript interpreter. I took the
biggest PDF I have (the PostScript Language Reference, 3rd Ed.) and
converted it to PostScript, then ran AFPL Ghostscript 8.53 as follows:
gs -dBATCH -dNODISPLAY PS3.ps
As with sh6bench, memory usage fluctuated enough to cause excessive
allocation/deallocation of pages with phkmalloc, so I report times
for multiple MALLOC_OPTIONS settings. (Same VMware setup as for
super-smack benchmarks.)
Wall time (seconds)
phkmalloc 'aj': 92.12, 92.03, 92.90
phkmalloc 'aj>>>>': 61.11, 61.29, 61.73
jemalloc 'aj': 62.34, 62.43, 62.20
jemalloc 'ajQQQc': 60.85, 60.48, 60.81
Okay, the 'ajQQQc' parameterization for jemalloc is a bit silly, but
if phkmalloc gets help, then so should jemalloc. =)
===
=== Proposed approach for inclusion in CURRENT ===
===
Here's a current version of all the changes that are necessary for
jemalloc to go into the tree:
http://people.freebsd.org/~jasone/jemalloc/patches/jemalloc_20051222c.diff
This patch includes (roughly):
1) Fix gensnmptree bug (missing variable initialization). I emailed
the maintainer, Hartmut Brandt, about this today, and am awaiting a
reply.
2) Fix kldxref bug (assumes allocation is adequately aligned). This
needs to be committed along with (5), since the fix uses
posix_memalign() (a usage sketch follows this list).
3) Move calloc() from calloc.c to malloc.c (enables better statistics
gathering).
4) Add _malloc_{pre,post}fork(), for use by threading libraries.
5) Replace malloc(), calloc(), realloc(), and free(). Add
posix_memalign().
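For anyone unfamiliar with posix_memalign(), which items (2) and (5)
refer to: it returns memory whose address is a multiple of a
caller-specified alignment, instead of relying on whatever alignment
malloc() happens to provide. A minimal usage sketch (the alignment
and size are arbitrary):

/*
 * Minimal posix_memalign() usage sketch: ask for storage with an
 * explicit alignment rather than assuming malloc() returns something
 * sufficiently aligned.  Alignment and size are arbitrary here.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	void *p;
	int error;

	/* Alignment must be a power of two multiple of sizeof(void *). */
	error = posix_memalign(&p, 64, 4096);
	if (error != 0) {
		fprintf(stderr, "posix_memalign failed: %d\n", error);
		return (1);
	}
	printf("aligned allocation at %p\n", p);
	free(p);
	return (0);
}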
I'd like to commit (3) and (4) first, so that we have a version of
phkmalloc in the cvs repository that is API-compatible with libc as
it will be after jemalloc is committed. This will make it easier to
swap the phkmalloc code in, should we wish to do further benchmarking
or regression testing at some point in the future. Poul-Henning has
agreed with this in principle, though I haven't yet supplied him with
diffs for only (3) and (4).
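As background for (4): the point of _malloc_{pre,post}fork() is to
let a threading library grab the allocator's locks before fork() and
release them afterward in both parent and child, so the child of a
threaded process doesn't inherit a lock held by a thread that no
longer exists. The sketch below only shows the general idea via
pthread_atfork(3); the void prototypes are my assumption, and the
threading libraries actually call the hooks from their own fork()
wrappers rather than registering handlers like this.

/*
 * Hedged sketch of the _malloc_prefork()/_malloc_postfork() idea.
 * The void prototypes are assumed, and the threading libraries call
 * these hooks directly from their fork() wrappers; pthread_atfork()
 * is used here only to illustrate locking malloc across fork().
 */
#include <pthread.h>

extern void _malloc_prefork(void);	/* assumed prototype */
extern void _malloc_postfork(void);	/* assumed prototype */

static void
register_malloc_fork_hooks(void)
{

	/* Lock malloc before fork(); unlock in both parent and child. */
	pthread_atfork(_malloc_prefork, _malloc_postfork, _malloc_postfork);
}

int
main(void)
{

	register_malloc_fork_hooks();
	return (0);
}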
So, how about it? Is jemalloc ready to go in now?
Thanks,
Jason