malloc(3) (hopefully) set for 7.0

Wed Mar 28 22:36:38 UTC 2007

=== Introduction ===

Those of you who watch the cvs commit logs may have noticed several 
rather significant changes to malloc over the past few weeks.  I took a 
fresh look at malloc after letting the details fade over the past year, 
and saw some significant opportunities for improvement.  I appreciate 
the FreeBSD community's tolerance of these changes, which were perhaps 
of questionable wisdom given that we are now in code slush. =)

Since these changes obsolete several aspects of the paper I presented 
last year at BSDcan, it seems like a good idea to provide a public 
summary of these changes and their rationale.  The changes focus on two 
main issues: 1) mapped page reduction during low memory usage phases of 
application execution, and 2) fragmentation reduction.

=== 1) Mapped page reduction ===

First, here is a bit of relevant background.  jemalloc breaks memory 
into 1MB chunks that are aligned at multiples of the chunk size.  This 
makes it possible to find metadata for small and large user objects in 
constant time.  (For huge objects, it takes lg(n) time, where n is the 
number of huge allocations.)  The point is that chunks are a fundamental 
aspect of what makes jemalloc tick.  However, there is a downside to 
managing memory in chunks: unless we use madvise(2), every page of a 
chunk that is touched remains physically backed until the entire chunk 
can be unmapped.  Thus, even a single in-use page can cause retention of 
up to 1MB of memory.  This is a bit of a catch-22, since madvise(2) is 
too expensive to enable by default, but if we really need it, then we 
really should be using it for all applications, since paging is a 
system-wide issue.  There is no simple way out of this conundrum, which 
is why I have been working to reduce the number of chunks that remain 
mapped for programs that typically operate at well below their peak 
memory usage.

In order to reduce the number of mapped pages, we need some way to 
preferentially pack data into certain chunks, so that others can drain. 
  There is a simple allocation policy that in practice had been 
demonstrated to work very well: prefer low addresses.  jemalloc already 
did this where convenient (a lesson learned from phkmalloc), but the 
recent changes push this policy throughout the allocator, almost to the 
maximum extreme possible.

Each arena manages a set of chunks that are carved into runs.  These 
runs back small and large allocations.  jemalloc now uses the lowest fit 
for run allocation.  Equal-sized small allocations are grouped together 
into runs since it is  possible to fit multiple small objects in a 
single run.  This means that such a run may have very low utilization, 
so we try to keep runs as full as possible in order to reduce the total 
number of runs in use for any given size class.  Previously, jemalloc 
used an overly complicated run promotion/demotion system that was 
intended to place several consecutive allocation requests in the same 
run, reduce the overhead of switching between runs, and try to keep the 
runs moderately full.  However, experiments demonstrated that run 
switching overhead has at most a small impact on allocator performance, 
and furthermore that the promotion/demotion algorithms were not very 
effective at decreasing run switching anyway.  That being the case, I 
experimented with increasingly expensive run switching  policies. 
Finally, I tried always switching to the lowest run with any unused 
space, which was the ideal layout policy given my goals.  This change 
had the intended effect of tending to let high chunks drain.  Much to my 
surprise though, there was no significant impact on runtime.

I put this last experiment off a long time, because my intuition was 
that it would be far too expensive to add so many red-black tree 
modifications.  I think there are two contributing factors to why I was 
wrong: 1) the highest run switching rate I saw for any application was 
~1% of the total allocation rate (though carefully crafted code could 
push that up to nearly 50%), and 2) cache locality improvements tend to 
help a bit.  My initial expectation would be for such a layout policy to 
*decrease* locality, since we have the potential to spread consecutive 
allocations across a large number of independent runs.  However, in the 
steady state of a long-running application, we have a hard time dealing 
with this problem anyway, and by packing objects as low as possible in 
memory, we actually tend to do better than if we were to use non-full 
runs in MRU order.  Anyway, the reason I dedicate so much space to this 
issue is that it still makes me uneasy, but I have been unable to find 
any application for which the layout policy causes significant 
performance degradation.  If you think you may have found such a case, 
you will be able to gain some insight by looking at the statistics 
output from setting MALLOC_OPTIONS=P.  Look at the 'reruns' column (# of 
run switches) versus the 'requests' column in order to determine what 
percentage of requests suffer the overhead of run switching.  If you see 
very high run switch rates for a particular application (say, >5%), 
please tell me so that I can run further experiments.

=== 2) Fragmentation reduction ===

We want to fit user objects in memory as tightly as possible.  jemalloc 
was doing a substandard job in a couple of regards.  First, it used a 
binary buddy scheme to manage runs within chunks, meaning that runs were 
constrained to be 2^n pages, for 0 < n < 20.  This caused significant 
internal fragmentation for, say, 20kB objects.  To improve the 
situation, I switched to using a straightforward extent-based system. 
This was initially intended only as a fix for the problem just 
described, but it turned out to also be useful for runs that back small 
allocations.  Small allocation runs can now be more precisely sized such 
that the run header overhead is barely below some threshold.  This 
allows the bitmaps that track which small objects are in use to be as 
short as possible while still keeping metadata overhead under control.

Previously, all run headers for small object runs were the same size, 
regardless of how many bits were actually needed in the header bitmap to 
track the run's objects.  This was clearly wasteful.  My memory is fuzzy 
regarding the original rationale for fixed-size run headers, but I think 
it had to do with the complexity of calculating run header size, number 
of runs, and run size all at the same time.  Indeed, this proved to be 
tricky, but well worth the effort.

=== Summary ===

With these changes in place, all of my experiments indicate substantial 
improvements in memory usage, as well as overall speed improvements.  We 
still have several months before the 7.0 release to verify that I didn't 
miss anything important.  Even so, I tested these changes more 
thoroughly than any previous changes (excepting Kris Kennaway's testing 
before jemalloc entered cvs), so I think the risk of major problems is 
reasonably low.

Other than bug fixes, I do not intend to make any more changes before 
7.0.  I have developed some novel algorithms for essentially eliminating 
thread contention on SMP systems, but it is too late in the development 
cycle to introduce such changes (not to mention that I lack the hardware 
  to evaluate the algorithms).  Thanks again for your patience and 
support.  Please let me know if I can be of help in diagnosing suspected 
malloc issues.

Jason