New libc malloc patch
jasone at canonware.com
Sun Dec 11 18:30:50 PST 2005
Thank you for your code review! Specific responses follow.
On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
> Just a few comments, not sure if all of them are valid since I don't
> have a recent enough system to apply the patch and test it, so my
> comments are just based on reading the patch. I might have missed some
> thing hidden between all the lines differing. Can you put up a patched
> malloc.c somewhere?
> * Fork safety functions
> Nice to have for all allocators and is something I missed having.
> like to have them regardless if your malloc becomes standard or not.
I think that the implementation is currently fork-safe. The threads
libraries call _malloc_prefork() and _malloc_postfork() in order to
avoid locking issues.
> * magazines, per-arena caches
> If I didn't miss something you don't seem to have it, pardon if I
> them in my quick readings. They can really help the scalability and is
> something that is implemented in Solaris libumem allocator with great
> success. I implemented them and they helped bring the scalability up
> quite a bit on the system I tested my allocator on aswell as bringing
> the allocation latency down.
Arenas do use caches for objects below a size threshold (currently
~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).
The caching mechanism is quite different than magazines, but still
> * Saw you used a tree to look up the allocation sizes at free, this
> * differs from how I and Solaris libumem does it. Instead I went for
> * prepending a 8/16 byte tag containing the size of the allocation for
> * lockless size lookup.
Actually, my implementation prepends a 4 byte tag that contains
allocated/free flags, and the size of the allocation. The trees are
only used to look up the sizes of huge allocations (> 8 MB on 32-bit
platforms, and > 256 MB on 64-bit platforms). Huge allocations start
at chunk boundaries, so prepending a tag would require an entire
extra page (and would cause potentially serious external
fragmentation on 32-bit systems).
> * Bucket sizes
> If I didn't miss something you are using power of two sizes. Might be
> better to use smaller sizes to avoid internal fragmentation, aswell as
> improving locking granularity.
My implementation uses quantum-sized buckets (not power of two size
classes). The quantum size is either 8 or 16 bytes, depending on the
> * Locking granularity
> You use a single lock for the chunk allocation instead of a lock per
> bucket size or something similar. This means that you limit access
> unless threads allocate from their private arenas. Since you hash
> to get
> an arena there might be multiple threads accessing the same arena
> at the
> same time introducing contention. Might be better to have a lock per
> allocation size to somewhat improve the situation.
Using a lock per allocation size requires that slower algorithms be
used in some places (and it certainly complicates certain other
operations). This didn't seem like a worthwhile tradeoff to me. By
making the code faster and simpler, less time is spent in the malloc
code. Unless an app does nothing but malloc/free, I don't expect
this to be an issue. Even microbenchmarks that do nothing but malloc/
free don't appear to suffer.
> * Locking primitive
> The biggest issue and as David Xu pointed out is probably the locking
> primitives. The SPINLOCK use has a limit in the threading library and
> makes is hard to have a lot of mutexes. I ended up using a wrapper
> around the umtx_lock function to get recursive mutexes and it would
> probably be better to extend the umtx functions to handle recursion.
> This would probably also be appreciated by other malloc
> Might be interesting to implement some of the ideas from the Linux
> implementation to help umtx.
I have been contemplating creating a separate spinlock API that
doesn't require the threads library to track the spinlocks across
fork. This would (if I understand correctly) remove the current
static spinlock limitations.
As for supporting recursive spinlocks, I doubt that the overhead
would be acceptable in general. If I could get rid of the need for
the one recursive lock in malloc.c, I certainly would. =)
> * Big allocations
> It might be preferable to use mmap/munmap directly for big allocations
> (bigger than a few pages) to avoid the overhead of trying to be smart.
> These big allocations should be pretty few and shouldn't be that
> pessimized by this. It also means some memory savings as you can
> directly free memory from big allocations.
I don't really see the approach I'm taking as "trying to be smart",
so much as "making it simple by using the same algorithm for almost
everything". Many other malloc implementations I've examined use
three or more allocation approaches, depending on the allocation
size. jemalloc uses only two, and huge allocations really are huge,
such that they rarely if ever happen. In practice, I don't think
there is a real memory savings to be had. Besides, madvise() calls
in strategic locations would achieve approximately the same result.
> * Ownership
> Didn't look that much for it so might have missed it. Basicly what
> happens if you free memory allocated in another arena, do you return
> memory to the arena the memory was allocated from or to the current
> threads arena? Not returning it to the original arena leads to cache
> line sharing. Returning it to the original adds some overhead and it
> might be argued that applications doing this might not be considered
> scalable and should be fixed instead.
Allocated objects can be associated with their parent arenas in
constant time (by masking bits in the object addresses). This makes
it possible to always free memory back to the same arena it came from.
> * Standalone version
> Would be nice to have a standalone version to test the allocator on
> solaris and linux to compare against their allocators. Would be
> nice for
> all the people with only access to SMP machines running another OS. I
> tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both
> helpful in testing scalability and impact on small machines aswell as
> finding bugs in the implementation. I tested my allocator against
> Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x
> UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my
> allocator vs the FreeBSD libc on my 400MHz laptop with too much beer
> and almost non working cpu fan. Beer as in standing in 10cm of beer in
> my backpack =).
> Not sure how much of the implementation is FreeBSD dependant, RB-
I actually have a separate more portable implementation of jemalloc
that is part of a programming language runtime library (jemalloc is
simply a FreeBSD-ized stand-alone version of that allocator). I've
tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris. On Linux I've
compared against ptmalloc2, dlmalloc, and hoard. Unfortunately, that
version of the allocator has to make some performance sacrifices
regarding locking, since it doesn't have access to the pthreads
internals. As such, it requires a lot of care to interpret
performance results, so I haven't felt it to be worthwhile providing
that version here.
> * Focus - Single thread vs multithread
> What are the focus of the allocator, how much memory is it worth to
> waste to implement a good scalable implementation. This is the
> reason I
> think Solaris still keeps a serial allocator in libc and makes libumem
> accessible through LD_PRELOAD and friends.
jemalloc only creates two arenas for single-threaded applications
(one for internal memory management, and one for user requests), so
it doesn't really sacrifice much in the case of single-threaded
programs -- a few pages maybe. In the case of multi-threaded
programs, it potentially sacrifices memory compactness in order to
reduce contention and cache line sharing.
> * Comment
> Might be wise to fix the comment about always using mmap since brk can
> be used.
More information about the freebsd-current