New libc malloc patch

Sun Dec 11 19:15:06 PST 2005

Jason Evans wrote:

> Johan,
>
> Thank you for your code review!  Specific responses follow.

Wouldn't exactly call scrolling through a diff at 3 AM review =).

>
> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
>>
>> Just a few comments, not sure if all of them are valid since I don't
>> have a recent enough system to apply the patch and test it, so my
>> comments are just based on reading the patch. I might have missed some
>> thing hidden between all the lines differing. Can you put up a patched
>> malloc.c somewhere?
>
>
> http://www.canonware.com/~jasone/jemalloc/malloc.c
>
Thanks!

>> * Fork safety functions
>> Nice to have for all allocators and is something I missed having.  Would
>> like to have them regardless if your malloc becomes standard or not.
>
>
> I think that the implementation is currently fork-safe.  The threads  
> libraries call _malloc_prefork() and _malloc_postfork() in order to  
> avoid locking issues.
>
Hmm, I meant that the _malloc_prefork() functions are independent from 
your malloc allocation and I would like them committed regardless as 
they make the life easier for other malloc implementation.

>> * magazines, per-arena caches
>> If I didn't miss something you don't seem to have it, pardon if I  
>> missed
>> them in my quick readings. They can really help the scalability and is
>> something that is implemented in Solaris libumem allocator with great
>> success. I implemented them and they helped bring the scalability up
>> quite a bit on the system I tested my allocator on aswell as bringing
>> the allocation latency down.
>
>
> Arenas do use caches for objects below a size threshold (currently  
> ~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).   
> The caching mechanism is quite different than magazines, but still  
> very effective.
>
Ah nice, shows how little review I did. =)
Will look a bit more at this then.

>> * Saw you used a tree to look up the allocation sizes at free, this
>> * differs from how I and Solaris libumem does it. Instead I went for
>> * prepending a 8/16 byte tag containing the size of the allocation for
>> * lockless size lookup.
>
>
> Actually, my implementation prepends a 4 byte tag that contains  
> allocated/free flags, and the size of the allocation.  The trees are  
> only used to look up the sizes of huge allocations (> 8 MB on 32-bit  
> platforms, and > 256 MB on 64-bit platforms).  Huge allocations start  
> at chunk boundaries, so prepending a tag would require an entire  
> extra page (and would cause potentially serious external  
> fragmentation on 32-bit systems).
>
Isn't 8 byte alignment expected by some applications?
How do you know if a allocation is huge if you don't have a tag?
Something more to read up on i guess. =)

>> * Bucket sizes
>> If I didn't miss something you are using power of two sizes. Might be
>> better to use smaller sizes to avoid internal fragmentation, aswell as
>> improving locking granularity.
>
>
> My implementation uses quantum-sized buckets (not power of two size  
> classes).  The quantum size is either 8 or 16 bytes, depending on the  
> machine architecture.
>
Ahh sorry, just saw the pow stuff and assumed it was power of two for 
the buckets.

>> * Locking granularity
>> You use a single lock for the chunk allocation instead of a lock per
>> bucket size or something similar. This means that you limit access
>> unless threads allocate from their private arenas. Since you hash  to 
>> get
>> an arena there might be multiple threads accessing the same arena  at 
>> the
>> same time introducing contention. Might be better to have a lock per
>> allocation size to somewhat improve the situation.
>
>
> Using a lock per allocation size requires that slower algorithms be  
> used in some places (and it certainly complicates certain other  
> operations).  This didn't seem like a worthwhile tradeoff to me.  By  
> making the code faster and simpler, less time is spent in the malloc  
> code.  Unless an app does nothing but malloc/free, I don't expect  
> this to be an issue.  Even microbenchmarks that do nothing but malloc/ 
> free don't appear to suffer.
>
Yeah, not sure how much of an impact it has on your allocator it has, 
just something that improved my allocator as I move magazines between 
local arenas and the global.

>> * Locking primitive
>> The biggest issue and as David Xu pointed out is probably the locking
>> primitives. The SPINLOCK use has a limit in the threading library and
>> makes is hard to have a lot of mutexes. I ended up using a wrapper
>> around the umtx_lock function to get recursive mutexes and it would
>> probably be better to extend the umtx functions to handle recursion.
>> This would probably also be appreciated by other malloc  
>> implementations.
>> Might be interesting to implement some of the ideas from the Linux  
>> futex
>> implementation to help umtx.
>
>
> I have been contemplating creating a separate spinlock API that  
> doesn't require the threads library to track the spinlocks across  
> fork.  This would (if I understand correctly) remove the current  
> static spinlock limitations.
>
> As for supporting recursive spinlocks, I doubt that the overhead  
> would be acceptable in general.  If I could get rid of the need for  
> the one recursive lock in malloc.c, I certainly would. =)
>
Yeah, made some attempts to remove my recursive locks in the global 
arena but it made the code more complicated and didn't really do much.

>> * Big allocations
>> It might be preferable to use mmap/munmap directly for big allocations
>> (bigger than a few pages) to avoid the overhead of trying to be smart.
>> These big allocations should be pretty few and shouldn't be that
>> pessimized by this. It also means some memory savings as you can
>> directly free memory from big allocations.
>
>
> I don't really see the approach I'm taking as "trying to be smart",  
> so much as "making it simple by using the same algorithm for almost  
> everything".  Many other malloc implementations I've examined use  
> three or more allocation approaches, depending on the allocation  
> size.  jemalloc uses only two, and huge allocations really are huge,  
> such that they rarely if ever happen.  In practice, I don't think  
> there is a real memory savings to be had.  Besides, madvise() calls  
> in strategic locations would achieve approximately the same result.
>
Yeah, saving them for future use through madvise might be better. I 
chose the route to keep the big allocations simple and avoid caching and 
looking up stuff for.
Basicly read the size and if it's big munmap it directly.

>> * Ownership
>> Didn't look that much for it so might have missed it. Basicly what
>> happens if you free memory allocated in another arena, do you return
>> memory to the arena the memory was allocated from or to the current
>> threads arena? Not returning it to the original arena leads to cache
>> line sharing. Returning it to the original adds some overhead and it
>> might be argued that applications doing this might not be considered
>> scalable and should be fixed instead.
>
>
> Allocated objects can be associated with their parent arenas in  
> constant time (by masking bits in the object addresses).  This makes  
> it possible to always free memory back to the same arena it came from.
>
Ok, good.

>> * Standalone version
>> Would be nice to have a standalone version to test the allocator on
>> solaris and linux to compare against their allocators. Would be  nice 
>> for
>> all the people with only access to SMP machines running another OS. I
>> tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both
>> helpful in testing scalability and impact on small machines aswell as
>> finding bugs in the implementation. I tested my allocator against
>> Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x
>> UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my
>> allocator vs the FreeBSD libc on my 400MHz laptop with too much beer
>> and almost non working cpu fan. Beer as in standing in 10cm of beer in
>> my backpack =).
>> Not sure how much of the implementation is FreeBSD dependant, RB- 
>> trees?,
>
>
> I actually have a separate more portable implementation of jemalloc  
> that is part of a programming language runtime library (jemalloc is  
> simply a FreeBSD-ized stand-alone version of that allocator).  I've  
> tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris.  On Linux I've  
> compared against ptmalloc2, dlmalloc, and hoard.  Unfortunately, that  
> version of the allocator has to make some performance sacrifices  
> regarding locking, since it doesn't have access to the pthreads  
> internals.  As such, it requires a lot of care to interpret  
> performance results, so I haven't felt it to be worthwhile providing  
> that version here.
>
ok

>> * Focus - Single thread vs multithread
>> What are the focus of the allocator, how much memory is it worth to
>> waste to implement a good scalable implementation. This is the  reason I
>> think Solaris still keeps a serial allocator in libc and makes libumem
>> accessible through LD_PRELOAD and friends.
>
>
> jemalloc only creates two arenas for single-threaded applications  
> (one for internal memory management, and one for user requests), so  
> it doesn't really sacrifice much in the case of single-threaded  
> programs -- a few pages maybe.  In the case of multi-threaded  
> programs, it potentially sacrifices memory compactness in order to  
> reduce contention and cache line sharing.
>
As I have not tested it I take your word. Mostly wanted to know where 
you stand in the issue of pessimizing single-threaded apps vs MT. And if 
it's worth keeping phkmalloc for some apps.

>> * Comment
>> Might be wise to fix the comment about always using mmap since brk can
>> be used.
>
>
> Thanks, done.
>
> Jason

No problem, always fun to have some code that you actually understand 
the idea behind and see how it's implemented.

/Johan Bucht