New libc malloc patch
Johan Bucht
bucht at acc.umu.se
Sun Dec 11 19:15:06 PST 2005
Jason Evans wrote:
> Johan,
>
> Thank you for your code review! Specific responses follow.
Wouldn't exactly call scrolling through a diff at 3 AM review =).
>
> On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
>>
>> Just a few comments, not sure if all of them are valid since I don't
>> have a recent enough system to apply the patch and test it, so my
>> comments are just based on reading the patch. I might have missed some
>> thing hidden between all the lines differing. Can you put up a patched
>> malloc.c somewhere?
>
>
> http://www.canonware.com/~jasone/jemalloc/malloc.c
>
Thanks!
>> * Fork safety functions
>> Nice to have for all allocators and is something I missed having. Would
>> like to have them regardless if your malloc becomes standard or not.
>
>
> I think that the implementation is currently fork-safe. The threads
> libraries call _malloc_prefork() and _malloc_postfork() in order to
> avoid locking issues.
>
Hmm, I meant that the _malloc_prefork() functions are independent from
your malloc allocation and I would like them committed regardless as
they make the life easier for other malloc implementation.
>> * magazines, per-arena caches
>> If I didn't miss something you don't seem to have it, pardon if I
>> missed
>> them in my quick readings. They can really help the scalability and is
>> something that is implemented in Solaris libumem allocator with great
>> success. I implemented them and they helped bring the scalability up
>> quite a bit on the system I tested my allocator on aswell as bringing
>> the allocation latency down.
>
>
> Arenas do use caches for objects below a size threshold (currently
> ~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).
> The caching mechanism is quite different than magazines, but still
> very effective.
>
Ah nice, shows how little review I did. =)
Will look a bit more at this then.
>> * Saw you used a tree to look up the allocation sizes at free, this
>> * differs from how I and Solaris libumem does it. Instead I went for
>> * prepending a 8/16 byte tag containing the size of the allocation for
>> * lockless size lookup.
>
>
> Actually, my implementation prepends a 4 byte tag that contains
> allocated/free flags, and the size of the allocation. The trees are
> only used to look up the sizes of huge allocations (> 8 MB on 32-bit
> platforms, and > 256 MB on 64-bit platforms). Huge allocations start
> at chunk boundaries, so prepending a tag would require an entire
> extra page (and would cause potentially serious external
> fragmentation on 32-bit systems).
>
Isn't 8 byte alignment expected by some applications?
How do you know if a allocation is huge if you don't have a tag?
Something more to read up on i guess. =)
>> * Bucket sizes
>> If I didn't miss something you are using power of two sizes. Might be
>> better to use smaller sizes to avoid internal fragmentation, aswell as
>> improving locking granularity.
>
>
> My implementation uses quantum-sized buckets (not power of two size
> classes). The quantum size is either 8 or 16 bytes, depending on the
> machine architecture.
>
Ahh sorry, just saw the pow stuff and assumed it was power of two for
the buckets.
>> * Locking granularity
>> You use a single lock for the chunk allocation instead of a lock per
>> bucket size or something similar. This means that you limit access
>> unless threads allocate from their private arenas. Since you hash to
>> get
>> an arena there might be multiple threads accessing the same arena at
>> the
>> same time introducing contention. Might be better to have a lock per
>> allocation size to somewhat improve the situation.
>
>
> Using a lock per allocation size requires that slower algorithms be
> used in some places (and it certainly complicates certain other
> operations). This didn't seem like a worthwhile tradeoff to me. By
> making the code faster and simpler, less time is spent in the malloc
> code. Unless an app does nothing but malloc/free, I don't expect
> this to be an issue. Even microbenchmarks that do nothing but malloc/
> free don't appear to suffer.
>
Yeah, not sure how much of an impact it has on your allocator it has,
just something that improved my allocator as I move magazines between
local arenas and the global.
>> * Locking primitive
>> The biggest issue and as David Xu pointed out is probably the locking
>> primitives. The SPINLOCK use has a limit in the threading library and
>> makes is hard to have a lot of mutexes. I ended up using a wrapper
>> around the umtx_lock function to get recursive mutexes and it would
>> probably be better to extend the umtx functions to handle recursion.
>> This would probably also be appreciated by other malloc
>> implementations.
>> Might be interesting to implement some of the ideas from the Linux
>> futex
>> implementation to help umtx.
>
>
> I have been contemplating creating a separate spinlock API that
> doesn't require the threads library to track the spinlocks across
> fork. This would (if I understand correctly) remove the current
> static spinlock limitations.
>
> As for supporting recursive spinlocks, I doubt that the overhead
> would be acceptable in general. If I could get rid of the need for
> the one recursive lock in malloc.c, I certainly would. =)
>
Yeah, made some attempts to remove my recursive locks in the global
arena but it made the code more complicated and didn't really do much.
>> * Big allocations
>> It might be preferable to use mmap/munmap directly for big allocations
>> (bigger than a few pages) to avoid the overhead of trying to be smart.
>> These big allocations should be pretty few and shouldn't be that
>> pessimized by this. It also means some memory savings as you can
>> directly free memory from big allocations.
>
>
> I don't really see the approach I'm taking as "trying to be smart",
> so much as "making it simple by using the same algorithm for almost
> everything". Many other malloc implementations I've examined use
> three or more allocation approaches, depending on the allocation
> size. jemalloc uses only two, and huge allocations really are huge,
> such that they rarely if ever happen. In practice, I don't think
> there is a real memory savings to be had. Besides, madvise() calls
> in strategic locations would achieve approximately the same result.
>
Yeah, saving them for future use through madvise might be better. I
chose the route to keep the big allocations simple and avoid caching and
looking up stuff for.
Basicly read the size and if it's big munmap it directly.
>> * Ownership
>> Didn't look that much for it so might have missed it. Basicly what
>> happens if you free memory allocated in another arena, do you return
>> memory to the arena the memory was allocated from or to the current
>> threads arena? Not returning it to the original arena leads to cache
>> line sharing. Returning it to the original adds some overhead and it
>> might be argued that applications doing this might not be considered
>> scalable and should be fixed instead.
>
>
> Allocated objects can be associated with their parent arenas in
> constant time (by masking bits in the object addresses). This makes
> it possible to always free memory back to the same arena it came from.
>
Ok, good.
>> * Standalone version
>> Would be nice to have a standalone version to test the allocator on
>> solaris and linux to compare against their allocators. Would be nice
>> for
>> all the people with only access to SMP machines running another OS. I
>> tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both
>> helpful in testing scalability and impact on small machines aswell as
>> finding bugs in the implementation. I tested my allocator against
>> Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x
>> UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my
>> allocator vs the FreeBSD libc on my 400MHz laptop with too much beer
>> and almost non working cpu fan. Beer as in standing in 10cm of beer in
>> my backpack =).
>> Not sure how much of the implementation is FreeBSD dependant, RB-
>> trees?,
>
>
> I actually have a separate more portable implementation of jemalloc
> that is part of a programming language runtime library (jemalloc is
> simply a FreeBSD-ized stand-alone version of that allocator). I've
> tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris. On Linux I've
> compared against ptmalloc2, dlmalloc, and hoard. Unfortunately, that
> version of the allocator has to make some performance sacrifices
> regarding locking, since it doesn't have access to the pthreads
> internals. As such, it requires a lot of care to interpret
> performance results, so I haven't felt it to be worthwhile providing
> that version here.
>
ok
>> * Focus - Single thread vs multithread
>> What are the focus of the allocator, how much memory is it worth to
>> waste to implement a good scalable implementation. This is the reason I
>> think Solaris still keeps a serial allocator in libc and makes libumem
>> accessible through LD_PRELOAD and friends.
>
>
> jemalloc only creates two arenas for single-threaded applications
> (one for internal memory management, and one for user requests), so
> it doesn't really sacrifice much in the case of single-threaded
> programs -- a few pages maybe. In the case of multi-threaded
> programs, it potentially sacrifices memory compactness in order to
> reduce contention and cache line sharing.
>
As I have not tested it I take your word. Mostly wanted to know where
you stand in the issue of pessimizing single-threaded apps vs MT. And if
it's worth keeping phkmalloc for some apps.
>> * Comment
>> Might be wise to fix the comment about always using mmap since brk can
>> be used.
>
>
> Thanks, done.
>
> Jason
No problem, always fun to have some code that you actually understand
the idea behind and see how it's implemented.
/Johan Bucht
More information about the freebsd-current
mailing list