New libc malloc patch

Sun Dec 11 17:48:57 PST 2005

Ugh, forgot that I used another smtp server than the one subscribed to
the list. So the mail shouldn't have made it to the list, if it has I 
apologize. Remembered one more thing about locking granularity too.

Just a few comments, not sure if all of them are valid since I don't
have a recent enough system to apply the patch and test it, so my
comments are just based on reading the patch. I might have missed some
thing hidden between all the lines differing. Can you put up a patched
malloc.c somewhere?

I made a proof-of-concept parallel allocator for my Master's Thesis
during the spring and some of the comments are about differences and
similarities between our implementations.

* Fork safety functions
Nice to have for all allocators and is something I missed having. Would
like to have them regardless if your malloc becomes standard or not.

* magazines, per-arena caches
If I didn't miss something you don't seem to have it, pardon if I missed
them in my quick readings. They can really help the scalability and is
something that is implemented in Solaris libumem allocator with great
success. I implemented them and they helped bring the scalability up
quite a bit on the system I tested my allocator on aswell as bringing
the allocation latency down.

* Saw you used a tree to look up the allocation sizes at free, this
* differs from how I and Solaris libumem does it. Instead I went for
* prepending a 8/16 byte tag containing the size of the allocation for
* lockless size lookup.

* Bucket sizes
If I didn't miss something you are using power of two sizes. Might be
better to use smaller sizes to avoid internal fragmentation, aswell as
improving locking granularity.

* Locking granularity
You use a single lock for the chunk allocation instead of a lock per
bucket size or something similar. This means that you limit access
unless threads allocate from their private arenas. Since you hash to get
an arena there might be multiple threads accessing the same arena at the
same time introducing contention. Might be better to have a lock per
allocation size to somewhat improve the situation.

* Locking primitive
The biggest issue and as David Xu pointed out is probably the locking
primitives. The SPINLOCK use has a limit in the threading library and
makes is hard to have a lot of mutexes. I ended up using a wrapper
around the umtx_lock function to get recursive mutexes and it would
probably be better to extend the umtx functions to handle recursion.
This would probably also be appreciated by other malloc implementations.
Might be interesting to implement some of the ideas from the Linux futex
implementation to help umtx.

* sbrk vs mmap
See you added the ability to try brk first and try mmap if brk fails.
This is similar to my implementation aswell as ptmalloc. Looks good.

* Big allocations
It might be preferable to use mmap/munmap directly for big allocations
(bigger than a few pages) to avoid the overhead of trying to be smart.
These big allocations should be pretty few and shouldn't be that
pessimized by this. It also means some memory savings as you can
directly free memory from big allocations.

* Ownership
Didn't look that much for it so might have missed it. Basicly what
happens if you free memory allocated in another arena, do you return
memory to the arena the memory was allocated from or to the current
threads arena? Not returning it to the original arena leads to cache
line sharing. Returning it to the original adds some overhead and it
might be argued that applications doing this might not be considered
scalable and should be fixed instead.

* Standalone version
Would be nice to have a standalone version to test the allocator on
solaris and linux to compare against their allocators. Would be nice for
all the people with only access to SMP machines running another OS. I
tested my allocator on both Solaris 9 and Freebsd 5/6 and it was both
helpful in testing scalability and impact on small machines aswell as
finding bugs in the implementation. I tested my allocator against
Solaris libc, Hoard and libumem on a 4-cpu Solaris 9 machine (4x 
UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university and tested my 
allocator vs the FreeBSD libc on my 400MHz laptop with too much beer 
and almost non working cpu fan. Beer as in standing in 10cm of beer in
my backpack =).
Not sure how much of the implementation is FreeBSD dependant, RB-trees?,
Solaris 9 has the benefit of using a local malloc inside libc which
makes the pthread library functions accessible from the malloc
implementation. I avoided touching the libc malloc as I really didn't
care at all about single threaded apps and simply used LD_PRELOAD to
override libc malloc. Basicly I throw away memory if you allocate a
multiple of the page size since I add a tag prepending the memory and I
allocate whole pages. So a 8192 byte allocation becomes a 8200 byte
allocation, wasting a page.

* Focus - Single thread vs multithread
What are the focus of the allocator, how much memory is it worth to
waste to implement a good scalable implementation. This is the reason I
think Solaris still keeps a serial allocator in libc and makes libumem
accessible through LD_PRELOAD and friends.

* Benchmarks
Would be really nice with a few benchmarks against phkmalloc, ptmalloc,
hoard and libumem to see how well it stands. Pure local arena allocation
apps, Larson benchmarks and some other. Might also be good to see how
much more memory it consumes compared to phkmalloc.

* Comment
Might be wise to fix the comment about always using mmap since brk can
be used.

* Conclussion
Despite not having tested it it looks good and would be nice to have it
in FreeBSD. Some parts are really independent on the allocator and would
be nice to have, regardless of the general consensus.

-- 
/Johan Bucht
bucht at acc.umu.se