New libc malloc patch
Jason Evans
jasone at canonware.com
Tue Nov 29 21:08:55 GMT 2005
Jon,
Thanks for your comments. Fortunately, I don't think things are
quite as hopeless as you think, though you may be right that some
adjustments are necessary. Specific replies follow.
On Nov 29, 2005, at 12:06 PM, Jon Dama wrote:
> There exists a problem right now--localized to i386 and any other arch
> based on 32-bit pointers: address space is simply too scarce.
>
> Your decision to switch to using mmap as the exclusive source of malloc
> buckets is admirable for its modernity, but it simply cannot stand
> unless someone steps up to change the way mmap and brk interact within
> the kernel.
>
> The trouble arises from the need to set MAXDSIZ and the resulting
> effect it has in determining the start of the mmap region--which, I
> might add, is the location where the shared library loader is placed.
> This effectively (and explicitly) sets the limit for how large of a
> contiguous region can be allocated with brk.
>
> What you've done by switching the system malloc to exclusively using
> mmap is induced a lot of motivation on the part of the sysadmin to push
> that brk/mmap boundary down.
>
> This wouldn't be a problem except that you've effectively shot in the
> foot dozens of alternative C malloc implementations, not to mention the
> memory allocator routines used in obscure languages such as Modula-3
> and Haskell that rely on brk-derived buckets.
>
> This isn't playing very nicely!
Where should MAXDSIZ be? Given scarce address space, the best we can
hope for is setting it to the "least bad" default, as measured by the
behavior of the programs we care about. No matter what we do, some
programs lose.
That said, it turns out that adding the ability to allocate via brk
isn't hard. The code already contains the logic to recycle address
ranges, in order to reduce mmap system call overhead. All of the
locking infrastructure is also already in place. The only necessary
modifications are 1) explicit use of brk until all data segment space
is consumed, 2) special case code for addresses in the brk range, so
that madvise() is used instead of munmap(), and 3) preferential re-
use of the brk address space over mmap'ed memory.
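The three modifications above could be sketched roughly as follows.
This is an illustrative outline only, not the actual patch; the function
name chunk_alloc() and the fallback order are my shorthand for the
strategy described (brk first, mmap once the data segment is exhausted):

```c
/* Hypothetical sketch of the brk-first allocation strategy described
 * above.  chunk_alloc() is an invented name; the real allocator also
 * recycles address ranges and handles the madvise()/munmap() special
 * cases, which are omitted here. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

static void *
chunk_alloc(size_t size)
{
	/* 1) Prefer the data segment: try to extend it with sbrk(). */
	void *ret = sbrk(size);
	if (ret != (void *)-1)
		return (ret);

	/* 2) Data segment exhausted (MAXDSIZ/RLIMIT_DATA reached):
	 *    fall back to anonymous mmap(). */
	ret = mmap(NULL, size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return (ret == MAP_FAILED ? NULL : ret);
}
```

Freeing would then check whether an address falls below the brk limit
and use madvise() there, reserving munmap() for mmap'ed chunks.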
Do you agree that there is no need for using brk on 64-bit systems?
> I looked into the issues and limitations with phkmalloc several months
> ago and concluded that simply adopting ptmalloc2 (the linux malloc)
> was the better approach--notably it is willing to draw from both brk
> and mmap, and it also implements per-thread arenas.
ptmalloc takes a different approach to per-thread arenas that has
been shown in multiple papers to not scale as well as the approach I
took. The difference isn't significant until you get to 8+ CPUs, but
we already have systems running with enough CPUs that this is an issue.
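For illustration, the arena approach amounts to mapping each thread to
one of a fixed set of arenas, each with its own lock, so that threads
on different arenas never contend. The names below (arena_t, NARENAS)
loosely follow the patch's terminology, but the hashing scheme shown is
only a sketch, not the actual selection logic:

```c
/* Minimal sketch of per-thread arena selection.  Each arena has its
 * own lock; threads hash to arenas, so with enough arenas per CPU,
 * lock contention between threads becomes rare. */
#include <pthread.h>
#include <stdint.h>

#define NARENAS 4

typedef struct {
	pthread_mutex_t lock;
	/* ... per-arena allocation state ... */
} arena_t;

static arena_t arenas[NARENAS];

static arena_t *
arena_choose(void)
{
	/* Map the calling thread to one arena.  Distinct threads tend
	 * to land on distinct arenas, limiting lock contention. */
	uintptr_t tid = (uintptr_t)pthread_self();
	return (&arenas[(tid >> 4) % NARENAS]);
}
```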
> There is also cause for concern about your "cache-line" business.
> Simply on the face of it there is the problem that the scheduler does
> not do a good job of pinning threads to individual CPUs. The threads
> are already bounding from cpu to cpu and thrashing (really thrashing)
> each CPU cache along the way.
>
> Second, you've forgotten that there is a layer of indirection between
> your address space and the cache: the mapping of logical pages (what
> you can see in userspace) to physical pages (the addresses of which
> actually matter for the purposes of the cache). I don't recall
> off-hand whether or not the L1 cache on i386 is based on tags of the
> virtual addresses, but I am certain that the L2 and L3 caches tag the
> physical addresses, not the virtual addresses.
>
> This means that your careful address selection based on cache-lines
> will only work out if it is done in the vm codepath: remember the
> mapping of physical addresses to the virtual addresses that come back
> from mmap can be delayed arbitrarily long into the future depending on
> when the program actually goes to touch that memory.
>
> Furthermore, the answer may vary depending on the architecture or even
> the processor version.
I don't think you understand the intent for the "cache-line
business". There is only one intention: avoid storing data
structures in the same cache line if they are likely to be accessed
simultaneously by multiple threads. For example, if two independent
structures are stored right next to each other, although they do not
require synchronization protection from each other, the hardware will
send a cache line invalidation message to other CPUs every time
anything in the cache line is modified. This means horrible cache
performance if the data are modified often.
As a particular example, there are per-arena data structures that are
modified quite often (arena_t). If two arenas were right next to
each other, then they could share a cache line, and performance would
potentially be severely impacted.
|---arena_t---|---arena_t---|
| | | | | | | |
^^^
BAD!
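One standard way to express this intent in code is to pad each
hot structure out to a whole number of cache lines, so that two
adjacent instances can never share a line. The sketch below is
illustrative, not the patch's code; CACHELINE is an assumed size and
arena_stats_t is an invented stand-in for the frequently-modified
per-arena state:

```c
/* Pad a frequently-modified structure to a cache-line multiple so
 * that two adjacent instances never share a line.  CACHELINE is an
 * assumed value; real line sizes vary by processor. */
#include <stddef.h>

#define CACHELINE 64

typedef struct {
	unsigned long nmalloc;	/* frequently modified counter */
	/* ... other hot fields ... */
} arena_stats_t;

/* Round the structure up to a whole number of cache lines. */
typedef struct {
	arena_stats_t stats;
	char pad[CACHELINE - (sizeof(arena_stats_t) % CACHELINE)];
} arena_padded_t;
```

With this layout, an array of arena_padded_t places each element on
its own cache line(s), so modifying one arena's counters does not
invalidate a line holding another arena's data.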
Thanks,
Jason