New libc malloc patch
Jason Evans
jasone at canonware.com
Tue Nov 29 21:08:55 GMT 2005
Jon,
Thanks for your comments. Fortunately, I don't think things are
quite as hopeless as you think, though you may be right that some
adjustments are necessary. Specific replies follow.
On Nov 29, 2005, at 12:06 PM, Jon Dama wrote:
> There exists a problem right now--localized to i386 and any other arch
> based on 32-bit pointers: address space is simply too scarce.
>
> Your decision to switch to using mmap as the exclusive source of malloc
> buckets is admirable for its modernity, but it simply cannot stand
> unless someone steps up to change the way mmap and brk interact within
> the kernel.
>
> The trouble arises from the need to set MAXDSIZ and the resulting
> effect it has in determining the start of the mmap region--which, I
> might add, is the location where the shared library loader is placed.
> This effectively (and explicitly) sets the limit for how large of a
> contiguous region can be allocated with brk.
>
> What you've done by switching the system malloc to exclusively using
> mmap is induced a lot of motivation on the part of the sysadmin to push
> that brk/mmap boundary down.
>
> This wouldn't be a problem except that you've effectively shot in the
> foot dozens of alternative C malloc implementations, not to mention the
> memory allocator routines used in obscure languages such as Modula-3
> and Haskell that rely on brk-derived buckets.
>
> This isn't playing very nicely!
Where should MAXDSIZ be? Given scarce address space, the best we can
hope for is setting it to the "least bad" default, as measured by the
behavior of the programs we care about. No matter what we do, some
programs lose.
That said, it turns out that adding the ability to allocate via brk
isn't hard. The code already contains the logic to recycle address
ranges, in order to reduce mmap system call overhead. All of the
locking infrastructure is also already in place. The only necessary
modifications are 1) explicit use of brk until all data segment space
is consumed, 2) special case code for addresses in the brk range, so
that madvise() is used instead of munmap(), and 3) preferential re-
use of the brk address space over mmap'ed memory.
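The three modifications above could be sketched roughly as follows.
This is an illustrative outline only, not the actual patch; the function
name chunk_alloc() and the fallback order are my shorthand for the
strategy described (brk first, mmap once the data segment is exhausted):

```c
/* Hypothetical sketch of the brk-first allocation strategy described
 * above.  chunk_alloc() is an invented name; the real allocator also
 * recycles address ranges and handles the madvise()/munmap() special
 * cases, which are omitted here. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

static void *
chunk_alloc(size_t size)
{
	/* 1) Prefer the data segment: try to extend it with sbrk(). */
	void *ret = sbrk(size);
	if (ret != (void *)-1)
		return (ret);

	/* 2) Data segment exhausted (MAXDSIZ/RLIMIT_DATA reached):
	 *    fall back to anonymous mmap(). */
	ret = mmap(NULL, size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return (ret == MAP_FAILED ? NULL : ret);
}
```

Freeing would then check whether an address falls below the brk limit
and use madvise() there, reserving munmap() for mmap'ed chunks.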
Do you agree that there is no need for using brk on 64-bit systems?
> I looked into the issues and limitations with phkmalloc several months
> ago and concluded that simply adopting ptmalloc2 (the linux malloc)
> was the better approach--notably it is willing to draw from both brk
> and mmap, and it also implements per-thread arenas.
ptmalloc takes a different approach to per-thread arenas that has
been shown in multiple papers to not scale as well as the approach I
took. The difference isn't significant until you get to 8+ CPUs, but
we already have systems running with enough CPUs that this is an issue.
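For illustration, the arena approach amounts to mapping each thread to
one of a fixed set of arenas, each with its own lock, so that threads
on different arenas never contend. The names below (arena_t, NARENAS)
loosely follow the patch's terminology, but the hashing scheme shown is
only a sketch, not the actual selection logic:

```c
/* Minimal sketch of per-thread arena selection.  Each arena has its
 * own lock; threads hash to arenas, so with enough arenas per CPU,
 * lock contention between threads becomes rare. */
#include <pthread.h>
#include <stdint.h>

#define NARENAS 4

typedef struct {
	pthread_mutex_t lock;
	/* ... per-arena allocation state ... */
} arena_t;

static arena_t arenas[NARENAS];

static arena_t *
arena_choose(void)
{
	/* Map the calling thread to one arena.  Distinct threads tend
	 * to land on distinct arenas, limiting lock contention. */
	uintptr_t tid = (uintptr_t)pthread_self();
	return (&arenas[(tid >> 4) % NARENAS]);
}
```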
> There is also cause for concern about your "cache-line" business.
> Simply on the face of it there is the problem that the scheduler does
> not do a good job of pinning threads to individual CPUs. The threads
> are already bounding from cpu to cpu and thrashing (really thrashing)
> each CPU cache along the way.
>
> Second, you've forgotten that there is a layer of indirection between
> your address space and the cache: the mapping of logical pages (what
> you can see in userspace) to physical pages (the addresses of which
> actually matter for the purposes of the cache). I don't recall
> off-hand whether or not the L1 cache on i386 is based on tags of the
> virtual addresses, but I am certain that the L2 and L3 caches tag the
> physical addresses, not the virtual addresses.
>
> This means that your careful address selection based on cache-lines
> will only work out if it is done in the vm codepath: remember the
> mapping of physical addresses to the virtual addresses that come back
> from mmap can be delayed arbitrarily long into the future depending on
> when the program actually goes to touch that memory.
>
> Furthermore, the answer may vary depending on the architecture or even
> the processor version.
I don't think you understand the intent for the "cache-line
business". There is only one intention: avoid storing data
structures in the same cache line if they are likely to be accessed
simultaneously by multiple threads. For example, if two independent
structures are stored right next to each other, although they do not
require synchronization protection from each other, the hardware will
send a cache line invalidation message to other CPUs every time
anything in the cache line is modified. This means horrible cache
performance if the data are modified often.
As a particular example, there are per-arena data structures that are
modified quite often (arena_t). If two arenas were right next to
each other, then they could share a cache line, and performance would
potentially be severely impacted.
|---arena_t---|---arena_t---|
| | | | | | | |
^^^
BAD!
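One standard way to express this intent in code is to pad each
hot structure out to a whole number of cache lines, so that two
adjacent instances can never share a line. The sketch below is
illustrative, not the patch's code; CACHELINE is an assumed size and
arena_stats_t is an invented stand-in for the frequently-modified
per-arena state:

```c
/* Pad a frequently-modified structure to a cache-line multiple so
 * that two adjacent instances never share a line.  CACHELINE is an
 * assumed value; real line sizes vary by processor. */
#include <stddef.h>

#define CACHELINE 64

typedef struct {
	unsigned long nmalloc;	/* frequently modified counter */
	/* ... other hot fields ... */
} arena_stats_t;

/* Round the structure up to a whole number of cache lines. */
typedef struct {
	arena_stats_t stats;
	char pad[CACHELINE - (sizeof(arena_stats_t) % CACHELINE)];
} arena_padded_t;
```

With this layout, an array of arena_padded_t places each element on
its own cache line(s), so modifying one arena's counters does not
invalidate a line holding another arena's data.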
Thanks,
Jason