New malloc ready, take 42
Jason Evans
jasone at freebsd.org
Fri Dec 23 12:07:39 PST 2005
On Dec 23, 2005, at 2:28 AM, David Xu wrote:
> I know what '>' does in phkmalloc. I found 'Q', he replaced '>' with
> 'Q', this is really strange to me. ;-)
Actually, the closest analog to phkmalloc's '>' and '<' are 'C' and
'c'. However, they don't have quite the same meaning, so I thought
that changing the designators was appropriate. Here's a snippet from
the man page about some of the performance tuning flags supported by
jemalloc:
C    Increase/decrease the size of the cache by a factor of two. The
     default cache size is 256 objects for each arena. This option
     can be specified multiple times.

N    Increase/decrease the number of arenas by a factor of two. The
     default number of arenas is twice the number of CPUs, or one if
     there is a single CPU. This option can be specified multiple
     times.

Q    Increase/decrease the size of the allocation quantum by a factor
     of two. The default quantum is the minimum allowed by the
     architecture (typically 8 or 16 bytes). This option can be
     specified multiple times.
The implications of each of these flags are described in some detail
later in the man page:
This allocator uses multiple arenas in order to reduce lock contention
for threaded programs on multi-processor systems. This works well with
regard to threading scalability, but incurs some costs. There is a
small fixed per-arena overhead, and additionally, arenas manage memory
completely independently of each other, which means a small fixed
increase in overall memory fragmentation. These overheads aren't
generally an issue, given the number of arenas normally used. Note
that using substantially more arenas than the default is not likely to
improve performance, mainly due to reduced cache performance. However,
it may make sense to reduce the number of arenas if an application
does not make much use of the allocation functions.
This allocator uses a novel approach to object caching. For objects
below a size threshold (use the ``P'' option to discover the
threshold), full deallocation and attempted coalescence with adjacent
memory regions are delayed. This is so that if the application
requests an allocation of that size soon thereafter, the request can
be met much more quickly. Most applications heavily use a small number
of object sizes, so this caching has the potential to have a large
positive performance impact. However, the effectiveness of the cache
depends on the cache being large enough to absorb typical fluctuations
in the number of allocated objects. If an application routinely
fluctuates by thousands of objects, then it may make sense to increase
the size of the cache. Conversely, if an application's memory usage
fluctuates very little, it may make sense to reduce the size of the
cache, so that unused regions can be coalesced sooner.
This allocator is very aggressive about tightly packing objects in
memory, even for objects much larger than the system page size. For
programs that allocate objects larger than half the system page size,
this has the potential to reduce memory footprint in comparison to
other allocators. However, it has some side effects that are important
to keep in mind. First, even multi-page objects can start at
non-page-aligned addresses, since the implementation only guarantees
quantum alignment. Second, this tight packing of objects can cause
objects to share L1 cache lines, which can be a performance issue for
multi-threaded applications. There are two ways to approach these
issues. First, posix_memalign() provides the ability to align
allocations as needed. By aligning an allocation to at least the L1
cache line size, and padding the allocation request by one L1 cache
line unit, the programmer can rest assured that no cache line sharing
will occur for the object. Second, the ``Q'' option can be used to
force all allocations to be aligned with the L1 cache lines. This
approach should be used with care though, because although easy to
implement, it means that all allocations must be at least as large as
the quantum, which can cause severe internal fragmentation if the
application allocates many small objects.
Jason
More information about the freebsd-current mailing list