New malloc ready, take 42

Fri Dec 23 12:07:39 PST 2005

On Dec 23, 2005, at 2:28 AM, David Xu wrote:
> I know what '>' does in phkmalloc. I found 'Q', he replaced '>' with
> 'Q', this is really strange to me. ;-)

Actually, the closest analogs to phkmalloc's '>' and '<' are 'C' and
'c'.  However, they don't have quite the same meaning, so I thought
that changing the designators was appropriate.  Here's a snippet from
the man page about some of the performance tuning flags supported by
jemalloc:

  C Increase/decrease the size of the cache by a factor of two.  The
    default cache size is 256 objects for each arena.  This option
    can be specified multiple times.

  N Increase/decrease the number of arenas by a factor of two.  The
    default number of arenas is twice the number of CPUs, or one if
    there is a single CPU.  This option can be specified multiple
    times.

  Q Increase/decrease the size of the allocation quantum by a factor
    of two.  The default quantum is the minimum allowed by the
    architecture (typically 8 or 16 bytes).  This option can be
    specified multiple times.
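
To make the "factor of two, can be specified multiple times" wording
concrete, here is a minimal sketch of how such an option string could be
folded into tuning values.  It is not the jemalloc source: the struct,
the function name apply_options(), and the rule that an uppercase letter
doubles while its lowercase form halves are all assumptions made for
illustration.  The defaults (256 cached objects, two arenas per CPU or
one on a single CPU, minimum quantum) are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning state; field names are illustrative only. */
struct tuning {
	size_t cache_objs;	/* per-arena cache size, in objects */
	size_t narenas;		/* number of arenas */
	size_t quantum;		/* allocation quantum, in bytes */
};

/* Halve a value, but never let it drop below the given floor. */
static size_t
halve(size_t v, size_t floor)
{
	return (v / 2 >= floor) ? v / 2 : floor;
}

/*
 * Apply an option string such as "CCn" to the documented defaults.
 * Assumption: uppercase doubles, lowercase halves.  Unknown letters
 * are ignored in this sketch.
 */
static struct tuning
apply_options(const char *opts, size_t ncpus, size_t min_quantum)
{
	struct tuning t;

	assert(opts != NULL && ncpus > 0 && min_quantum > 0);
	t.cache_objs = 256;
	t.narenas = (ncpus > 1) ? ncpus * 2 : 1;
	t.quantum = min_quantum;

	for (; *opts != '\0'; opts++) {
		switch (*opts) {
		case 'C': t.cache_objs *= 2; break;
		case 'c': t.cache_objs = halve(t.cache_objs, 1); break;
		case 'N': t.narenas *= 2; break;
		case 'n': t.narenas = halve(t.narenas, 1); break;
		case 'Q': t.quantum *= 2; break;
		/* The quantum cannot go below the architecture minimum. */
		case 'q': t.quantum = halve(t.quantum, min_quantum); break;
		default: break;
		}
	}
	return t;
}
```

Under these assumptions, "CCn" on a 4-CPU machine would give a
1024-object cache and 4 arenas, with the quantum left at its minimum.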

The implications of each of these flags are described in some detail
later in the man page:

  This allocator uses multiple arenas in order to reduce lock contention
  for threaded programs on multi-processor systems.  This works well with
  regard to threading scalability, but incurs some costs.  There is a
  small fixed per-arena overhead, and additionally, arenas manage memory
  completely independently of each other, which means a small fixed
  increase in overall memory fragmentation.  These overheads aren't
  generally an issue, given the number of arenas normally used.  Note
  that using substantially more arenas than the default is not likely to
  improve performance, mainly due to reduced cache performance.  However,
  it may make sense to reduce the number of arenas if an application does
  not make much use of the allocation functions.

  This allocator uses a novel approach to object caching.  For objects
  below a size threshold (use the ``P'' option to discover the
  threshold), full deallocation and attempted coalescence with adjacent
  memory regions are delayed.  This is so that if the application
  requests an allocation of that size soon thereafter, the request can be
  met much more quickly.  Most applications heavily use a small number of
  object sizes, so this caching has the potential to have a large
  positive performance impact.  However, the effectiveness of the cache
  depends on the cache being large enough to absorb typical fluctuations
  in the number of allocated objects.  If an application routinely
  fluctuates by thousands of objects, then it may make sense to increase
  the size of the cache.  Conversely, if an application's memory usage
  fluctuates very little, it may make sense to reduce the size of the
  cache, so that unused regions can be coalesced sooner.
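
The effect of the cache size can be seen in a toy model.  The sketch
below is not how jemalloc implements its cache; it is a hypothetical
fixed-size ring of freed objects for a single size class, with made-up
names (obj_cache, cache_alloc, cache_free).  A free parks the object in
the ring, and only when the ring is full is the oldest entry handed to
the real free().  An allocation takes a parked object if one exists.

```c
#include <assert.h>
#include <stdlib.h>

#define CACHE_SLOTS 4	/* tiny on purpose; the documented default is 256 */

/* Toy cache of freed objects, all assumed to be of one size class. */
struct obj_cache {
	void *slot[CACHE_SLOTS];
	size_t head;	/* index of the oldest cached object */
	size_t count;	/* number of objects currently cached */
	size_t hits;	/* allocations served from the cache */
	size_t flushed;	/* objects passed on to the real free() */
};

/* Serve from the cache when possible, else fall back to malloc(). */
static void *
cache_alloc(struct obj_cache *c, size_t size)
{
	assert(c != NULL);
	if (c->count > 0) {
		/* Reuse the most recently freed object. */
		size_t i = (c->head + c->count - 1) % CACHE_SLOTS;

		c->count--;
		c->hits++;
		return c->slot[i];
	}
	return malloc(size);
}

/* Delay deallocation: park the object, evicting the oldest if full. */
static void
cache_free(struct obj_cache *c, void *p)
{
	assert(c != NULL);
	if (p == NULL)
		return;
	if (c->count == CACHE_SLOTS) {
		free(c->slot[c->head]);
		c->head = (c->head + 1) % CACHE_SLOTS;
		c->count--;
		c->flushed++;
	}
	c->slot[(c->head + c->count) % CACHE_SLOTS] = p;
	c->count++;
}
```

With four slots, freeing six objects in a row pushes two of them through
to free(), so a program whose live object count swings by more than the
cache size loses part of the benefit.  That is the situation in which
raising the cache size is suggested above.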

  This allocator is very aggressive about tightly packing objects in
  memory, even for objects much larger than the system page size.  For
  programs that allocate objects larger than half the system page size,
  this has the potential to reduce memory footprint in comparison to
  other allocators.  However, it has some side effects that are important
  to keep in mind.  First, even multi-page objects can start at
  non-page-aligned addresses, since the implementation only guarantees
  quantum alignment.  Second, this tight packing of objects can cause
  objects to share L1 cache lines, which can be a performance issue for
  multi-threaded applications.  There are two ways to approach these
  issues.  First, posix_memalign() provides the ability to align
  allocations as needed.  By aligning an allocation to at least the L1
  cache line size, and padding the allocation request by one L1 cache
  line unit, the programmer can rest assured that no cache line sharing
  will occur for the object.  Second, the ``Q'' option can be used to
  force all allocations to be aligned with the L1 cache lines.  This
  approach should be used with care though, because although easy to
  implement, it means that all allocations must be at least as large as
  the quantum, which can cause severe internal fragmentation if the
  application allocates many small objects.
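
The posix_memalign() approach might look roughly like the sketch below.
The helper name alloc_unshared() and the 64-byte CACHE_LINE value are
assumptions; the real line size depends on the CPU.  The sketch reads
the padding advice as rounding the request up to a whole number of cache
lines, so that the object both starts on a line boundary and owns its
last line.  Adding one full extra line, as the text literally says, is a
more conservative variant of the same idea.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64	/* assumed L1 line size in bytes; CPU dependent */

/*
 * Allocate size bytes so that the object shares no cache line with
 * any other allocation.  Returns NULL on failure; release with free().
 */
static void *
alloc_unshared(size_t size)
{
	void *p = NULL;
	size_t padded;

	/* Guard against overflow while rounding up. */
	if (size > SIZE_MAX - (CACHE_LINE - 1))
		return NULL;

	/* Round up to a multiple of the line size (a power of two). */
	padded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
	if (padded == 0)
		padded = CACHE_LINE;

	/* posix_memalign() returns 0 on success, an errno value otherwise. */
	if (posix_memalign(&p, CACHE_LINE, padded) != 0)
		return NULL;
	return p;
}
```

This costs extra space only for the objects that need it, whereas
raising the quantum with ``Q'' applies the same rounding to every
allocation in the program.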

Jason