New jemalloc patch (was Re: KDE 3.5.0 seems much chubbier than 3.4.2)

Tue Mar 7 06:52:01 UTC 2006

Daniel O'Connor wrote:
> On Friday 17 February 2006 12:31, Jason Evans wrote:
>>Can you tell me which programs are particularly bad, and if you are
>>using them in any particular ways that are important to reproducing the
>>high memory usage?  I don't generally use KDE, so any details you
>>provide are likely to help.
> 
> Hmm, well it seems XOrg, Amarok, Kopete, Konqueror and KMail show up as big 
> users.
> 
> I have a largish MP3 collection (~7000 songs) loaded into Amarok, I have set 
> the number of history items in Kopete to be 250, Konqueror has about 10 tabs 
> open and KMail is setup to used cached-imap with my email accounts (162 
> folders, ~10000 messages)
> 
> With phkmalloc I am seeing Xorg use 110M/80M (size/res), amarok uses 93M/73M, 
> Kopete uses 82M/56M, Konqueror uses 81M/68M, and KMail uses 68M/53M.
> 
> With jemalloc I saw Xorg use 213M/50M, amarok - 213M/50M, Kopete 119M/7.3M, 
> Konq - 260M/67M (guessed), and KMail - 137M/51M.

[Short summary: jemalloc fragments memory badly under some conditions. 
I've re-architected jemalloc to fix this, and a patch is available for 
the brave (URL below).]

I've looked into this in some detail, and have determined that KDE apps 
exhibit an allocation pattern that causes jemalloc to fragment memory 
somewhat badly.  In brief, KDE apps are allocating moderate numbers of 
objects across a broad spectrum of sizes, throughout the app lifetimes. 
  Evenly mixing objects of many different sizes together tends to make 
later coalescence and re-use of memory more difficult.  The following is 
a summary of konqueror's allocation activity (I opened 8 tabs to various 
websites, clicked around a bit, then exited):

___ Begin malloc statistics ___
Number of CPUs: 1
Number of arenas: 1
Chunk size: 2097152 (2^21)
Quantum size: 16 (2^4)
Max small size: 512 Pointer size: 4
Assertions enabled
Allocated: 4290957716, space used: 48234496
            ^^^ Stats bug
chunks:
        nchunks   highchunks    curchunks
             80           26           23

huge:
       nmalloc      ndalloc    allocated
             2            2            0

arenas[0] statistics:
calls:
        nmalloc      npalloc      ncalloc      ndalloc      nralloc
        7512580            0      4962691     22362166      9906573
bins:
       bin   size nregs run_size    nrequests      nruns highruns curruns
         0 T    2  1906     4096        27416          1        1       1
         1 T    4   953     4096       391123          9        9       9
         2 T    8   988     8192       283348         21       16      14
         3 Q   16  1006    16384      3384383        291      257     188
         4 Q   32  1015    32768      8044531        232      207     153
         5 Q   48  1359    65536       705179         46       45      33
         6 Q   64  1019    65536       597042         31       26      22
         7 Q   80  1634   131072       417756         12       12       9
         8 Q   96  1362   131072       370886          3        3       3
         9 Q  112  1167   131072       368970          5        5       4
        10 Q  128  1021   131072       387485          3        3       3
        11 Q  144  1818   262144       339876          1        1       1
        12 Q  160  1636   262144       334701          1        1       1
        13 Q  176  1487   262144       335299          2        2       2
        14 Q  192  1363   262144       332837          1        1       1
        15 Q  208  1258   262144       351467          2        2       2
        16 Q  224  1169   262144       345241          1        1       1
        17 Q  240  1091   262144       332729          1        1       1
        18 Q  256  1022   262144       359467          1        1       1
        19 Q  272  1926   524288       331321          1        1       1
        20 Q  288  1819   524288       330817          1        1       1
        21 Q  304  1723   524288       331117          1        1       1
        22 Q  320  1637   524288       328254          1        1       1
        23 Q  336  1559   524288       320970          1        1       1
        24 Q  352  1488   524288       308898          1        1       1
        25 Q  368  1423   524288       254787          1        1       1
        26 Q  384  1364   524288       110447          1        1       1
        27 Q  400  1310   524288       112540          1        1       1
        28 Q  416  1259   524288       109898          1        1       1
        29 Q  432  1212   524288       109902          1        1       1
        30 Q  448  1169   524288       110175          1        1       1
        31 Q  464  1129   524288       109667          1        1       1
        32 Q  480  1091   524288       110484          1        1       1
        33 Q  496  1056   524288       110610          1        1       1
        34 Q  512  1023   524288       111992          1        1       1
        35 P 1024  1023  1048576      1406248          1        1       1
        36 P 2048   511  1048576        24986          1        1       1
--- End malloc statistics ---

The 'nrequests' column indicates the total number of requests for each 
size class.  The somewhat even distribution from 96..512 bytes (actually 
..~750 bytes, as discovered in other tests) is rather unusual.  jemalloc 
simply does not deal well with this.

This, in combination with some interesting ideas I came across while 
writing a malloc paper for BSDcan 2006, caused me to prototype a very 
different architecture that what jemalloc currently uses.  The prototype 
looked promising, so I've polished it up.  In short, this new jemalloc 
has the following characteristics:

* Still uses chunks, managed by multiple arenas.  Thus, the SMP 
scalability should be unchanged.

* Carves chunks into page "runs" using a power-of-two buddy allocation 
scheme.

* Allocates only one size class from each run, and uses a bitmap to 
track which regions are in use, so that regions can be tightly packed.

* Auto-sizes runs so that the proportion of data structure overhead is 
quite low, even for the largest sub-pagesize size classes (see 
'run_size' column in above stats output).

* Uses power-of-two size classes below the quantum (quantum is 16 bytes 
for most architectures), quantum-spaced size classes up to a threshold 
(512 bytes by default), and power-of-two size classes above the 
threshold (up to 1/2 page).  Runs are used directly for allocations that 
are larger than 1/2 page, but no larger than 1/2 chunk.  Above 1/2 
chunk, a multiple of the chunk size is used.

I've been running variants of this allocator for over a week now, and it 
  appears to be pretty solid.  Its speed appears to be good.  Memory 
usage is much improved, with one exception: small apps tend to fault in 
a few more pages than before, since even a single allocation of a size 
class causes a page to be faulted in.  This amounts to a bounded 
constant overhead that goes away as an application's memory usage 
increases to fill those pages.

A patch is available at:

http://people.freebsd.org/~jasone/jemalloc/patches/jemalloc_20060306a.diff

See the man page in the patch for runtime configuration options.

Here's the C file:

http://people.freebsd.org/~jasone/jemalloc/patches/jemalloc_20060306a.c

I'm primarily interested in feedback regarding the following:

* Stability
* Speed
* Memory usage

Thanks,
Jason