Reason for doing malloc / bzero over calloc (performance)?

Fri Jun 15 01:04:34 UTC 2007

    I'm going to throw a wrench in the works, because it all gets turned
    around the moment you find yourself in an SMP environment where several
    threads are running on different cpus at the same time, using the
    same shared VM space.

    The moment you have a situation like that where you are futzing with
    the page tables, i.e. using mmap() for demand-zero and munmap() to
    free, the operation becomes extremely expensive versus anything
    else because any update to the page table (specifically any removal
    of page table entries from the page table) requires an SMP synchronization
    to occur between all the cpus actively sharing that VM space, and
    that's on top of the overhead of taking the page fault(s).

    This is true of any memory mapping the kernel has to do in kernel
    virtual memory (must be synchronized with ALL cpus) and any mapping
    the kernel does on behalf of userland for user memory (must be
    synchronized with any cpus actively using that VM space, i.e. threaded
    user programs).  The synchronization is required to properly invalidate
    stale mappings on other cpus and it must be done synchronously due
    to bugs in Intel/AMD hardware related to changing page table entries on one
    cpu when instructions are executing using that memory on another cpu.
    There is no way to avoid it without tripping up on the Intel/AMD hardware
    bugs.

    From this point of view it is much, much better to bzero() memory that
    is already mapped than it is to map/unmap new memory.  I recently
    audited DragonFly and found an insane number of IPIs flying about due
    to PAGE_SIZE'd kernel mallocs using the VM trick via kernel_map &
    kmem_alloc().  They all went away when I made the kernel malloc use
    the slab cache for allocations up to and including PAGE_SIZE*2 bytes.
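
    A rough userland sketch of the same idea: keep a buffer mapped and
    zero it on each reuse instead of unmapping and remapping it.  This
    is not the DragonFly slab code.  The struct scratch type and its
    functions are hypothetical, memset() stands in for bzero(), and the
    sketch assumes each thread owns its own struct scratch and that
    malloc() serves requests of this size from memory that is already
    mapped, which is typical for page-sized requests but not guaranteed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread scratch buffer that stays mapped between uses. */
struct scratch {
    void   *mem;   /* backing memory, NULL until first use */
    size_t  len;   /* current capacity in bytes */
};

/*
 * Return `len` zeroed bytes, reusing the existing buffer when it is
 * large enough.  The common path only zeroes memory that is already
 * mapped, so no page table entries are added or removed.
 */
static void *scratch_get_zeroed(struct scratch *s, size_t len)
{
    assert(s != NULL && len > 0);
    if (s->mem == NULL || s->len < len) {
        void *n = realloc(s->mem, len);  /* grow only when needed */
        if (n == NULL)
            return NULL;                 /* old buffer is left intact */
        s->mem = n;
        s->len = len;
    }
    memset(s->mem, 0, len);
    return s->mem;
}

/* Release the buffer once the owner is done with it for good. */
static void scratch_release(struct scratch *s)
{
    free(s->mem);
    s->mem = NULL;
    s->len = 0;
}
```

    A caller would start with a struct scratch set to { NULL, 0 }, call
    scratch_get_zeroed() every time it needs a clean buffer, and call
    scratch_release() once at shutdown.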

    Fun, eh?

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>