Reason for doing malloc / bzero over calloc (performance)?

Fri Jun 15 01:10:53 UTC 2007

On Thu, 14 Jun 2007, Matthew Dillon wrote:

>    I'm going to throw a wrench in the works, because it all gets turned
>    around the moment you find yourself in a SMP environment where several
>    threads are running on different cpus at the same time, using the
>    same shared VM space.
>
>    The moment you have a situation like that where you are futzing with
>    the page tables, i.e. using mmap() for demand-zero and munmap() to
>    free, the operation becomes extremely expensive versus anything
>    else because any update to the page table (specifically any removal
>    of page table entries from the page table) requires a SMP synchronization
>    to occur between all the cpus actively sharing that VM space, and
>    that's on top of the overhead of taking the page fault(s).
>
>    This is true of any memory mapping the kernel has to do in kernel
>    virtual memory (must be synchronized with ALL cpus) and any mapping
>    the kernel does on behalf of userland for user memory (must be
>    synchronized with any cpus actively using that VM space, i.e. threaded
>    user programs).  The synchronization is required to properly invalidate
>    stale mappings on other cpus and it must be done synchronously due
>    to bugs in Intel/AMD related to changing page table entries on one
>    cpu when instructions are executing using that memory on another cpu.
>    There is no way to avoid it without tripping up on the Intel/AMD hardware
>    bugs.
>
>    From this point of view it is much, much better to bzero() memory that
>    is already mapped than it is to map/unmap new memory.  I recently
>    audited DragonFly and found an insane number of IPIs flying about due
>    to PAGE_SIZE'd kernel mallocs using the VM trick via kernel_map &
>    kmem_alloc().  They all went away when I made the kernel malloc use
>    the slab cache for allocations up to and including PAGE_SIZE*2 bytes.
>
>    Fun, eh?
>
> 					-Matt
> 					Matthew Dillon
> 					<dillon at backplane.com>

I have no intention of calling malloc/calloc, then free, and then repeating the same cycle. It's better to just reuse the memory already allocated whenever its size permits.
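
A minimal sketch of that reuse pattern, assuming a single-threaded caller. The struct and the function names (scratch, scratch_get_zeroed, scratch_release) are hypothetical, made up for illustration, and memset() stands in for bzero() since it is the portable spelling of the same operation. The buffer only grows when a request does not fit, so in the common case the only cost is zeroing memory that is already mapped:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: a scratch buffer allocated once and reused. */
struct scratch {
    void  *mem;   /* backing storage, NULL until first use */
    size_t cap;   /* number of bytes currently allocated */
};

/*
 * Return a pointer to at least `len` zeroed bytes, or NULL on failure.
 * The existing allocation is reused whenever it is large enough, so no
 * free()/malloc() round trip happens on the common path.
 */
static void *
scratch_get_zeroed(struct scratch *s, size_t len)
{
    assert(s != NULL);

    if (len > s->cap) {
        /* Grow only when the request does not fit. */
        void *p = realloc(s->mem, len);
        if (p == NULL)
            return NULL;      /* old buffer is still valid and owned by s */
        s->mem = p;
        s->cap = len;
    }

    /* Zero just the part the caller asked for (same job as bzero()). */
    if (len > 0)
        memset(s->mem, 0, len);

    return s->mem;
}

/* Give the memory back once, when the buffer is no longer needed. */
static void
scratch_release(struct scratch *s)
{
    assert(s != NULL);

    free(s->mem);
    s->mem = NULL;
    s->cap = 0;
}
```

Usage would be to start from `struct scratch s = { NULL, 0 };`, call scratch_get_zeroed(&s, n) each time a cleared buffer is needed, and call scratch_release(&s) once at the end. Whether this actually beats calloc() depends on the allocator and the sizes involved, so it is worth measuring rather than assuming.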

I wasn't thinking about it that closely though (ISA/hardware config versus OS implementation), but I had my suspicions since the AMD64 architecture is very different from the PowerPC architecture, in terms of word size, synchronization schemes, instruction count, etc.

Interesting insight though. Thanks :).

-Garrett