Reason for doing malloc / bzero over calloc (performance)?
youshi10 at u.washington.edu
Fri Jun 15 01:10:53 UTC 2007
On Thu, 14 Jun 2007, Matthew Dillon wrote:
> I'm going to throw a wrench in the works, because it all gets turned
> around the moment you find yourself in a SMP environment where several
> threads are running on different cpus at the same time, using the
> same shared VM space.
>
> The moment you have a situation like that where you are futzing with
> the page tables, i.e. using mmap() for demand-zero and munmap() to
> free, the operation becomes extremely expensive versus anything
> else because any update to the page table (specifically any removal
> of page table entries from the page table) requires a SMP synchronization
> to occur between all the cpu's actively sharing that VM space, and
> that's on top of the overhead of taking the page fault(s).
>
> This is true of any memory mapping the kernel has to do in kernel
> virtual memory (must be synchronized with ALL cpus) and any mapping
> the kernel does on behalf of userland for user memory (must be
> synchronized with any cpu's actively using that VM space, i.e. threaded
> user programs). The synchronization is required to properly invalidate
> stale mappings on other cpus and it must be done synchronously due
> to bugs in Intel/AMD related to changing page table entries on one
> cpu when instructions are executing using that memory on another cpu.
> There is no way to avoid it without tripping up on the Intel/AMD hardware
> bugs.
>
> From this point of view it is much, much better to bzero() memory that
> is already mapped than it is to map/unmap new memory. I recently
> audited DragonFly and found an insane number of IPIs flying about due
> to PAGE_SIZE'd kernel mallocs using the VM trick via kernel_map &
> kmem_alloc(). They all went away when I made the kernel malloc use
> the slab cache for allocations up to and including PAGE_SIZE*2 bytes.
>
> Fun, eh?
>
> -Matt
> Matthew Dillon
> <dillon at backplane.com>
I have no intention of calling malloc/calloc, then free, and then repeating the same cycle. It's better just to reuse the memory already allocated whenever its size permits.
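A minimal sketch of that reuse pattern in C, assuming a single-threaded caller. The struct scratch type and the scratch_get() / scratch_release() names are hypothetical, made up for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reusable buffer: one allocation kept across uses. */
struct scratch {
    void  *data;  /* current allocation, or NULL */
    size_t cap;   /* size of the allocation in bytes */
};

/*
 * Return a zeroed region of at least `need` bytes. The existing
 * allocation is reused when it is large enough; it only grows
 * (via realloc) when the request does not fit.
 */
static void *scratch_get(struct scratch *s, size_t need)
{
    if (need > s->cap) {
        void *p = realloc(s->data, need);
        if (p == NULL)
            return NULL;  /* old buffer is still valid and owned by s */
        s->data = p;
        s->cap = need;
    }
    if (need > 0)
        memset(s->data, 0, need);  /* clear only the part handed out */
    return s->data;
}

/* Free the allocation once, when the buffer is no longer needed. */
static void scratch_release(struct scratch *s)
{
    free(s->data);
    s->data = NULL;
    s->cap = 0;
}
```

Start from `struct scratch s = { NULL, 0 };`, call scratch_get(&s, n) each time a zeroed buffer is needed, and call scratch_release(&s) once at the end.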
I wasn't thinking about it that closely, though (ISA/hardware configuration versus OS implementation), but I had my suspicions, since the AMD64 architecture is very different from the PowerPC architecture in terms of word size, synchronization schemes, instruction count, etc.
Interesting insight though. Thanks :).
-Garrett