Memory allocation performance/statistics patches
Robert Watson
rwatson at FreeBSD.org
Mon Apr 25 10:10:05 PDT 2005
On Mon, 25 Apr 2005, Robert Watson wrote:
> I now have updated versions of these patches, which correct some
> inconsistencies in approach (universal use of curcpu now, for example),
> remove some debugging code, etc. I've received relatively little
> performance feedback on them, and would appreciate it if I could get
> some. :-) Especially as to whether these impact disk I/O related
> workloads, useful macrobenchmarks, etc. The latest patch is at:
>
>
> http://www.watson.org/~robert/freebsd/netperf/20050425-uma-mbuf-malloc-critical.diff
FYI: For those set up to track Perforce, you can find the contents of this
patch in:
//depot/user/rwatson/percpu/...
In addition, that branch also contains diagnostic micro-benchmarks in the
kernel to measure the cost of various synchronization operations, memory
allocation operations, etc., which can be queried using "sysctl test".
Robert N M Watson
>
> The changes in the following files in the combined patch are intended to be
> broken out into separate patches, as desired, as follows:
>
> kern_malloc.c malloc.diff
> kern_mbuf.c mbuf.diff
> uipc_mbuf.c mbuf.diff
> uipc_syscalls.c mbuf.diff
> malloc.h malloc.diff
> mbuf.h mbuf.diff
> pcpu.h malloc.diff, mbuf.diff, uma.diff
> uma_core.c uma.diff
> uma_int.h uma.diff
>
> I.e., the pcpu.h changes are a dependency for all of the remaining changes.
> As before, I'm interested in both the impact of individual patches, and the
> net effect of the total change associated with all patches applied.
>
> Because this diff was generated by p4, patch may need some help in
> identifying the targets of each part of the diff.
>
> Robert N M Watson
>
> On Sun, 17 Apr 2005, Robert Watson wrote:
>
>>
>> Attached please find three patches:
>>
>> (1) uma.diff, which modifies the UMA slab allocator to use critical
>> sections instead of mutexes to protect per-CPU caches.
>>
>> (2) malloc.diff, which modifies the malloc memory allocator to use
>> critical sections and per-CPU data instead of mutexes to store
>> per-malloc-type statistics, coalescing for the purposes of the sysctl
>> used to generate vmstat -m output.
>>
>> (3) mbuf.diff, which modifies the mbuf allocator to use per-CPU data and
>> critical sections for statistics, instead of synchronization-free
>> statistics which could result in substantial inconsistency on SMP
>> systems.
>>
>> These changes are facilitated by John Baldwin's recent re-introduction of
>> critical section optimizations that permit critical sections to be
>> implemented "in software", rather than using the hardware interrupt disable
>> mechanism, which is quite expensive on modern processors (especially Xeon
>> P4 CPUs). While not identical, this is similar to the softspl behavior in
>> 4.x, and Linux's preemption disable mechanisms (and various other post-Vax
>> systems :-)).
>>
>> The reason this is interesting is that it allows synchronization of per-CPU
>> data to be performed at a much lower cost than previously, and consistently
>> across UP and SMP systems. Prior to these changes, the use of critical
>> sections and per-CPU data as an alternative to mutexes would lead to an
>> improvement on SMP, but not on UP. So, that said, here's what I'd like us
>> to look at:
>>
>> - Patches (1) and (2) are intended to improve performance by reducing the
>> overhead of maintaining cache consistency and statistics for UMA and
>> malloc(9), and may universally impact performance (in a small way) due
>> to the breadth of their use through the kernel.
>>
>> - Patch (3) is intended to restore consistency to statistics in the
>> presence of SMP and preemption, at the possible cost of some
>> performance.
>>
>> I'd like to confirm that for the first two patches, for interesting
>> workloads, performance generally improves, and that stability doesn't
>> degrade. For the third patch, I'd like to quantify the cost of the
>> changes for interesting workloads, and likewise confirm no loss of
>> stability.
>>
>> Because these will have a relatively small impact, a fair amount of caution
>> is required in testing. We may be talking about a percent or two, maybe
>> four, difference in benchmark performance, and many benchmarks have a
>> higher variance than that.
>>
>> A couple of observations for those interested:
>>
>> - The INVARIANTS panic with UMA seen in some earlier patch versions is
>> believed to be corrected.
>>
>> - Right now, because I use arrays of foo[MAXCPUS], I'm concerned that
>> different CPUs will be writing to the same cache line as they're
>> adjacent in memory. Moving to per-CPU chunks of memory to hold this
>> stuff is desirable, but I think first we need to identify a model by
>> which to do that cleanly. I'm not currently enamored of the 'struct
>> pcpu' model, since it makes us very sensitive to ABI changes, as well as
>> not offering a model by which modules can register new per-cpu data
>> cleanly. I'm also inconsistent about how I dereference into the arrays,
>> and intend to move to using 'curcpu' throughout.
>>
>> - Because mutexes are no longer used in UMA, and not for the others
>> either, stats read across different CPUs that are coalesced may be
>> slightly inconsistent. I'm not all that concerned about it, but it's
>> worth thinking on.
>>
>> - Malloc stats for realloc() are still broken if you apply this patch.
>>
>> - High watermarks are no longer maintained for malloc since they require a
>> global notion of "high" that is tracked continuously (i.e., at each
>> change), and there's no longer a global view except when the observer
>> kicks in (sysctl). You can imagine various models to restore some
>> notion of a high watermark, but I'm not currently sure which is the
>> best. The high watermark notion is desirable though.
>>
>> So this is a request for:
>>
>> (1) Stability testing of these patches. Put them on a machine, make them
>> hurt. If things go South, try applying the patches one by one until
>> it's clear which is the source.
>>
>> (2) Performance testing of these patches, subject to the measurement
>>     challenges described above. If you are interested, please test
>>     each patch separately to evaluate its impact on your system, then
>>     apply all three together and see how the effects combine. You may
>>     find that the cost of the mbuf allocator patch outweighs the
>>     benefits of the other two patches; if so, that is interesting and
>>     something to work on!
>>
>> I've done some micro-benchmarking using tools like netblast,
>> syscall_timing, etc, but I'm interested particularly in the impact on
>> macrobenchmarks.
>>
>> Thanks!
>>
>> Robert N M Watson
> _______________________________________________
> freebsd-performance at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-performance
> To unsubscribe, send any mail to
> "freebsd-performance-unsubscribe at freebsd.org"
>