Memory allocation performance/statistics patches
Robert Watson
rwatson at FreeBSD.org
Mon Apr 25 10:10:05 PDT 2005
On Mon, 25 Apr 2005, Robert Watson wrote:
> I now have updated versions of these patches, which correct some
> inconsistencies in approach (universal use of curcpu now, for example),
> remove some debugging code, etc. I've received relatively little
> performance feedback on them, and would appreciate it if I could get
> some. :-) Especially as to whether these impact disk I/O related
> workloads, useful macrobenchmarks, etc. The latest patch is at:
>
>
> http://www.watson.org/~robert/freebsd/netperf/20050425-uma-mbuf-malloc-critical.diff
FYI: For those set up to track Perforce, you can find the contents of this
patch in:
//depot/user/rwatson/percpu/...
In addition, that branch also contains diagnostic micro-benchmarks in the
kernel to measure the cost of various synchronization operations, memory
allocation operations, etc., which can be queried using "sysctl test".
Robert N M Watson
>
> The changes in the following files in the combined patch are intended to be
> broken out into separate patches, as desired, as follows:
>
> kern_malloc.c malloc.diff
> kern_mbuf.c mbuf.diff
> uipc_mbuf.c mbuf.diff
> uipc_syscalls.c mbuf.diff
> malloc.h malloc.diff
> mbuf.h mbuf.diff
> pcpu.h malloc.diff, mbuf.diff, uma.diff
> uma_core.c uma.diff
> uma_int.h uma.diff
>
> I.e., the pcpu.h changes are a dependency for all of the remaining changes.
> As before, I'm interested in both the impact of individual patches, and the
> net effect of the total change associated with all patches applied.
>
> Because this diff was generated by p4, patch may need some help in
> identifying the targets of each part of the diff.
>
> Robert N M Watson
>
> On Sun, 17 Apr 2005, Robert Watson wrote:
>
>>
>> Attached please find three patches:
>>
>> (1) uma.diff, which modifies the UMA slab allocator to use critical
>> sections instead of mutexes to protect per-CPU caches.
>>
>> (2) malloc.diff, which modifies the malloc memory allocator to use
>> critical sections and per-CPU data instead of mutexes to store
>> per-malloc-type statistics, coalescing for the purposes of the sysctl
>> used to generate vmstat -m output.
>>
>> (3) mbuf.diff, which modifies the mbuf allocator to use per-CPU data and
>> critical sections for statistics, instead of synchronization-free
>> statistics which could result in substantial inconsistency on SMP
>> systems.
>>
>> These changes are facilitated by John Baldwin's recent re-introduction of
>> critical section optimizations that permit critical sections to be
>> implemented "in software", rather than using the hardware interrupt disable
>> mechanism, which is quite expensive on modern processors (especially Xeon
>> P4 CPUs). While not identical, this is similar to the softspl behavior in
>> 4.x, and Linux's preemption disable mechanisms (and various other post-Vax
>> systems :-)).
>>
>> The reason this is interesting is that it allows synchronization of per-CPU
>> data to be performed at a much lower cost than previously, and consistently
>> across UP and SMP systems. Prior to these changes, the use of critical
>> sections and per-CPU data as an alternative to mutexes would lead to an
>> improvement on SMP, but not on UP. So, that said, here's what I'd like us
>> to look at:
>>
>> - Patches (1) and (2) are intended to improve performance by reducing the
>> overhead of maintaining cache consistency and statistics for UMA and
>> malloc(9), and may universally impact performance (in a small way) due
>> to the breadth of their use through the kernel.
>>
>> - Patch (3) is intended to restore consistency to statistics in the
>> presence of SMP and preemption, at the possible cost of some
>> performance.
>>
>> I'd like to confirm that for the first two patches, for interesting
>> workloads, performance generally improves, and that stability doesn't
>> degrade. For the third patch, I'd like to quantify the cost of the
>> changes for interesting workloads, and likewise confirm no loss of
>> stability.
>>
>> Because these will have a relatively small impact, a fair amount of caution
>> is required in testing. We may be talking about a percent or two, maybe
>> four, difference in benchmark performance, and many benchmarks have a
>> higher variance than that.
>>
>> A couple of observations for those interested:
>>
>> - The INVARIANTS panic with UMA seen in some earlier patch versions is
>> believed to be corrected.
>>
>> - Right now, because I use arrays of foo[MAXCPUS], I'm concerned that
>> different CPUs will be writing to the same cache line as they're
>> adjacent in memory. Moving to per-CPU chunks of memory to hold this
>> stuff is desirable, but I think first we need to identify a model by
>> which to do that cleanly. I'm not currently enamored of the 'struct
>> pcpu' model, since it makes us very sensitive to ABI changes, as well as
>> not offering a model by which modules can register new per-cpu data
>> cleanly. I'm also inconsistent about how I dereference into the arrays,
>> and intend to move to using 'curcpu' throughout.
>>
>> - Because mutexes are no longer used in UMA, and not for the others
>> either, stats read across different CPUs that are coalesced may be
>> slightly inconsistent. I'm not all that concerned about it, but it's
>> worth thinking on.
>>
>> - Malloc stats for realloc() are still broken if you apply this patch.
>>
>> - High watermarks are no longer maintained for malloc since they require a
>> global notion of "high" that is tracked continuously (i.e., at each
>> change), and there's no longer a global view except when the observer
>> kicks in (sysctl). You can imagine various models to restore some
>> notion of a high watermark, but I'm not currently sure which is the
>> best. The high watermark notion is desirable though.
>>
>> So this is a request for:
>>
>> (1) Stability testing of these patches. Put them on a machine, make them
>> hurt. If things go South, try applying the patches one by one until
>> it's clear which is the source.
>>
>> (2) Performance testing of these patches, subject to the measurement
>>     challenges described above. If you are interested, please test
>>     each patch separately to evaluate its impact on your system, then
>>     apply all three together and see how the effects combine. You may
>>     find that the cost of the mbuf allocator patch outweighs the
>>     benefits of the other two patches; if so, that is interesting and
>>     something to work on!
>>
>> I've done some micro-benchmarking using tools like netblast,
>> syscall_timing, etc, but I'm interested particularly in the impact on
>> macrobenchmarks.
>>
>> Thanks!
>>
>> Robert N M Watson
> _______________________________________________
> freebsd-performance at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-performance
> To unsubscribe, send any mail to
> "freebsd-performance-unsubscribe at freebsd.org"
>