Re: Stressing malloc(9)

From: Alan Somers <asomers_at_freebsd.org>
Date: Sat, 20 Apr 2024 17:23:41 UTC

On Sat, Apr 20, 2024 at 9:07 AM Mark Johnston <markj@freebsd.org> wrote:
>
> On Fri, Apr 19, 2024 at 04:23:51PM -0600, Alan Somers wrote:
> > TLDR;
> > How can I create a workload that causes malloc(9)'s performance to plummet?
> >
> > Background:
> > I recently witnessed a performance problem on a production server.
> > Overall throughput dropped by over 30x.  dtrace showed that 60% of the
> > CPU time was dominated by lock_delay as called by three functions:
> > printf (via ctl_worker_thread), g_eli_alloc_data, and
> > g_eli_write_done.  One thing those three have in common is that they
> > all use malloc(9).  Fixing the problem was as simple as telling CTL to
> > stop printing so many warnings, by tuning
> > kern.cam.ctl.time_io_secs=100000.
> >
> > But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
> > in lock_delay via g_eli_alloc_data.  So I believe that malloc is
> > limiting geli's performance.  I would like to try replacing it with
> > uma(9).
>
> What is the size of the allocations that g_eli_alloc_data() is doing?
> malloc() is a pretty thin layer over UMA for allocations <= 64KB.
> Larger allocations are handled by a different path (malloc_large())
> which goes directly to the kmem_* allocator functions.  Those functions
> are very expensive: they're serialized by global locks and need to
> update the pmap (and perform TLB shootdowns when memory is freed).
> They're not meant to be used at a high rate.

In my benchmarks so far, 512B.  In the real application the size is
mostly between 4k and 16k, and it's always a multiple of 4k.  But it's
sometimes large enough to use malloc_large, and it's those
malloc_large calls that account for the majority of the time spent in
g_eli_alloc_data.  lockstat shows that malloc_large, as called by
g_eli_alloc_data, sometimes blocks for multiple ms.

But oddly, if I change the parameters so that g_eli_alloc_data
allocates 128kB, I still don't see malloc_large getting called, and
both dtrace and vmstat show that malloc is mostly operating on 512B
allocations.  Yet dtrace does confirm that g_eli_alloc_data is being
called with 128kB arguments.  Maybe something is getting inlined?  I
don't understand how this is happening.  I could probably figure it
out if I recompile with some extra SDT probes, though.
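
To spell out what I expected to see, here is a small userland model of
the dispatch rule as you describe it.  It is only a sketch: the 64kB
cutoff is taken from your message rather than read out of the running
kernel, and predict_alloc_path() and the enum are hypothetical names,
not anything in the tree.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: the small/large cutoff is 64 KiB, as stated in the quoted
 * message.  The limit on a given kernel may differ.
 */
#define SMALL_ALLOC_MAX ((size_t)64 * 1024)

/* Hypothetical labels for the two paths discussed in this thread. */
enum alloc_path {
	PATH_UMA,	/* served from a UMA-backed malloc bucket */
	PATH_LARGE	/* handed to malloc_large() and the kmem_* functions */
};

/*
 * Hypothetical helper: predict which path a request of `size` bytes
 * would take under the rule above.
 */
enum alloc_path
predict_alloc_path(size_t size)
{
	assert(size > 0);
	return (size <= SMALL_ALLOC_MAX ? PATH_UMA : PATH_LARGE);
}
```

By that rule a 512B request stays on the UMA side and a 128kB request
should land in malloc_large, which is why the traces above confuse me.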

>
> My first guess would be that your production workload was hitting this
> path, and your benchmarks are not.  If you have stack traces or lock
> names from DTrace, that would help validate this theory, in which case
> using UMA to cache buffers would be a reasonable solution.

Would that require creating an extra UMA zone for every possible geli
allocation size above 64kB?
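
If not, one alternative I can imagine is rounding each request up to a
small set of power-of-two size classes and keeping one zone per class.
The userland sketch below only shows the size-to-class arithmetic.  The
128kB and 1MB bounds and the function names are made up for
illustration, and whether wasting up to half of each buffer is
acceptable is an open question.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bounds, chosen only for illustration. */
#define CLASS_MIN ((size_t)128 * 1024)	/* capacity of class 0 */
#define CLASS_MAX ((size_t)1024 * 1024)	/* largest size worth caching */

/*
 * Hypothetical helper: map a request size to a size-class index.
 * Class i holds buffers of CLASS_MIN << i bytes.  Returns -1 when the
 * request is too large to cache and should fall back to the allocator.
 */
int
size_to_class(size_t size)
{
	size_t cap = CLASS_MIN;
	int idx = 0;

	assert(size > 0);
	if (size > CLASS_MAX)
		return (-1);
	/* Double the capacity until the request fits. */
	while (cap < size) {
		cap <<= 1;
		idx++;
	}
	return (idx);
}

/* Capacity in bytes of the buffers held by class `idx`. */
size_t
class_capacity(int idx)
{
	assert(idx >= 0);
	return (CLASS_MIN << idx);
}
```

With these bounds there would be four classes (128kB, 256kB, 512kB and
1MB), so four zones rather than one per 4k multiple.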

>
> > But on a non-production server, none of my benchmark workloads causes
> > g_eli_alloc_data to break a sweat.  I can't get its CPU consumption to
> > rise higher than 0.5%.  And that's using the smallest sector size and
> > block size that I can.
> >
> > So my question is: does anybody have a program that can really stress
> > malloc(9)?  I'd like to run it in parallel with my geli benchmarks to
> > see how much it interferes.
> >
> > -Alan
> >