From nobody Sun Apr 21 23:47:41 2024
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@FreeBSD.org
MIME-Version: 1.0
References: <CAOtMX2jeDHS15bGgzD89AOAd1SzS_=FikorkCdv9-eAxCZ2P5w@mail.gmail.com>
 <ZiPaFw0q17RGE7cS@nuc> <CAOtMX2jk6+SvqMP7Cbmdk0KQCFZ34yWuir7n_8ewZYJF2MwPSg@mail.gmail.com>
 <ZiU6IZ29syVsg61p@nuc>
In-Reply-To: <ZiU6IZ29syVsg61p@nuc>
From: Alan Somers <asomers@freebsd.org>
Date: Sun, 21 Apr 2024 17:47:41 -0600
Message-ID: <CAOtMX2j=yaYeE+-fycg2mRRC_Jb9p74cn_dcenhH2xRRxz1shg@mail.gmail.com>
Subject: Re: Stressing malloc(9)
To: Mark Johnston <markj@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sun, Apr 21, 2024 at 10:09 AM Mark Johnston <markj@freebsd.org> wrote:
>
> On Sat, Apr 20, 2024 at 11:23:41AM -0600, Alan Somers wrote:
> > On Sat, Apr 20, 2024 at 9:07 AM Mark Johnston <markj@freebsd.org> wrote:
> > >
> > > On Fri, Apr 19, 2024 at 04:23:51PM -0600, Alan Somers wrote:
> > > > TLDR;
> > > > How can I create a workload that causes malloc(9)'s performance to plummet?
> > > >
> > > > Background:
> > > > I recently witnessed a performance problem on a production server.
> > > > Overall throughput dropped by over 30x.  dtrace showed that 60% of the
> > > > CPU time was dominated by lock_delay as called by three functions:
> > > > printf (via ctl_worker_thread), g_eli_alloc_data, and
> > > > g_eli_write_done.  One thing those three have in common is that they
> > > > all use malloc(9).  Fixing the problem was as simple as telling CTL to
> > > > stop printing so many warnings, by tuning
> > > > kern.cam.ctl.time_io_secs=100000.
> > > >
> > > > But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
> > > > in lock_delay via g_eli_alloc_data.  So I believe that malloc is
> > > > limiting geli's performance.  I would like to try replacing it with
> > > > uma(9).
> > >
> > > What is the size of the allocations that g_eli_alloc_data() is doing?
> > > malloc() is a pretty thin layer over UMA for allocations <= 64KB.
> > > Larger allocations are handled by a different path (malloc_large())
> > > which goes directly to the kmem_* allocator functions.  Those functions
> > > are very expensive: they're serialized by global locks and need to
> > > update the pmap (and perform TLB shootdowns when memory is freed).
> > > They're not meant to be used at a high rate.
> >
> > In my benchmarks so far, 512B.  In the real application the size is
> > mostly between 4k and 16k, and it's always a multiple of 4k. But it's
> > sometimes great enough to use malloc_large, and it's those
> > malloc_large calls that account for the majority of the time spent in
> > g_eli_alloc_data.  lockstat shows that malloc_large, as called by
> > g_eli_alloc_data, sometimes blocks for multiple ms.
> >
> > But oddly, if I change the parameters so that g_eli_alloc_data
> > allocates 128kB, I still don't see malloc_large getting called.  And
> > both dtrace and vmstat show that malloc is mostly operating on 512B
> > allocations.  But dtrace does confirm that g_eli_alloc_data is being
> > called with 128kB arguments.  Maybe something is getting inlined?
>
> malloc_large() is annotated __noinline, for what it's worth.
>
> > I
> > don't understand how this is happening.  I could probably figure it out
> > if I recompile with some extra SDT probes, though.
>
> What is g_eli_alloc_sz on your system?

33kiB.  That's larger than I expected.  When I use a larger blocksize
in my benchmark, I do indeed see malloc_large activity, and 11% of the
CPU is spent in g_eli_alloc_data.

I guess I'll add some UMA zones for this purpose.  I'll try 256k and
512k zones, rounding up allocations as necessary.  Thanks for the tip.
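
Concretely, something like this is what I have in mind (just a sketch,
not a tested patch; the eli_* names are invented and it would still need
to be wired into g_eli_alloc_data()/g_eli_free_data()):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <vm/uma.h>

/*
 * Hypothetical zones for geli data buffers that are too big for the
 * regular malloc(9)/UMA path.
 */
static uma_zone_t eli_zone_256k;
static uma_zone_t eli_zone_512k;

static void
eli_zones_init(void)
{
	eli_zone_256k = uma_zcreate("eli_data_256k", 256 * 1024,
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	eli_zone_512k = uma_zcreate("eli_data_512k", 512 * 1024,
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

static void *
eli_alloc_data(size_t sz)
{
	/* Round the request up to the smallest zone that fits. */
	if (sz <= 256 * 1024)
		return (uma_zalloc(eli_zone_256k, M_WAITOK));
	if (sz <= 512 * 1024)
		return (uma_zalloc(eli_zone_512k, M_WAITOK));
	return (NULL);	/* bigger than anything geli should ask for */
}

static void
eli_free_data(void *p, size_t sz)
{
	if (sz <= 256 * 1024)
		uma_zfree(eli_zone_256k, p);
	else
		uma_zfree(eli_zone_512k, p);
}

The waste from rounding up shouldn't matter much since, as you say,
these buffers are short-lived.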


>
> > > My first guess would be that your production workload was hitting this
> > > path, and your benchmarks are not.  If you have stack traces or lock
> > > names from DTrace, that would help validate this theory, in which case
> > > using UMA to cache buffers would be a reasonable solution.
> >
> > Would that require creating an extra UMA zone for every possible geli
> > allocation size above 64kB?
>
> Something like that.  Or have a zone of maxphys-sized buffers (actually
> I think it needs to be slightly larger than that?) and accept the
> corresponding waste, given that these allocations are short-lived.  This
> is basically what g_eli_alloc_data() already does.
>
> > > > But on a non-production server, none of my benchmark workloads causes
> > > > g_eli_alloc_data to break a sweat.  I can't get its CPU consumption to
> > > > rise higher than 0.5%.  And that's using the smallest sector size and
> > > > block size that I can.
> > > >
> > > > So my question is: does anybody have a program that can really stress
> > > > malloc(9)?  I'd like to run it in parallel with my geli benchmarks to
> > > > see how much it interferes.
> > > >
> > > > -Alan
> > > >
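
To partially answer my own question from up-thread: absent a ready-made
tool, the load generator I'm imagining is a throwaway kld that starts a
handful of kthreads doing nothing but malloc/free of mixed sizes.
Untested sketch, every mstress_* name in it is made up:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/kernel.h>
#include <sys/kthread.h>
#include <sys/malloc.h>
#include <sys/module.h>

static MALLOC_DEFINE(M_MSTRESS, "mstress", "malloc(9) stress buffers");

#define	MSTRESS_THREADS	8

static struct proc *mstress_proc;

static void
mstress_worker(void *arg __unused)
{
	/*
	 * Mix of sizes: the small ones stay on the UMA path, while the
	 * 128k entry should take the malloc_large() path.
	 */
	static const size_t sizes[] = { 512, 4096, 16384, 65536, 131072 };
	void *p;
	u_int i;

	for (i = 0;; i++) {
		p = malloc(sizes[i % nitems(sizes)], M_MSTRESS, M_WAITOK);
		free(p, M_MSTRESS);
	}
}

static int
mstress_modevent(module_t mod __unused, int type, void *data __unused)
{
	int error, i;

	switch (type) {
	case MOD_LOAD:
		error = 0;
		for (i = 0; i < MSTRESS_THREADS && error == 0; i++)
			error = kproc_kthread_add(mstress_worker, NULL,
			    &mstress_proc, NULL, 0, 0, "mstress", "w%d", i);
		return (error);
	case MOD_UNLOAD:
		/* Keep the sketch simple: refuse to unload. */
		return (EBUSY);
	default:
		return (EOPNOTSUPP);
	}
}

static moduledata_t mstress_mod = {
	"mstress",
	mstress_modevent,
	NULL
};
DECLARE_MODULE(mstress, mstress_mod, SI_SUB_KLD, SI_ORDER_ANY);

Running that next to the geli benchmark should show how much the two
workloads interfere; dropping the 128k entry would keep everything on
the UMA path for comparison.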