From nobody Thu Feb 09 21:38:40 2023
X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4PCVc25q7Fz3nmdn
	for <freebsd-hackers@mlmmj.nyi.freebsd.org>; Thu,  9 Feb 2023 21:38:42 +0000 (UTC)
	(envelope-from theraven@FreeBSD.org)
Received: from smtp.freebsd.org (smtp.freebsd.org [IPv6:2610:1c1:1:606c::24b:4])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (4096 bits) client-digest SHA256)
	(Client CN "smtp.freebsd.org", Issuer "R3" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4PCVc25Kzlz4GLL;
	Thu,  9 Feb 2023 21:38:42 +0000 (UTC)
	(envelope-from theraven@FreeBSD.org)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim;
	t=1675978722;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=SP/1WlxvQylAiIe3tHKGFQyP6mRWp2V4gr/XfNuvV5o=;
	b=TBef7PTOSzrKghx+ffG/lpRZJHe83OBsLsEe6BYOLsdmfx0Nq4nSPQX2Eu1zj0HCmN3UIu
	bCNdrRg9lfs5hKwqo8qAc8K9OkWmpV3g0kLThsPKY2+B6jW9kIylzKFDaTCpRvqnJ/VKT3
	TRaSwQtSZrnglLw6GKlEUBzxZbCLJ/bMytkKaP7KT5fewX5dYSFC57c8WKQwhP/DX+9pty
	SDOHHI6jKsqOPECqkU0AdTnewJ3zxunHzc1hDqff2Qo5w04BCSr+jDBiN6RcPPgCdcltCh
	X/6k3ccbpU4fqXm7f7nWJQxOqYssj6HLsRPSUcXRyJheEEwY1l+h3x+J/wyRog==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org;
	s=dkim; t=1675978722;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=SP/1WlxvQylAiIe3tHKGFQyP6mRWp2V4gr/XfNuvV5o=;
	b=QVNZ/SyIkPMxofUvd7lBwomf7h/iq7L2MEj8J9zMYbPjh1WsuySXGE6Ac/AEFf1bdHJ9GQ
	QzoHpAXdn0T8LLdySnlpUuQuH9q6f2abiGGn1T1Z+EgVbeIllfscuJ2Q0KrqnxIT0ydOx0
	Y1SYSTHHdnJloi7pOm+cWq3xBzak2JDa4K7GFvPnY2gY5VJSOyI2TVTYMeJhlSvmQGQzNM
	jkU+7uIVw5eTA8MIkqi/nk8VApYyCyrqRnsapHpwZ3CDVbWwUhtcw3bvrjzjIXAULhwYNr
	uQO7tP5wRDgJ07ZLYQyFh5/n5/Eei9C2mUj8sX2bnd62d8oS6+ExvMx4aM1YcQ==
ARC-Authentication-Results: i=1;
	mx1.freebsd.org;
	none
ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1675978722; a=rsa-sha256; cv=none;
	b=Tpzzine3z9UlKKjDBgdfXoBgq+MGhYwGTB1EeNWOYOn27UCrcwZ+hJG7+LhyIg0oVopmKG
	9zkuVwPJ+sj453GPpAxGUPqyXoOvhDXbXhh0x1v6yREzGxh8+l/qfpOm9WZjYOnfhRc2aZ
	yMd6HYwpAMxXKQDHKcCVbXgLEFEBsiP4aOBBQP8DDto01oUkIH0bm6vQwwwl4R/KGj07UD
	gOXhY+hpj2Rkzcv9/GLHXpk00Qvbz5gysyQ33Y24e5tqZai+dzeKKmpRLkjqw55uHey49x
	yZcuvlCqTxa8i/bfpIqVVPEzUHXSyzZFbJQgpUiPoISYSwhVqTFgk9UIDReEyA==
Received: from smtp.theravensnest.org (smtp.theravensnest.org [45.77.103.195])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(Client did not present a certificate)
	(Authenticated sender: theraven)
	by smtp.freebsd.org (Postfix) with ESMTPSA id 4PCVc24GYmzXQW;
	Thu,  9 Feb 2023 21:38:42 +0000 (UTC)
	(envelope-from theraven@FreeBSD.org)
Received: from smtpclient.apple (host81-158-36-31.range81-158.btcentralplus.com [81.158.36.31])
	by smtp.theravensnest.org (Postfix) with ESMTPSA id B15E91058C;
	Thu,  9 Feb 2023 21:38:41 +0000 (GMT)
Content-Type: text/plain;
	charset=utf-8
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\))
Subject: Re: CFT: snmalloc as libc malloc
From: David Chisnall <theraven@FreeBSD.org>
In-Reply-To: <CAGudoHFeFm1K+JSBXaxt2SNTv-tGFgT0onyErOCmBdVjmHaxUg@mail.gmail.com>
Date: Thu, 9 Feb 2023 21:38:40 +0000
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org>
References: <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org>
 <CAGudoHFYMLk6EDrSxLiWFNBoYyTKXfHLAUhZC+RF4eUE-rip8Q@mail.gmail.com>
 <E140A3A2-5C4A-4458-B365-AD693AB853E8@FreeBSD.org>
 <CAGudoHFeFm1K+JSBXaxt2SNTv-tGFgT0onyErOCmBdVjmHaxUg@mail.gmail.com>
To: Mateusz Guzik <mjguzik@gmail.com>
X-Mailer: Apple Mail (2.3654.120.0.1.13)
X-ThisMailContainsUnwantedMimeParts: N

On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>=20
> First and foremost, perhaps I should clear up that I have no opinion
> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
> thing.
>=20
> Inspecting assembly is what I intended to do, but got the compilation =
problem.

If you just want to look at the memcpy, you might do better using the =
upstream (CMake) build system, which builds a shim library that you can =
disassemble and look at.

> So, as someone who worked on memcpy previously, I note the variant
> currently implemented in libc is pessimal for sizes > 16 bytes because
> it does not use SIMD. I do have plans to rectify that, but ENOTIME.

That=E2=80=99s one of the benefits of the C++ implementation.  We were =
able to try a few dozen variations in a morning.  Writing a single one =
of those in assembly would take a similar amount of time.

We were inspired by the automemcpy paper, which demonstrated that you =
can write memcpy implementations tuned to specific workloads in =
higher-level languages and get better performance.

> I also note CPUs are incredibly picky when it comes to starting point
> of the routine. The officialy recommended alignment of 16 bytes is
> basically a tradeoff between total binary size and performance. Should
> you move the routine at different 16 bytes intervals you will easily
> find big variation in performance, depending on how big of an
> alignment you ended up with.

Yes, that=E2=80=99s what our test does.  It does a large number of =
different copies with different sizes and alignments. =20

> I tried to compile the benchmark but got bested by c++. I don't know
> the lang and I don't want to fight it.
>=20
> $ pwd
> /usr/src/contrib/snmalloc/src
> $ c++ -I. test/perf/memcpy/memcpy.cc
> [snip]
> =
./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:5=
7:26:
> error: no template named 'is_integral_v' in namespace 'std'; did you
> mean 'is_integral'?
>      static_assert(std::is_integral_v<T>, "Type must be integral");
>                    ~~~~~^~~~~~~~~~~~~
>                         is_integral

I think the only thing missing is -std=3Dc++17.

> I'm trying to say that if you end up with different funcs depending on
> bounds checking and you only align them to 16 bytes, the benchmark is
> most likely inaccurate if only for this reason.

I=E2=80=99m not sure I follow this, sorry.

>> The fastest on x86 is roughly:
>>=20
>> - A jump table of power for small sizes that do power-of-two-sized =
small
>> copies (double-word, word, half-word, and byte) to perform the copy.
>=20
> Last I checked this was not true. The last slow thing to do was to
> branch on few sizes and handle them with overlapping stores. Roughly
> what memcpy in libc is doing now.
>=20
> Jump table aside, you got all sizes "spelled out", that is just
> avoidable impact on icache. While overlapping stores do come with some
> penalty, it is cheaper than the above combined.

They=E2=80=99re not all spelled out, because a lot of them are subsets =
of the others.  The compiler does a *very* good job of optimising this.  =
The generated assembly was a lot more complex than anything I could =
write by hand.

> I don't have numbers nor bench code handy, but if you insist on
> contesting the above I'll be glad to provide them.
>=20
>> - A vectorised copy for medium-sized copies using a loop of SSE =
copies.
>=20
> Depends on what you mean by medium and which simd instructions, but
> yes, they should be used here.

This could probably do with more tuning, but all of this is configurable =
with a couple of template parameters.  If it=E2=80=99s useful to have =
more variants for different microarchitectures then it=E2=80=99s a =
trivial tweak to generate them.

>> - rep movsb for large copies.
>=20
> There are still old cpus here and there which don't support ERMS. They
> are very negatively impacted the above and should roll with rep stosq
> instead.

We tested on some microarchitectures without FRMS but didn=E2=80=99t =
compare with rep stosq as an alternative.  The rep movsb is inline =
assembly =
(https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162=
bf1ee81196/src/snmalloc/global/memcpy.h#L351) so it would be quite easy =
to compare.  Very happy to take patches that make things faster!

PowerPC doesn=E2=80=99t have an equivalent of rep movsb, so the PowerPC =
version doesn=E2=80=99t have an equivalent codepath.

Each tuned version is a dozen lines of code (plus a lot of comments to =
explain *why* it does things a particular way), so it=E2=80=99s very =
easy to add different versions with different tuning. =20

> I see the code takes some care to align the target buffer. That's
> good, but not necessary on CPUs with FSRM.

It doesn=E2=80=99t hurt though, on the sizes where it=E2=80=99s actually =
beneficial to use the rep sequence the cost of alignment is negligible.

> All that said, I have hard time believing the impact of bounds
> checking on short copies is around 20% or so. The code to do it looks
> super slow (not that I know to do better).

The last time I looked at the disassembly, the code for the bounds check =
touched three cache lines and has two branches.  It fits well in a =
superscalar pipeline so adds a handful of cycles.  This is basically in =
the noise for large copies but is double-digit percentage overhead for =
small ones.

> People like to claim short sizes are inlined by the compiler, but
> that's only true if the size is known at compilation time, which it
> often is not. Then they claim they are rare.

I agree.

> To illustrate why that's bogus, here is clang 15 compiling =
vfs_cache.c:
>           value  ------------- Distribution ------------- count
>              -1 |                                         0
>               0 |@                                        19758
>               1 |@@@@@@@@                                 218420
>               2 |@@                                       67670
>               4 |@@@@                                     103914
>               8 |@@@@@@@@@@@                              301157
>              16 |@@@@@@@@@@                               293812
>              32 |@@                                       57954
>              64 |@                                        23818
>             128 |                                         13344
>             256 |@                                        18507
>             512 |                                         6342
>            1024 |                                         1710
>            2048 |                                         627
>            4096 |                                         398
>            8192 |                                         34
>           16384 |                                         10
>           32768 |                                         6
>           65536 |                                         7
>          131072 |                                         4
>          262144 |                                         1
>          524288 |                                         1
>         1048576 |                                         0
>=20
> obtained with this bad boy:
> dtrace -n 'pid$target::memcpy:entry { @ =3D quantize(arg2); }' -c "cc
> -target x86_64-unknown-freebsd14.0
> --sysroot=3D/usr/obj/usr/src/amd64.amd64/tmp
> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
> -fno-strict-aliasing  -g -nostdinc  -I. -I/usr/src/sys
> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
> -fno-common    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
> -MD  -MF.depend.vfs_cache.o -MTvfs_cache.o
> -fdebug-prefix-map=3D./machine=3D/usr/src/sys/amd64/include
> -fdebug-prefix-map=3D./x86=3D/usr/src/sys/x86/include
> -fdebug-prefix-map=3D./i386=3D/usr/src/sys/i386/include =
-mcmodel=3Dkernel
> -mno-red-zone -mno-mmx -mno-sse -msoft-float
> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
> -Wno-pointer-sign -D__printf__=3D__freebsd_kprintf__
> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
> -Wno-error=3Dtautological-compare -Wno-error=3Dempty-body
> -Wno-error=3Dparentheses-equality -Wno-error=3Dunused-function
> -Wno-error=3Dpointer-sign -Wno-error=3Dshift-negative-value
> -Wno-address-of-packed-member -Wno-error=3Darray-parameter
> -Wno-error=3Ddeprecated-non-prototype -Wno-error=3Dstrict-prototypes
> -Wno-error=3Dunused-but-set-variable -Wno-format-zero-length   =
-mno-aes
> -mno-avx  -std=3Diso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c=E2=80=
=9D

That=E2=80=99s a really helpful datapoint, thanks! That distribution =
matches roughly what we=E2=80=99ve seen - the number of zero-byte memcpy =
calls surprised me the first time I saw it, but the automemcpy paper =
reported something similar.

If you can also extract the alignments from these then it would be =
fairly easy to modify the benchmark program to use those distributions.

My original version extended the FreeBSD assembly memcpy with a call to =
a function that did the bounds checks, but that had *much* worse =
performance.  We could insert assembly to do the bounds checks, but =
that=E2=80=99s hard to update if we change the metadata layout in =
malloc.

We could modify the ifunc resolvers to use the current memcpy if you =
don=E2=80=99t care about bounds checks.  I did a tiny amount of testing =
(mostly clang and lld) in this configuration and it was slower than =
using the snmalloc memcpy, but I didn=E2=80=99t do enough benchmarking =
to be confident of that.  Hence the CFT.

David