From nobody Thu Feb 09 21:38:40 2023 X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4PCVc25q7Fz3nmdn for ; Thu, 9 Feb 2023 21:38:42 +0000 (UTC) (envelope-from theraven@FreeBSD.org) Received: from smtp.freebsd.org (smtp.freebsd.org [IPv6:2610:1c1:1:606c::24b:4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4PCVc25Kzlz4GLL; Thu, 9 Feb 2023 21:38:42 +0000 (UTC) (envelope-from theraven@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1675978722; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SP/1WlxvQylAiIe3tHKGFQyP6mRWp2V4gr/XfNuvV5o=; b=TBef7PTOSzrKghx+ffG/lpRZJHe83OBsLsEe6BYOLsdmfx0Nq4nSPQX2Eu1zj0HCmN3UIu bCNdrRg9lfs5hKwqo8qAc8K9OkWmpV3g0kLThsPKY2+B6jW9kIylzKFDaTCpRvqnJ/VKT3 TRaSwQtSZrnglLw6GKlEUBzxZbCLJ/bMytkKaP7KT5fewX5dYSFC57c8WKQwhP/DX+9pty SDOHHI6jKsqOPECqkU0AdTnewJ3zxunHzc1hDqff2Qo5w04BCSr+jDBiN6RcPPgCdcltCh X/6k3ccbpU4fqXm7f7nWJQxOqYssj6HLsRPSUcXRyJheEEwY1l+h3x+J/wyRog== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1675978722; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SP/1WlxvQylAiIe3tHKGFQyP6mRWp2V4gr/XfNuvV5o=; b=QVNZ/SyIkPMxofUvd7lBwomf7h/iq7L2MEj8J9zMYbPjh1WsuySXGE6Ac/AEFf1bdHJ9GQ QzoHpAXdn0T8LLdySnlpUuQuH9q6f2abiGGn1T1Z+EgVbeIllfscuJ2Q0KrqnxIT0ydOx0 Y1SYSTHHdnJloi7pOm+cWq3xBzak2JDa4K7GFvPnY2gY5VJSOyI2TVTYMeJhlSvmQGQzNM jkU+7uIVw5eTA8MIkqi/nk8VApYyCyrqRnsapHpwZ3CDVbWwUhtcw3bvrjzjIXAULhwYNr uQO7tP5wRDgJ07ZLYQyFh5/n5/Eei9C2mUj8sX2bnd62d8oS6+ExvMx4aM1YcQ== ARC-Authentication-Results: i=1; mx1.freebsd.org; none ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1675978722; a=rsa-sha256; cv=none; b=Tpzzine3z9UlKKjDBgdfXoBgq+MGhYwGTB1EeNWOYOn27UCrcwZ+hJG7+LhyIg0oVopmKG 9zkuVwPJ+sj453GPpAxGUPqyXoOvhDXbXhh0x1v6yREzGxh8+l/qfpOm9WZjYOnfhRc2aZ yMd6HYwpAMxXKQDHKcCVbXgLEFEBsiP4aOBBQP8DDto01oUkIH0bm6vQwwwl4R/KGj07UD gOXhY+hpj2Rkzcv9/GLHXpk00Qvbz5gysyQ33Y24e5tqZai+dzeKKmpRLkjqw55uHey49x yZcuvlCqTxa8i/bfpIqVVPEzUHXSyzZFbJQgpUiPoISYSwhVqTFgk9UIDReEyA== Received: from smtp.theravensnest.org (smtp.theravensnest.org [45.77.103.195]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) (Authenticated sender: theraven) by smtp.freebsd.org (Postfix) with ESMTPSA id 4PCVc24GYmzXQW; Thu, 9 Feb 2023 21:38:42 +0000 (UTC) (envelope-from theraven@FreeBSD.org) Received: from smtpclient.apple (host81-158-36-31.range81-158.btcentralplus.com [81.158.36.31]) by smtp.theravensnest.org (Postfix) with ESMTPSA id B15E91058C; Thu, 9 Feb 2023 21:38:41 +0000 (GMT) Content-Type: text/plain; charset=utf-8 List-Id: Technical discussions relating to FreeBSD List-Archive: https://lists.freebsd.org/archives/freebsd-hackers List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-hackers@freebsd.org Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\)) Subject: Re: CFT: snmalloc as libc malloc From: David Chisnall In-Reply-To: Date: Thu, 9 Feb 2023 21:38:40 +0000 Cc: freebsd-hackers Content-Transfer-Encoding: quoted-printable Message-Id: <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org> References: <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org> To: Mateusz Guzik X-Mailer: Apple Mail (2.3654.120.0.1.13) X-ThisMailContainsUnwantedMimeParts: N On 9 Feb 2023, at 20:53, Mateusz Guzik wrote: >=20 > First and foremost, perhaps I should clear up that I have no opinion > on snmalloc or it replacing jemalloc. I'm here only about the memcpy > thing. >=20 > Inspecting assembly is what I intended to do, but got the compilation = problem. If you just want to look at the memcpy, you might do better using the = upstream (CMake) build system, which builds a shim library that you can = disassemble and look at. > So, as someone who worked on memcpy previously, I note the variant > currently implemented in libc is pessimal for sizes > 16 bytes because > it does not use SIMD. I do have plans to rectify that, but ENOTIME. That=E2=80=99s one of the benefits of the C++ implementation. We were = able to try a few dozen variations in a morning. Writing a single one = of those in assembly would take a similar amount of time. We were inspired by the automemcpy paper, which demonstrated that you = can write memcpy implementations tuned to specific workloads in = higher-level languages and get better performance. > I also note CPUs are incredibly picky when it comes to starting point > of the routine. The officialy recommended alignment of 16 bytes is > basically a tradeoff between total binary size and performance. Should > you move the routine at different 16 bytes intervals you will easily > find big variation in performance, depending on how big of an > alignment you ended up with. Yes, that=E2=80=99s what our test does. It does a large number of = different copies with different sizes and alignments. =20 > I tried to compile the benchmark but got bested by c++. I don't know > the lang and I don't want to fight it. >=20 > $ pwd > /usr/src/contrib/snmalloc/src > $ c++ -I. test/perf/memcpy/memcpy.cc > [snip] > = ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:5= 7:26: > error: no template named 'is_integral_v' in namespace 'std'; did you > mean 'is_integral'? > static_assert(std::is_integral_v, "Type must be integral"); > ~~~~~^~~~~~~~~~~~~ > is_integral I think the only thing missing is -std=3Dc++17. > I'm trying to say that if you end up with different funcs depending on > bounds checking and you only align them to 16 bytes, the benchmark is > most likely inaccurate if only for this reason. I=E2=80=99m not sure I follow this, sorry. >> The fastest on x86 is roughly: >>=20 >> - A jump table of power for small sizes that do power-of-two-sized = small >> copies (double-word, word, half-word, and byte) to perform the copy. >=20 > Last I checked this was not true. The last slow thing to do was to > branch on few sizes and handle them with overlapping stores. Roughly > what memcpy in libc is doing now. >=20 > Jump table aside, you got all sizes "spelled out", that is just > avoidable impact on icache. While overlapping stores do come with some > penalty, it is cheaper than the above combined. They=E2=80=99re not all spelled out, because a lot of them are subsets = of the others. The compiler does a *very* good job of optimising this. = The generated assembly was a lot more complex than anything I could = write by hand. > I don't have numbers nor bench code handy, but if you insist on > contesting the above I'll be glad to provide them. >=20 >> - A vectorised copy for medium-sized copies using a loop of SSE = copies. >=20 > Depends on what you mean by medium and which simd instructions, but > yes, they should be used here. This could probably do with more tuning, but all of this is configurable = with a couple of template parameters. If it=E2=80=99s useful to have = more variants for different microarchitectures then it=E2=80=99s a = trivial tweak to generate them. >> - rep movsb for large copies. >=20 > There are still old cpus here and there which don't support ERMS. They > are very negatively impacted the above and should roll with rep stosq > instead. We tested on some microarchitectures without FRMS but didn=E2=80=99t = compare with rep stosq as an alternative. The rep movsb is inline = assembly = (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162= bf1ee81196/src/snmalloc/global/memcpy.h#L351) so it would be quite easy = to compare. Very happy to take patches that make things faster! PowerPC doesn=E2=80=99t have an equivalent of rep movsb, so the PowerPC = version doesn=E2=80=99t have an equivalent codepath. Each tuned version is a dozen lines of code (plus a lot of comments to = explain *why* it does things a particular way), so it=E2=80=99s very = easy to add different versions with different tuning. =20 > I see the code takes some care to align the target buffer. That's > good, but not necessary on CPUs with FSRM. It doesn=E2=80=99t hurt though, on the sizes where it=E2=80=99s actually = beneficial to use the rep sequence the cost of alignment is negligible. > All that said, I have hard time believing the impact of bounds > checking on short copies is around 20% or so. The code to do it looks > super slow (not that I know to do better). The last time I looked at the disassembly, the code for the bounds check = touched three cache lines and has two branches. It fits well in a = superscalar pipeline so adds a handful of cycles. This is basically in = the noise for large copies but is double-digit percentage overhead for = small ones. > People like to claim short sizes are inlined by the compiler, but > that's only true if the size is known at compilation time, which it > often is not. Then they claim they are rare. I agree. > To illustrate why that's bogus, here is clang 15 compiling = vfs_cache.c: > value ------------- Distribution ------------- count > -1 | 0 > 0 |@ 19758 > 1 |@@@@@@@@ 218420 > 2 |@@ 67670 > 4 |@@@@ 103914 > 8 |@@@@@@@@@@@ 301157 > 16 |@@@@@@@@@@ 293812 > 32 |@@ 57954 > 64 |@ 23818 > 128 | 13344 > 256 |@ 18507 > 512 | 6342 > 1024 | 1710 > 2048 | 627 > 4096 | 398 > 8192 | 34 > 16384 | 10 > 32768 | 6 > 65536 | 7 > 131072 | 4 > 262144 | 1 > 524288 | 1 > 1048576 | 0 >=20 > obtained with this bad boy: > dtrace -n 'pid$target::memcpy:entry { @ =3D quantize(arg2); }' -c "cc > -target x86_64-unknown-freebsd14.0 > --sysroot=3D/usr/obj/usr/src/amd64.amd64/tmp > -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe > -fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys > -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt > -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h > -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer > -MD -MF.depend.vfs_cache.o -MTvfs_cache.o > -fdebug-prefix-map=3D./machine=3D/usr/src/sys/amd64/include > -fdebug-prefix-map=3D./x86=3D/usr/src/sys/x86/include > -fdebug-prefix-map=3D./i386=3D/usr/src/sys/i386/include = -mcmodel=3Dkernel > -mno-red-zone -mno-mmx -mno-sse -msoft-float > -fno-asynchronous-unwind-tables -ffreestanding -fwrapv > -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes > -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef > -Wno-pointer-sign -D__printf__=3D__freebsd_kprintf__ > -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas > -Wno-error=3Dtautological-compare -Wno-error=3Dempty-body > -Wno-error=3Dparentheses-equality -Wno-error=3Dunused-function > -Wno-error=3Dpointer-sign -Wno-error=3Dshift-negative-value > -Wno-address-of-packed-member -Wno-error=3Darray-parameter > -Wno-error=3Ddeprecated-non-prototype -Wno-error=3Dstrict-prototypes > -Wno-error=3Dunused-but-set-variable -Wno-format-zero-length = -mno-aes > -mno-avx -std=3Diso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c=E2=80= =9D That=E2=80=99s a really helpful datapoint, thanks! That distribution = matches roughly what we=E2=80=99ve seen - the number of zero-byte memcpy = calls surprised me the first time I saw it, but the automemcpy paper = reported something similar. If you can also extract the alignments from these then it would be = fairly easy to modify the benchmark program to use those distributions. My original version extended the FreeBSD assembly memcpy with a call to = a function that did the bounds checks, but that had *much* worse = performance. We could insert assembly to do the bounds checks, but = that=E2=80=99s hard to update if we change the metadata layout in = malloc. We could modify the ifunc resolvers to use the current memcpy if you = don=E2=80=99t care about bounds checks. I did a tiny amount of testing = (mostly clang and lld) in this configuration and it was slower than = using the snmalloc memcpy, but I didn=E2=80=99t do enough benchmarking = to be confident of that. Hence the CFT. David