From nobody Mon Feb 13 12:48:03 2023
Message-ID: <0a00f93d-adf7-1b70-3898-1377c1b582a2@FreeBSD.org>
Date: Mon, 13 Feb 2023 12:48:03 +0000
Subject: Re: CFT: snmalloc as libc malloc
To: freebsd-hackers@freebsd.org
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
References: <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org>
 <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org>
From: David Chisnall <theraven@FreeBSD.org>
Content-Type: text/plain; charset=UTF-8; format=flowed

Thanks, that's good feedback. I didn't separate these originally because I
didn't want a huge testing matrix, but I've now done so. The latest commit
in that branch splits this into separate options:

- WITH_SNMALLOC (defaults to on) enables snmalloc.
- WITH_SNMALLOC_MEMCPY (defaults to on if snmalloc is enabled) enables
  snmalloc's memcpy.
- WITH_MALLOC_PRODUCTION (I believe this is off by default for -CURRENT,
  on for other branches) puts both snmalloc and jemalloc into the
  recommended mode for release builds.

It's useful to compare snmalloc and jemalloc with WITH_MALLOC_PRODUCTION
enabled for both. I'd also be interested in seeing snmalloc *without* the
production knobs against jemalloc *with* them, because if we have a bit of
performance headroom we could spend it on better security features. I'd
also like to see comparisons of WITH_SNMALLOC_MEMCPY with different
settings of the LIBC_BOUNDS_CHECKS environment variable for any workloads
that people care about.

David

On 09/02/2023 22:07, Mateusz Guzik wrote:
> I'm going to write a proper reply some time this week.
>
> For now, what matters is that this CFT mixes up different things:
> - snmalloc
> - memcpy as shipped with snmalloc
> - bounds checks for said memcpy
>
> Seeing as the stock memcpy is deficient, the one provided by snmalloc
> should not be a factor when deciding whether to switch to snmalloc, so
> imo this should be reposted. I also note that, at least on head, stock
> jemalloc ships with debug enabled, which again people run with by
> default unless they explicitly disabled it.
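For concreteness, the combinations being discussed can be selected in
src.conf. Something like the following should build snmalloc while keeping
the stock memcpy; this is a sketch only, assuming the usual WITH_/WITHOUT_
src.conf spellings for the knobs in the branch, so double-check the exact
names against the branch itself:

```
# /etc/src.conf - sketch only; option names assume the conventional
# WITH_/WITHOUT_ src.conf negations for the knobs in the CFT branch
WITH_SNMALLOC=yes              # use snmalloc as the libc malloc
WITHOUT_SNMALLOC_MEMCPY=yes    # keep the stock memcpy
WITHOUT_MALLOC_PRODUCTION=yes  # keep debug assertions (flip for release-mode runs)
```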
>
> So a better CFT would include instructions for:
> - how to make sure jemalloc does not have debug enabled
> - how to build snmalloc without affecting memcpy (in fact, it would be
>   best if the branch for now shipped *without* the custom memcpy so
>   that one does not have to check what happened)
>
> Frankly, I don't know how to check the jemalloc thing, but I can spend
> some time to find out if that helps.
>
> On 2/9/23, David Chisnall wrote:
>> On 9 Feb 2023, at 20:53, Mateusz Guzik wrote:
>>>
>>> First and foremost, perhaps I should clear up that I have no opinion
>>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>>> thing.
>>>
>>> Inspecting assembly is what I intended to do, but I got the
>>> compilation problem.
>>
>> If you just want to look at the memcpy, you might do better using the
>> upstream (CMake) build system, which builds a shim library that you
>> can disassemble and look at.
>>
>>> So, as someone who has worked on memcpy previously, I note that the
>>> variant currently implemented in libc is pessimal for sizes > 16
>>> bytes because it does not use SIMD. I do have plans to rectify that,
>>> but ENOTIME.
>>
>> That's one of the benefits of the C++ implementation. We were able to
>> try a few dozen variations in a morning. Writing a single one of those
>> in assembly would take a similar amount of time.
>>
>> We were inspired by the automemcpy paper, which demonstrated that you
>> can write memcpy implementations tuned to specific workloads in
>> higher-level languages and get better performance.
>>
>>> I also note that CPUs are incredibly picky about the starting point
>>> of the routine. The officially recommended alignment of 16 bytes is
>>> basically a tradeoff between total binary size and performance.
>>> Should you move the routine around at different 16-byte intervals,
>>> you will easily find big variations in performance, depending on how
>>> big an alignment you ended up with.
>>
>> Yes, that's what our test does.
>> It does a large number of
>> different copies with different sizes and alignments.
>>
>>> I tried to compile the benchmark but got bested by C++. I don't know
>>> the language and I don't want to fight it.
>>>
>>> $ pwd
>>> /usr/src/contrib/snmalloc/src
>>> $ c++ -I. test/perf/memcpy/memcpy.cc
>>> [snip]
>>> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
>>> error: no template named 'is_integral_v' in namespace 'std'; did you
>>> mean 'is_integral'?
>>> static_assert(std::is_integral_v, "Type must be integral");
>>>                    ~~~~~^~~~~~~~~~~~~
>>>                         is_integral
>>
>> I think the only thing missing is -std=c++17.
>>
>>> I'm trying to say that if you end up with different functions
>>> depending on bounds checking and you only align them to 16 bytes, the
>>> benchmark is most likely inaccurate, if only for this reason.
>>
>> I'm not sure I follow this, sorry.
>>
>>>> The fastest on x86 is roughly:
>>>>
>>>> - A jump table for small sizes that does power-of-two-sized small
>>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>>
>>> Last I checked this was not true. The least slow thing to do is to
>>> branch on a few sizes and handle them with overlapping stores -
>>> roughly what the memcpy in libc is doing now.
>>>
>>> Jump table aside, you get all sizes "spelled out"; that is just
>>> avoidable impact on the icache. While overlapping stores do come with
>>> some penalty, it is cheaper than the above combined.
>>
>> They're not all spelled out, because a lot of them are subsets of the
>> others. The compiler does a *very* good job of optimising this. The
>> generated assembly was a lot more complex than anything I could write
>> by hand.
>>
>>> I don't have numbers or bench code handy, but if you insist on
>>> contesting the above I'll be glad to provide them.
>>>
>>>> - A vectorised copy for medium-sized copies using a loop of SSE
>>>> copies.
>>>
>>> Depends on what you mean by medium and which SIMD instructions, but
>>> yes, they should be used here.
>>
>> This could probably do with more tuning, but all of this is
>> configurable with a couple of template parameters. If it's useful to
>> have more variants for different microarchitectures then it's a
>> trivial tweak to generate them.
>>
>>>> - rep movsb for large copies.
>>>
>>> There are still old CPUs here and there which don't support ERMS.
>>> They are very negatively impacted by the above and should use rep
>>> movsq instead.
>>
>> We tested on some microarchitectures without FSRM but didn't compare
>> with rep movsq as an alternative. The rep movsb is inline assembly
>> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
>> so it would be quite easy to compare. Very happy to take patches that
>> make things faster!
>>
>> PowerPC doesn't have an equivalent of rep movsb, so the PowerPC
>> version doesn't have an equivalent codepath.
>>
>> Each tuned version is a dozen lines of code (plus a lot of comments to
>> explain *why* it does things a particular way), so it's very easy to
>> add different versions with different tuning.
>>
>>> I see the code takes some care to align the target buffer. That's
>>> good, but not necessary on CPUs with FSRM.
>>
>> It doesn't hurt, though: for the sizes where it's actually beneficial
>> to use the rep sequence, the cost of alignment is negligible.
>>
>>> All that said, I have a hard time believing the impact of bounds
>>> checking on short copies is around 20% or so. The code to do it looks
>>> super slow (not that I know how to do better).
>>
>> The last time I looked at the disassembly, the code for the bounds
>> check touched three cache lines and had two branches. It fits well in
>> a superscalar pipeline, so it adds a handful of cycles. This is
>> basically in the noise for large copies but is double-digit percentage
>> overhead for small ones.
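As an aside for anyone following along, the overlapping-store trick
Mateusz describes can be sketched in plain C. This is an illustration of
the technique only, not the actual libc or snmalloc code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the overlapping-store technique for small copies (n <= 16):
 * branch on a few size classes and cover the whole range with two
 * possibly-overlapping fixed-size loads/stores, instead of spelling out
 * every size behind a jump table. The fixed-size memcpy calls compile
 * down to single unaligned load/store instructions.
 */
static void small_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 8) {                 /* 8..16 bytes: two 8-byte accesses */
        uint64_t a, b;
        memcpy(&a, s, 8);
        memcpy(&b, s + n - 8, 8); /* overlaps the first when n < 16 */
        memcpy(d, &a, 8);
        memcpy(d + n - 8, &b, 8);
    } else if (n >= 4) {          /* 4..7 bytes: two 4-byte accesses */
        uint32_t a, b;
        memcpy(&a, s, 4);
        memcpy(&b, s + n - 4, 4);
        memcpy(d, &a, 4);
        memcpy(d + n - 4, &b, 4);
    } else if (n >= 2) {          /* 2..3 bytes: two 2-byte accesses */
        uint16_t a, b;
        memcpy(&a, s, 2);
        memcpy(&b, s + n - 2, 2);
        memcpy(d, &a, 2);
        memcpy(d + n - 2, &b, 2);
    } else if (n == 1) {
        *d = *s;
    }
}
```

The point is that four branches cover every size from 0 to 16 with no
per-size code, trading a small store-forwarding penalty for much less
icache pressure than a fully unrolled jump table.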
>>
>>> People like to claim short sizes are inlined by the compiler, but
>>> that's only true if the size is known at compilation time, which it
>>> often is not. Then they claim short sizes are rare.
>>
>> I agree.
>>
>>> To illustrate why that's bogus, here is clang 15 compiling
>>> vfs_cache.c:
>>>
>>>            value  ------------- Distribution ------------- count
>>>               -1 |                                         0
>>>                0 |@                                        19758
>>>                1 |@@@@@@@@                                 218420
>>>                2 |@@                                       67670
>>>                4 |@@@@                                     103914
>>>                8 |@@@@@@@@@@@                              301157
>>>               16 |@@@@@@@@@@                               293812
>>>               32 |@@                                       57954
>>>               64 |@                                        23818
>>>              128 |                                         13344
>>>              256 |@                                        18507
>>>              512 |                                         6342
>>>             1024 |                                         1710
>>>             2048 |                                         627
>>>             4096 |                                         398
>>>             8192 |                                         34
>>>            16384 |                                         10
>>>            32768 |                                         6
>>>            65536 |                                         7
>>>           131072 |                                         4
>>>           262144 |                                         1
>>>           524288 |                                         1
>>>          1048576 |                                         0
>>>
>>> obtained with this bad boy:
>>>
>>> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
>>> -target x86_64-unknown-freebsd14.0
>>> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
>>> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
>>> -fno-strict-aliasing -g -nostdinc -I.
>>> -I/usr/src/sys
>>> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
>>> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
>>> -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
>>> -MD -MF.depend.vfs_cache.o -MTvfs_cache.o
>>> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
>>> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
>>> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
>>> -mno-red-zone -mno-mmx -mno-sse -msoft-float
>>> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
>>> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
>>> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
>>> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
>>> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
>>> -Wno-error=tautological-compare -Wno-error=empty-body
>>> -Wno-error=parentheses-equality -Wno-error=unused-function
>>> -Wno-error=pointer-sign -Wno-error=shift-negative-value
>>> -Wno-address-of-packed-member -Wno-error=array-parameter
>>> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
>>> -Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes
>>> -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"
>>
>> That's a really helpful datapoint, thanks! That distribution matches
>> roughly what we've seen - the number of zero-byte memcpy calls
>> surprised me the first time I saw it, but the automemcpy paper
>> reported something similar.
>>
>> If you can also extract the alignments from these then it would be
>> fairly easy to modify the benchmark program to use those
>> distributions.
>>
>> My original version extended the FreeBSD assembly memcpy with a call
>> to a function that did the bounds checks, but that had *much* worse
>> performance. We could insert assembly to do the bounds checks, but
>> that's hard to update if we change the metadata layout in malloc.
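Extracting the alignments should be a small variation on the same
one-liner. An untested sketch, reusing the compiler command above (the
pid provider's memcpy:entry gives the destination pointer as arg0;
masking with 63 yields its offset within a 64-byte cache line):

```
# Sketch (untested): histogram of memcpy destination alignment.
# lquantize buckets arg0 & 63 into one-byte-wide bins from 0 to 63.
dtrace -n 'pid$target::memcpy:entry { @ = lquantize(arg0 & 63, 0, 64, 1); }' \
    -c "cc ... /usr/src/sys/kern/vfs_cache.c"
```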
>>
>> We could modify the ifunc resolvers to use the current memcpy if you
>> don't care about bounds checks. I did a tiny amount of testing (mostly
>> clang and lld) in this configuration and it was slower than using the
>> snmalloc memcpy, but I didn't do enough benchmarking to be confident
>> of that. Hence the CFT.
>>
>> David
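For reference, the "rep movsb for large copies" path discussed in the
thread boils down to something like the following. This is a sketch with
a portable fallback so it compiles anywhere, not snmalloc's actual
implementation (which lives at the GitHub link above):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of a large-copy path in the spirit of the snmalloc memcpy:
 * on x86-64 a single rep movsb copies RCX bytes from [RSI] to [RDI],
 * and on ERMS/FSRM hardware the microcode picks the widest transfers
 * itself. Elsewhere we just defer to the libc memcpy.
 */
static void copy_large(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__)
    /* "+D"/"+S"/"+c" pin dst, src, n to RDI, RSI, RCX, which the
     * instruction both reads and updates. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}
```

On pre-ERMS parts a rep movsq loop over 8-byte words (with a small tail
copy) would take the place of the rep movsb, per Mateusz's point about
old CPUs.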