From nobody Mon Feb 13 12:48:03 2023
Message-ID: <0a00f93d-adf7-1b70-3898-1377c1b582a2@FreeBSD.org>
Date: Mon, 13 Feb 2023 12:48:03 +0000
Subject: Re: CFT: snmalloc as libc malloc
To: freebsd-hackers@freebsd.org
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
References: <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org>
 <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org>
From: David Chisnall <theraven@FreeBSD.org>
Content-Type: text/plain; charset=UTF-8; format=flowed

Thanks, that's good feedback. I didn't separate these originally because I
didn't want a huge testing matrix, but I've now done so. The latest commit
in that branch splits this into separate options:

- WITH_SNMALLOC (defaults to on) enables snmalloc.
- WITH_SNMALLOC_MEMCPY (defaults to on if snmalloc is enabled) enables
  snmalloc's memcpy.
- WITH_MALLOC_PRODUCTION (I believe this is off by default for -CURRENT,
  on for other branches) puts both snmalloc and jemalloc into the
  recommended mode for release builds.

It's useful to compare snmalloc and jemalloc with WITH_MALLOC_PRODUCTION
enabled for both. I'd also be interested in seeing snmalloc *without* the
production knobs against jemalloc *with* them, because if we have a bit of
performance headroom we could spend it on better security features. I'd
also like to see comparisons of WITH_SNMALLOC_MEMCPY with different
settings of the LIBC_BOUNDS_CHECKS environment variable for any workloads
that people care about.

David

On 09/02/2023 22:07, Mateusz Guzik wrote:
> I'm going to write a proper reply some time this week.
>
> For now, what matters is that this CFT mixes up different things:
> - snmalloc
> - memcpy as shipped with snmalloc
> - bounds checks for said memcpy
>
> Seeing as the stock memcpy is deficient, the one provided by snmalloc
> should not be a factor when deciding whether to switch to snmalloc, so
> imo this should be reposted. I also note that, at least on head, stock
> jemalloc ships with debug enabled, which again people run with by
> default unless they explicitly disabled it.
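For concreteness, the combinations being discussed can be selected in
src.conf. Something like the following should build snmalloc while keeping
the stock memcpy; this is a sketch only, assuming the usual WITH_/WITHOUT_
src.conf spellings for the knobs in the branch, so double-check the exact
names against the branch itself:

```
# /etc/src.conf - sketch only; option names assume the conventional
# WITH_/WITHOUT_ src.conf negations for the knobs in the CFT branch
WITH_SNMALLOC=yes              # use snmalloc as the libc malloc
WITHOUT_SNMALLOC_MEMCPY=yes    # keep the stock memcpy
WITHOUT_MALLOC_PRODUCTION=yes  # keep debug assertions (flip for release-mode runs)
```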
>
> So a better CFT would include instructions for:
> - how to make sure jemalloc does not have debug enabled
> - how to build snmalloc without affecting memcpy (in fact, it would be
>   best if the branch for now shipped *without* the custom memcpy so
>   that one does not have to check what happened)
>
> Frankly, I don't know how to check the jemalloc thing, but I can spend
> some time to find out if that helps.
>
> On 2/9/23, David Chisnall wrote:
>> On 9 Feb 2023, at 20:53, Mateusz Guzik wrote:
>>>
>>> First and foremost, perhaps I should clear up that I have no opinion
>>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>>> thing.
>>>
>>> Inspecting assembly is what I intended to do, but I got the
>>> compilation problem.
>>
>> If you just want to look at the memcpy, you might do better using the
>> upstream (CMake) build system, which builds a shim library that you
>> can disassemble and look at.
>>
>>> So, as someone who has worked on memcpy previously, I note that the
>>> variant currently implemented in libc is pessimal for sizes > 16
>>> bytes because it does not use SIMD. I do have plans to rectify that,
>>> but ENOTIME.
>>
>> That's one of the benefits of the C++ implementation. We were able to
>> try a few dozen variations in a morning. Writing a single one of those
>> in assembly would take a similar amount of time.
>>
>> We were inspired by the automemcpy paper, which demonstrated that you
>> can write memcpy implementations tuned to specific workloads in
>> higher-level languages and get better performance.
>>
>>> I also note that CPUs are incredibly picky about the starting point
>>> of the routine. The officially recommended alignment of 16 bytes is
>>> basically a tradeoff between total binary size and performance.
>>> Should you move the routine around at different 16-byte intervals,
>>> you will easily find big variations in performance, depending on how
>>> big an alignment you ended up with.
>>
>> Yes, that's what our test does.
>> It does a large number of
>> different copies with different sizes and alignments.
>>
>>> I tried to compile the benchmark but got bested by C++. I don't know
>>> the language and I don't want to fight it.
>>>
>>> $ pwd
>>> /usr/src/contrib/snmalloc/src
>>> $ c++ -I. test/perf/memcpy/memcpy.cc
>>> [snip]
>>> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
>>> error: no template named 'is_integral_v' in namespace 'std'; did you
>>> mean 'is_integral'?
>>> static_assert(std::is_integral_v, "Type must be integral");
>>>                    ~~~~~^~~~~~~~~~~~~
>>>                         is_integral
>>
>> I think the only thing missing is -std=c++17.
>>
>>> I'm trying to say that if you end up with different functions
>>> depending on bounds checking and you only align them to 16 bytes, the
>>> benchmark is most likely inaccurate, if only for this reason.
>>
>> I'm not sure I follow this, sorry.
>>
>>>> The fastest on x86 is roughly:
>>>>
>>>> - A jump table for small sizes that does power-of-two-sized small
>>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>>
>>> Last I checked this was not true. The least slow thing to do is to
>>> branch on a few sizes and handle them with overlapping stores -
>>> roughly what the memcpy in libc is doing now.
>>>
>>> Jump table aside, you get all sizes "spelled out"; that is just
>>> avoidable impact on the icache. While overlapping stores do come with
>>> some penalty, it is cheaper than the above combined.
>>
>> They're not all spelled out, because a lot of them are subsets of the
>> others. The compiler does a *very* good job of optimising this. The
>> generated assembly was a lot more complex than anything I could write
>> by hand.
>>
>>> I don't have numbers or bench code handy, but if you insist on
>>> contesting the above I'll be glad to provide them.
>>>
>>>> - A vectorised copy for medium-sized copies using a loop of SSE
>>>> copies.
>>>
>>> Depends on what you mean by medium and which SIMD instructions, but
>>> yes, they should be used here.
>>
>> This could probably do with more tuning, but all of this is
>> configurable with a couple of template parameters. If it's useful to
>> have more variants for different microarchitectures then it's a
>> trivial tweak to generate them.
>>
>>>> - rep movsb for large copies.
>>>
>>> There are still old CPUs here and there which don't support ERMS.
>>> They are very negatively impacted by the above and should use rep
>>> movsq instead.
>>
>> We tested on some microarchitectures without FSRM but didn't compare
>> with rep movsq as an alternative. The rep movsb is inline assembly
>> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
>> so it would be quite easy to compare. Very happy to take patches that
>> make things faster!
>>
>> PowerPC doesn't have an equivalent of rep movsb, so the PowerPC
>> version doesn't have an equivalent codepath.
>>
>> Each tuned version is a dozen lines of code (plus a lot of comments to
>> explain *why* it does things a particular way), so it's very easy to
>> add different versions with different tuning.
>>
>>> I see the code takes some care to align the target buffer. That's
>>> good, but not necessary on CPUs with FSRM.
>>
>> It doesn't hurt, though: for the sizes where it's actually beneficial
>> to use the rep sequence, the cost of alignment is negligible.
>>
>>> All that said, I have a hard time believing the impact of bounds
>>> checking on short copies is around 20% or so. The code to do it looks
>>> super slow (not that I know how to do better).
>>
>> The last time I looked at the disassembly, the code for the bounds
>> check touched three cache lines and had two branches. It fits well in
>> a superscalar pipeline, so it adds a handful of cycles. This is
>> basically in the noise for large copies but is double-digit percentage
>> overhead for small ones.
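As an aside for anyone following along, the overlapping-store trick
Mateusz describes can be sketched in plain C. This is an illustration of
the technique only, not the actual libc or snmalloc code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the overlapping-store technique for small copies (n <= 16):
 * branch on a few size classes and cover the whole range with two
 * possibly-overlapping fixed-size loads/stores, instead of spelling out
 * every size behind a jump table. The fixed-size memcpy calls compile
 * down to single unaligned load/store instructions.
 */
static void small_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 8) {                 /* 8..16 bytes: two 8-byte accesses */
        uint64_t a, b;
        memcpy(&a, s, 8);
        memcpy(&b, s + n - 8, 8); /* overlaps the first when n < 16 */
        memcpy(d, &a, 8);
        memcpy(d + n - 8, &b, 8);
    } else if (n >= 4) {          /* 4..7 bytes: two 4-byte accesses */
        uint32_t a, b;
        memcpy(&a, s, 4);
        memcpy(&b, s + n - 4, 4);
        memcpy(d, &a, 4);
        memcpy(d + n - 4, &b, 4);
    } else if (n >= 2) {          /* 2..3 bytes: two 2-byte accesses */
        uint16_t a, b;
        memcpy(&a, s, 2);
        memcpy(&b, s + n - 2, 2);
        memcpy(d, &a, 2);
        memcpy(d + n - 2, &b, 2);
    } else if (n == 1) {
        *d = *s;
    }
}
```

The point is that four branches cover every size from 0 to 16 with no
per-size code, trading a small store-forwarding penalty for much less
icache pressure than a fully unrolled jump table.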
>>
>>> People like to claim short sizes are inlined by the compiler, but
>>> that's only true if the size is known at compilation time, which it
>>> often is not. Then they claim short sizes are rare.
>>
>> I agree.
>>
>>> To illustrate why that's bogus, here is clang 15 compiling
>>> vfs_cache.c:
>>>
>>>            value  ------------- Distribution ------------- count
>>>               -1 |                                         0
>>>                0 |@                                        19758
>>>                1 |@@@@@@@@                                 218420
>>>                2 |@@                                       67670
>>>                4 |@@@@                                     103914
>>>                8 |@@@@@@@@@@@                              301157
>>>               16 |@@@@@@@@@@                               293812
>>>               32 |@@                                       57954
>>>               64 |@                                        23818
>>>              128 |                                         13344
>>>              256 |@                                        18507
>>>              512 |                                         6342
>>>             1024 |                                         1710
>>>             2048 |                                         627
>>>             4096 |                                         398
>>>             8192 |                                         34
>>>            16384 |                                         10
>>>            32768 |                                         6
>>>            65536 |                                         7
>>>           131072 |                                         4
>>>           262144 |                                         1
>>>           524288 |                                         1
>>>          1048576 |                                         0
>>>
>>> obtained with this bad boy:
>>>
>>> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
>>> -target x86_64-unknown-freebsd14.0
>>> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
>>> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
>>> -fno-strict-aliasing -g -nostdinc -I.
>>> -I/usr/src/sys
>>> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
>>> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
>>> -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
>>> -MD -MF.depend.vfs_cache.o -MTvfs_cache.o
>>> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
>>> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
>>> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
>>> -mno-red-zone -mno-mmx -mno-sse -msoft-float
>>> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
>>> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
>>> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
>>> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
>>> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
>>> -Wno-error=tautological-compare -Wno-error=empty-body
>>> -Wno-error=parentheses-equality -Wno-error=unused-function
>>> -Wno-error=pointer-sign -Wno-error=shift-negative-value
>>> -Wno-address-of-packed-member -Wno-error=array-parameter
>>> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
>>> -Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes
>>> -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"
>>
>> That's a really helpful datapoint, thanks! That distribution matches
>> roughly what we've seen - the number of zero-byte memcpy calls
>> surprised me the first time I saw it, but the automemcpy paper
>> reported something similar.
>>
>> If you can also extract the alignments from these then it would be
>> fairly easy to modify the benchmark program to use those
>> distributions.
>>
>> My original version extended the FreeBSD assembly memcpy with a call
>> to a function that did the bounds checks, but that had *much* worse
>> performance. We could insert assembly to do the bounds checks, but
>> that's hard to update if we change the metadata layout in malloc.
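Extracting the alignments should be a small variation on the same
one-liner. An untested sketch, reusing the compiler command above (the
pid provider's memcpy:entry gives the destination pointer as arg0;
masking with 63 yields its offset within a 64-byte cache line):

```
# Sketch (untested): histogram of memcpy destination alignment.
# lquantize buckets arg0 & 63 into one-byte-wide bins from 0 to 63.
dtrace -n 'pid$target::memcpy:entry { @ = lquantize(arg0 & 63, 0, 64, 1); }' \
    -c "cc ... /usr/src/sys/kern/vfs_cache.c"
```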
>>
>> We could modify the ifunc resolvers to use the current memcpy if you
>> don't care about bounds checks. I did a tiny amount of testing (mostly
>> clang and lld) in this configuration and it was slower than using the
>> snmalloc memcpy, but I didn't do enough benchmarking to be confident
>> of that. Hence the CFT.
>>
>> David
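For reference, the "rep movsb for large copies" path discussed in the
thread boils down to something like the following. This is a sketch with
a portable fallback so it compiles anywhere, not snmalloc's actual
implementation (which lives at the GitHub link above):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of a large-copy path in the spirit of the snmalloc memcpy:
 * on x86-64 a single rep movsb copies RCX bytes from [RSI] to [RDI],
 * and on ERMS/FSRM hardware the microcode picks the widest transfers
 * itself. Elsewhere we just defer to the libc memcpy.
 */
static void copy_large(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__)
    /* "+D"/"+S"/"+c" pin dst, src, n to RDI, RSI, RCX, which the
     * instruction both reads and updates. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}
```

On pre-ERMS parts a rep movsq loop over 8-byte words (with a small tail
copy) would take the place of the rep movsb, per Mateusz's point about
old CPUs.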