Re: CFT: snmalloc as libc malloc

From: David Chisnall <theraven_at_FreeBSD.org>
Date: Mon, 13 Feb 2023 12:48:03 UTC
Thanks, that's good feedback.

I didn't separate these originally because I didn't want a huge testing 
matrix, but I've now done so.

I have split snmalloc into two options in the latest commit in that branch:

  - WITH_SNMALLOC (defaults on) enables snmalloc

  - WITH_SNMALLOC_MEMCPY (defaults on if snmalloc is enabled)

  - WITH_MALLOC_PRODUCTION (I believe this is off by default for 
-CURRENT, on for other branches) sets both snmalloc and jemalloc into 
the recommended mode for release builds

It's good to compare snmalloc vs jemalloc in WITH_MALLOC_PRODUCTION 
modes.  I'd be interested in seeing snmalloc *without* the production 
knobs against jemalloc *with*, because if we have a bit of performance 
headroom we could use it for better security features.

I'd also like to see comparisons of WITH_SNMALLOC_MEMCPY with different 
settings of the LIBC_BOUNDS_CHECKS environment variable for any 
workloads that people care about.

David

On 09/02/2023 22:07, Mateusz Guzik wrote:
> I'm going to write a proper reply some time this week.
> 
> For now what matters is that this CFT mixes up different things:
> - snmalloc
> - memcpy as shipped with snmalloc
> - bound checks for said memcpy
> 
> Seeing as stock memcpy is deficient and the one provided by snmalloc
> is not a factor when deciding whether to switch to snmalloc, imo this
> should be reposted. I also note that at least on head stock jemalloc
> ships with debug, which again people run with by default unless they
> explicitly disabled it.
> 
> So a better CFT would include instructions:
> - how to make sure jemalloc does not have debug
> - how to build snmalloc without affecting memcpy (in fact, it would be
> best if the branch for now shipped *without* custom memcpy so that one
> does not have to check what hapepned)
> 
> frankly i don't know how to check the jemalloc thing, but I can spend
> some time to find out if that helps
> 
> On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
>> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> First and foremost, perhaps I should clear up that I have no opinion
>>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>>> thing.
>>>
>>> Inspecting assembly is what I intended to do, but got the compilation
>>> problem.
>>
>> If you just want to look at the memcpy, you might do better using the
>> upstream (CMake) build system, which builds a shim library that you can
>> disassemble and look at.
>>
>>> So, as someone who worked on memcpy previously, I note the variant
>>> currently implemented in libc is pessimal for sizes > 16 bytes because
>>> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>>
>> That’s one of the benefits of the C++ implementation.  We were able to try a
>> few dozen variations in a morning.  Writing a single one of those in
>> assembly would take a similar amount of time.
>>
>> We were inspired by the automemcpy paper, which demonstrated that you can
>> write memcpy implementations tuned to specific workloads in higher-level
>> languages and get better performance.
>>
>>> I also note CPUs are incredibly picky when it comes to starting point
>>> of the routine. The officialy recommended alignment of 16 bytes is
>>> basically a tradeoff between total binary size and performance. Should
>>> you move the routine at different 16 bytes intervals you will easily
>>> find big variation in performance, depending on how big of an
>>> alignment you ended up with.
>>
>> Yes, that’s what our test does.  It does a large number of different copies
>> with different sizes and alignments.
>>
>>> I tried to compile the benchmark but got bested by c++. I don't know
>>> the lang and I don't want to fight it.
>>>
>>> $ pwd
>>> /usr/src/contrib/snmalloc/src
>>> $ c++ -I. test/perf/memcpy/memcpy.cc
>>> [snip]
>>> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
>>> error: no template named 'is_integral_v' in namespace 'std'; did you
>>> mean 'is_integral'?
>>>       static_assert(std::is_integral_v<T>, "Type must be integral");
>>>                     ~~~~~^~~~~~~~~~~~~
>>>                          is_integral
>>
>> I think the only thing missing is -std=c++17.
>>
>>> I'm trying to say that if you end up with different funcs depending on
>>> bounds checking and you only align them to 16 bytes, the benchmark is
>>> most likely inaccurate if only for this reason.
>>
>> I’m not sure I follow this, sorry.
>>
>>>> The fastest on x86 is roughly:
>>>>
>>>> - A jump table of power for small sizes that do power-of-two-sized small
>>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>>
>>> Last I checked this was not true. The last slow thing to do was to
>>> branch on few sizes and handle them with overlapping stores. Roughly
>>> what memcpy in libc is doing now.
>>>
>>> Jump table aside, you got all sizes "spelled out", that is just
>>> avoidable impact on icache. While overlapping stores do come with some
>>> penalty, it is cheaper than the above combined.
>>
>> They’re not all spelled out, because a lot of them are subsets of the
>> others.  The compiler does a *very* good job of optimising this.  The
>> generated assembly was a lot more complex than anything I could write by
>> hand.
>>
>>> I don't have numbers nor bench code handy, but if you insist on
>>> contesting the above I'll be glad to provide them.
>>>
>>>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
>>>
>>> Depends on what you mean by medium and which simd instructions, but
>>> yes, they should be used here.
>>
>> This could probably do with more tuning, but all of this is configurable
>> with a couple of template parameters.  If it’s useful to have more variants
>> for different microarchitectures then it’s a trivial tweak to generate
>> them.
>>
>>>> - rep movsb for large copies.
>>>
>>> There are still old cpus here and there which don't support ERMS. They
>>> are very negatively impacted the above and should roll with rep stosq
>>> instead.
>>
>> We tested on some microarchitectures without FRMS but didn’t compare with
>> rep stosq as an alternative.  The rep movsb is inline assembly
>> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
>> so it would be quite easy to compare.  Very happy to take patches that make
>> things faster!
>>
>> PowerPC doesn’t have an equivalent of rep movsb, so the PowerPC version
>> doesn’t have an equivalent codepath.
>>
>> Each tuned version is a dozen lines of code (plus a lot of comments to
>> explain *why* it does things a particular way), so it’s very easy to add
>> different versions with different tuning.
>>
>>> I see the code takes some care to align the target buffer. That's
>>> good, but not necessary on CPUs with FSRM.
>>
>> It doesn’t hurt though, on the sizes where it’s actually beneficial to use
>> the rep sequence the cost of alignment is negligible.
>>
>>> All that said, I have hard time believing the impact of bounds
>>> checking on short copies is around 20% or so. The code to do it looks
>>> super slow (not that I know to do better).
>>
>> The last time I looked at the disassembly, the code for the bounds check
>> touched three cache lines and has two branches.  It fits well in a
>> superscalar pipeline so adds a handful of cycles.  This is basically in the
>> noise for large copies but is double-digit percentage overhead for small
>> ones.
>>
>>> People like to claim short sizes are inlined by the compiler, but
>>> that's only true if the size is known at compilation time, which it
>>> often is not. Then they claim they are rare.
>>
>> I agree.
>>
>>> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
>>>            value  ------------- Distribution ------------- count
>>>               -1 |                                         0
>>>                0 |@                                        19758
>>>                1 |@@@@@@@@                                 218420
>>>                2 |@@                                       67670
>>>                4 |@@@@                                     103914
>>>                8 |@@@@@@@@@@@                              301157
>>>               16 |@@@@@@@@@@                               293812
>>>               32 |@@                                       57954
>>>               64 |@                                        23818
>>>              128 |                                         13344
>>>              256 |@                                        18507
>>>              512 |                                         6342
>>>             1024 |                                         1710
>>>             2048 |                                         627
>>>             4096 |                                         398
>>>             8192 |                                         34
>>>            16384 |                                         10
>>>            32768 |                                         6
>>>            65536 |                                         7
>>>           131072 |                                         4
>>>           262144 |                                         1
>>>           524288 |                                         1
>>>          1048576 |                                         0
>>>
>>> obtained with this bad boy:
>>> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
>>> -target x86_64-unknown-freebsd14.0
>>> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
>>> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
>>> -fno-strict-aliasing  -g -nostdinc  -I. -I/usr/src/sys
>>> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
>>> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
>>> -fno-common    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
>>> -MD  -MF.depend.vfs_cache.o -MTvfs_cache.o
>>> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
>>> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
>>> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
>>> -mno-red-zone -mno-mmx -mno-sse -msoft-float
>>> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
>>> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
>>> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
>>> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
>>> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
>>> -Wno-error=tautological-compare -Wno-error=empty-body
>>> -Wno-error=parentheses-equality -Wno-error=unused-function
>>> -Wno-error=pointer-sign -Wno-error=shift-negative-value
>>> -Wno-address-of-packed-member -Wno-error=array-parameter
>>> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
>>> -Wno-error=unused-but-set-variable -Wno-format-zero-length   -mno-aes
>>> -mno-avx  -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c”
>>
>> That’s a really helpful datapoint, thanks! That distribution matches roughly
>> what we’ve seen - the number of zero-byte memcpy calls surprised me the
>> first time I saw it, but the automemcpy paper reported something similar.
>>
>> If you can also extract the alignments from these then it would be fairly
>> easy to modify the benchmark program to use those distributions.
>>
>> My original version extended the FreeBSD assembly memcpy with a call to a
>> function that did the bounds checks, but that had *much* worse performance.
>> We could insert assembly to do the bounds checks, but that’s hard to update
>> if we change the metadata layout in malloc.
>>
>> We could modify the ifunc resolvers to use the current memcpy if you don’t
>> care about bounds checks.  I did a tiny amount of testing (mostly clang and
>> lld) in this configuration and it was slower than using the snmalloc memcpy,
>> but I didn’t do enough benchmarking to be confident of that.  Hence the
>> CFT.
>>
>> David
>>
>>
> 
>