Re: CFT: snmalloc as libc malloc

From: David Chisnall <theraven_at_FreeBSD.org>
Date: Thu, 09 Feb 2023 21:38:40 UTC
On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
> First and foremost, perhaps I should clear up that I have no opinion
> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
> thing.
> 
> Inspecting assembly is what I intended to do, but got the compilation problem.

If you just want to look at the memcpy, you might do better using the upstream (CMake) build system, which builds a shim library that you can disassemble and look at.

> So, as someone who worked on memcpy previously, I note the variant
> currently implemented in libc is pessimal for sizes > 16 bytes because
> it does not use SIMD. I do have plans to rectify that, but ENOTIME.

That’s one of the benefits of the C++ implementation.  We were able to try a few dozen variations in a morning.  Writing a single one of those in assembly would take a similar amount of time.

We were inspired by the automemcpy paper, which demonstrated that you can write memcpy implementations tuned to specific workloads in higher-level languages and get better performance.

> I also note CPUs are incredibly picky when it comes to starting point
> of the routine. The officialy recommended alignment of 16 bytes is
> basically a tradeoff between total binary size and performance. Should
> you move the routine at different 16 bytes intervals you will easily
> find big variation in performance, depending on how big of an
> alignment you ended up with.

Yes, that’s what our test does.  It does a large number of different copies with different sizes and alignments.  

> I tried to compile the benchmark but got bested by c++. I don't know
> the lang and I don't want to fight it.
> 
> $ pwd
> /usr/src/contrib/snmalloc/src
> $ c++ -I. test/perf/memcpy/memcpy.cc
> [snip]
> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
> error: no template named 'is_integral_v' in namespace 'std'; did you
> mean 'is_integral'?
>      static_assert(std::is_integral_v<T>, "Type must be integral");
>                    ~~~~~^~~~~~~~~~~~~
>                         is_integral

I think the only thing missing is -std=c++17.

> I'm trying to say that if you end up with different funcs depending on
> bounds checking and you only align them to 16 bytes, the benchmark is
> most likely inaccurate if only for this reason.

I’m not sure I follow this, sorry.

>> The fastest on x86 is roughly:
>> 
>> - A jump table of power for small sizes that do power-of-two-sized small
>> copies (double-word, word, half-word, and byte) to perform the copy.
> 
> Last I checked this was not true. The last slow thing to do was to
> branch on few sizes and handle them with overlapping stores. Roughly
> what memcpy in libc is doing now.
> 
> Jump table aside, you got all sizes "spelled out", that is just
> avoidable impact on icache. While overlapping stores do come with some
> penalty, it is cheaper than the above combined.

They’re not all spelled out, because a lot of them are subsets of the others.  The compiler does a *very* good job of optimising this.  The generated assembly was a lot more complex than anything I could write by hand.

> I don't have numbers nor bench code handy, but if you insist on
> contesting the above I'll be glad to provide them.
> 
>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
> 
> Depends on what you mean by medium and which simd instructions, but
> yes, they should be used here.

This could probably do with more tuning, but all of this is configurable with a couple of template parameters.  If it’s useful to have more variants for different microarchitectures then it’s a trivial tweak to generate them.

>> - rep movsb for large copies.
> 
> There are still old cpus here and there which don't support ERMS. They
> are very negatively impacted the above and should roll with rep stosq
> instead.

We tested on some microarchitectures without FRMS but didn’t compare with rep stosq as an alternative.  The rep movsb is inline assembly (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351) so it would be quite easy to compare.  Very happy to take patches that make things faster!

PowerPC doesn’t have an equivalent of rep movsb, so the PowerPC version doesn’t have an equivalent codepath.

Each tuned version is a dozen lines of code (plus a lot of comments to explain *why* it does things a particular way), so it’s very easy to add different versions with different tuning.  

> I see the code takes some care to align the target buffer. That's
> good, but not necessary on CPUs with FSRM.

It doesn’t hurt though, on the sizes where it’s actually beneficial to use the rep sequence the cost of alignment is negligible.

> All that said, I have hard time believing the impact of bounds
> checking on short copies is around 20% or so. The code to do it looks
> super slow (not that I know to do better).

The last time I looked at the disassembly, the code for the bounds check touched three cache lines and has two branches.  It fits well in a superscalar pipeline so adds a handful of cycles.  This is basically in the noise for large copies but is double-digit percentage overhead for small ones.

> People like to claim short sizes are inlined by the compiler, but
> that's only true if the size is known at compilation time, which it
> often is not. Then they claim they are rare.

I agree.

> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
>           value  ------------- Distribution ------------- count
>              -1 |                                         0
>               0 |@                                        19758
>               1 |@@@@@@@@                                 218420
>               2 |@@                                       67670
>               4 |@@@@                                     103914
>               8 |@@@@@@@@@@@                              301157
>              16 |@@@@@@@@@@                               293812
>              32 |@@                                       57954
>              64 |@                                        23818
>             128 |                                         13344
>             256 |@                                        18507
>             512 |                                         6342
>            1024 |                                         1710
>            2048 |                                         627
>            4096 |                                         398
>            8192 |                                         34
>           16384 |                                         10
>           32768 |                                         6
>           65536 |                                         7
>          131072 |                                         4
>          262144 |                                         1
>          524288 |                                         1
>         1048576 |                                         0
> 
> obtained with this bad boy:
> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
> -target x86_64-unknown-freebsd14.0
> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
> -fno-strict-aliasing  -g -nostdinc  -I. -I/usr/src/sys
> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
> -fno-common    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
> -MD  -MF.depend.vfs_cache.o -MTvfs_cache.o
> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
> -mno-red-zone -mno-mmx -mno-sse -msoft-float
> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
> -Wno-error=tautological-compare -Wno-error=empty-body
> -Wno-error=parentheses-equality -Wno-error=unused-function
> -Wno-error=pointer-sign -Wno-error=shift-negative-value
> -Wno-address-of-packed-member -Wno-error=array-parameter
> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
> -Wno-error=unused-but-set-variable -Wno-format-zero-length   -mno-aes
> -mno-avx  -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c”

That’s a really helpful datapoint, thanks! That distribution matches roughly what we’ve seen - the number of zero-byte memcpy calls surprised me the first time I saw it, but the automemcpy paper reported something similar.

If you can also extract the alignments from these then it would be fairly easy to modify the benchmark program to use those distributions.

My original version extended the FreeBSD assembly memcpy with a call to a function that did the bounds checks, but that had *much* worse performance.  We could insert assembly to do the bounds checks, but that’s hard to update if we change the metadata layout in malloc.

We could modify the ifunc resolvers to use the current memcpy if you don’t care about bounds checks.  I did a tiny amount of testing (mostly clang and lld) in this configuration and it was slower than using the snmalloc memcpy, but I didn’t do enough benchmarking to be confident of that.  Hence the CFT.

David