Re: CFT: snmalloc as libc malloc

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Thu, 09 Feb 2023 20:53:34 UTC
On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
> On 9 Feb 2023, at 19:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> it fails to build for me:
>>
>> /usr/src/lib/libc/stdlib/snmalloc/malloc.cc:35:10: fatal error:
>> 'override/jemalloc_compat.cc' file not found
>> #include "override/jemalloc_compat.cc"
>>         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> 1 error generated.
>> --- malloc.o ---
>> *** [malloc.o] Error code 1
>>
>> make[4]: stopped in /usr/src/lib/libc
>> /usr/src/lib/libc/stdlib/snmalloc/memcpy.cc:25:10: fatal error:
>> 'global/memcpy.h' file not found
>> #include <global/memcpy.h>
>>         ^~~~~~~~~~~~~~~~~
>> 1 error generated.
>> --- memcpy.o ---
>> *** [memcpy.o] Error code 1
>
> This looks as if you haven’t got the submodule?  Is there anything in
> contrib/snmalloc?
>

indeed, a pilot error

>> anyway, I wanted to say I find the memcpy thing incredibly suspicious.
>> I found one article in
>> https://github.com/microsoft/snmalloc/blob/main/docs/security/GuardedMemcpy.md
>> which benches it and that made it even more suspicious. How did the
>> benched memcpy look like inside?
>
> Perhaps you could share what you are suspicious about?  I don’t really know
> how to respond to something so vague.  The document you linked to has the
> benchmark that we used (though the graphs in it appear to be based on an
> older version of the memcpy).  The PR that added PowerPC tuning has some
> additional graphs of measurements.
>
> If you compile the memcpy file, you can see the assembly.  The C++ provides
> a set of building blocks for producing efficient memcpy implementations.

First and foremost, perhaps I should clear up that I have no opinion
on snmalloc or on it replacing jemalloc. I'm only here about the
memcpy thing.

Inspecting the assembly is what I intended to do, but I ran into the
compilation problem above.

So, as someone who has worked on memcpy before, I note the variant
currently implemented in libc is pessimal for sizes > 16 bytes because
it does not use SIMD. I do have plans to rectify that, but ENOTIME.

I also note CPUs are incredibly picky about the starting address of
the routine. The officially recommended alignment of 16 bytes is
basically a tradeoff between total binary size and performance. If you
move the routine around at different 16-byte offsets, you will easily
find big variation in performance, depending on what effective
alignment it ended up with.
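To illustrate the placement point (this sketch and its names are mine, not anything from libc): GCC and Clang let you force a larger entry alignment per function, and toolchains also have flags like -falign-functions=32 for doing it globally. Moving the very same code to a different 16-byte slot can measurably change its throughput, which is what makes benchmarking two differently-sized variants tricky.

```c
#include <stddef.h>

/* Hypothetical example: pin the routine's entry to a 32-byte boundary.
 * The body is a naive byte loop; the placement, not the body, is the
 * point here. */
__attribute__((aligned(32)))
void copy_bytes(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n--)
        *d++ = *s++;
}
```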

I tried to compile the benchmark but got bested by C++. I don't know
the language and I don't want to fight it.

$ pwd
/usr/src/contrib/snmalloc/src
$ c++ -I. test/perf/memcpy/memcpy.cc
[snip]
./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
error: no template named 'is_integral_v' in namespace 'std'; did you
mean 'is_integral'?
      static_assert(std::is_integral_v<T>, "Type must be integral");
                    ~~~~~^~~~~~~~~~~~~
                         is_integral

and tons of other errors. I did buildworld + installworld.

What I'm trying to say is that if you end up with different functions
depending on whether bounds checking is compiled in, and you only
align them to 16 bytes, the benchmark is most likely inaccurate for
this reason alone.

> The fastest on x86 is roughly:
>
>  - A jump table of power for small sizes that do power-of-two-sized small
> copies (double-word, word, half-word, and byte) to perform the copy.

Last I checked this was not true. The least slow thing to do was to
branch on a few size classes and handle each with overlapping stores,
roughly what the memcpy in libc is doing now.

Jump table aside, you get all the sizes "spelled out", which is just
avoidable icache impact. While overlapping stores do come with some
penalty, it is cheaper than the above combined.
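To make the overlapping-stores point concrete, here is a sketch of mine (not the libc code): load the head and the tail of the region into registers, then store both, letting the two stores overlap for the in-between sizes. The fixed-size memcpy calls are just the portable idiom for unaligned loads/stores; compilers lower them to single moves.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: copy 1..16 bytes with one branch ladder and (at most) two
 * possibly-overlapping loads and stores per size class. */
void copy_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 8) {
        uint64_t a, b;
        memcpy(&a, s, 8);           /* first 8 bytes */
        memcpy(&b, s + n - 8, 8);   /* last 8 bytes; may overlap head */
        memcpy(d, &a, 8);
        memcpy(d + n - 8, &b, 8);
    } else if (n >= 4) {
        uint32_t a, b;
        memcpy(&a, s, 4);
        memcpy(&b, s + n - 4, 4);
        memcpy(d, &a, 4);
        memcpy(d + n - 4, &b, 4);
    } else if (n >= 1) {
        d[0] = s[0];
        d[n - 1] = s[n - 1];
        if (n == 3)
            d[1] = s[1];
    }
}
```

For n between 9 and 15 the two 8-byte stores overlap in the middle; the redundant work is cheaper than spelling out every size behind a jump table.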

I don't have numbers nor bench code handy, but if you insist on
contesting the above I'll be glad to provide them.

>  - A vectorised copy for medium-sized copies using a loop of SSE copies.

Depends on what you mean by medium and which SIMD instructions, but
yes, they should be used here.
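For the medium range, a minimal SSE2 sketch (mine, assuming x86-64 where SSE2 is baseline) is just a loop of unaligned 16-byte loads and stores:

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sketch: bulk copy with 16-byte SSE2 moves. For brevity this assumes
 * n is a multiple of 16; a real routine handles the tail separately
 * (e.g. with an overlapping final vector). */
void copy_sse(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
}
```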

>  - rep movsb for large copies.

There are still old CPUs here and there which don't support ERMS. They
are very negatively impacted by the above and should roll with rep
movsq instead.

I see the code takes some care to align the target buffer. That's
good, but it is not necessary on CPUs with FSRM.
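For reference, the large-copy path being discussed is literally one instruction; here is a hedged x86-64 GCC/Clang inline-asm sketch of mine. Whether it is fast depends on the CPU advertising ERMS (CPUID leaf 7, EBX bit 9) and, for short copies without alignment fixups, FSRM (leaf 7, EDX bit 4).

```c
#include <stddef.h>

/* Sketch: rep movsb with the dst/src/count registers the instruction
 * requires (RDI/RSI/RCX). Worth using for large copies only on CPUs
 * with ERMS; pre-ERMS parts want rep movsq or a SIMD loop instead. */
void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}
```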

All that said, I have a hard time believing the impact of bounds
checking on short copies is around 20% or so. The code to do it looks
super slow (not that I know how to do better).

People like to claim short sizes are inlined by the compiler, but
that's only true if the size is known at compilation time, which it
often is not. Then they claim such calls are rare.
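To see the inlining distinction in action (hypothetical functions of mine; compile with -O2 and inspect the assembly):

```c
#include <stddef.h>
#include <string.h>

struct pt { int x, y; };

/* Size known at compile time: compilers typically lower this to a
 * couple of plain moves, no call to memcpy at all. */
void copy_point(struct pt *d, const struct pt *s)
{
    memcpy(d, s, sizeof(*s));
}

/* Size known only at runtime: unless the compiler can prove n, this
 * stays a call to the out-of-line memcpy, so the speed of the libc
 * routine is what matters. */
void copy_buf(void *d, const void *s, size_t n)
{
    memcpy(d, s, n);
}
```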

To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@                                        19758
               1 |@@@@@@@@                                 218420
               2 |@@                                       67670
               4 |@@@@                                     103914
               8 |@@@@@@@@@@@                              301157
              16 |@@@@@@@@@@                               293812
              32 |@@                                       57954
              64 |@                                        23818
             128 |                                         13344
             256 |@                                        18507
             512 |                                         6342
            1024 |                                         1710
            2048 |                                         627
            4096 |                                         398
            8192 |                                         34
           16384 |                                         10
           32768 |                                         6
           65536 |                                         7
          131072 |                                         4
          262144 |                                         1
          524288 |                                         1
         1048576 |                                         0

obtained with this bad boy:
dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
-target x86_64-unknown-freebsd14.0
--sysroot=/usr/obj/usr/src/amd64.amd64/tmp
-B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
-fno-strict-aliasing  -g -nostdinc  -I. -I/usr/src/sys
-I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
-D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
-fno-common    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
-MD  -MF.depend.vfs_cache.o -MTvfs_cache.o
-fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
-fdebug-prefix-map=./x86=/usr/src/sys/x86/include
-fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
-mno-red-zone -mno-mmx -mno-sse -msoft-float
-fno-asynchronous-unwind-tables -ffreestanding -fwrapv
-fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
-Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
-Wno-pointer-sign -D__printf__=__freebsd_kprintf__
-Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
-Wno-error=tautological-compare -Wno-error=empty-body
-Wno-error=parentheses-equality -Wno-error=unused-function
-Wno-error=pointer-sign -Wno-error=shift-negative-value
-Wno-address-of-packed-member -Wno-error=array-parameter
-Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
-Wno-error=unused-but-set-variable -Wno-format-zero-length   -mno-aes
-mno-avx  -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"

tl;dr if this goes in, the fate of the memcpy thing will need to be
handled separately
-- 
Mateusz Guzik <mjguzik gmail.com>