Re: CFT: snmalloc as libc malloc

Reply: David Chisnall : "Re: CFT: snmalloc as libc malloc"
In reply to: David Chisnall : "Re: CFT: snmalloc as libc malloc"
Go to: [ bottom of page ] [ top of archives ] [ this month ]
From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Sun, 05 Mar 2023 16:47:58 UTC
Apologies for significant delay, got other stuff and this fell through
the cracks.

I'm going to start with snmalloc remarks, memcpy-related response is down below.

I find it quite weird there are absolutely no mentions of NUMA in a
general-purpose allocator made in recent years. While nothing can be
done for an application which does not use any affinity, what *can* be
done is not negatively affecting programs which *do* use it. Most
notably making sure that allocations backed by one page don't land in
threads from distinct domains. It is unclear to me what, if anything,
is the allocator doing on this front, but I also don't know if this
would be a regression vs jemalloc.

One major issue I found is the use of hand-rolled spinlocks -- while
they may be fine for specialized uses, they are a non-starter for
general purpose. This is the age-old problem of priority problems
where you can end up yield()ing all you want, while the lock owner
nowhere near CPU or the scheduler does not think you should give way
to it. The least bad thing to do would be to add umtx support to
remedy this problem, but that's a lot of work and it is probably not
warranted. Then the fallback is pthread locks -- while pessimal, they
don't suffer the problem. The lock operation not being in the fast
path is not a factor for this problem.

I tried to do a sensible benchmark of the thing, but ran into snags to
sort out first. The idea was to 'tinderbox' it, except:
1. make's use of poll is a massive scalability bottleneck, addressed
here in an experimental manner:
https://github.com/macdice/freebsd/tree/bmake-shmem
2. with that out of the way there is still massive contention in the
vm subsystem. I had a patch for most of it here:
https://reviews.freebsd.org/D36038 . it rebases cleanly but no longer
works, I don't know why yet

There is other issues.

Ultimately if snmalloc is to ship, it would be nice ot have it for 14
and we are nearing the time where it will be too late, so I'll try to
sort it all out next week.

Meanwhile there is the lock issue to sort out.

On to memcpy

On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> First and foremost, perhaps I should clear up that I have no opinion
>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>> thing.
>>
>> Inspecting assembly is what I intended to do, but got the compilation
>> problem.
>
> If you just want to look at the memcpy, you might do better using the
> upstream (CMake) build system, which builds a shim library that you can
> disassemble and look at.
>
>> So, as someone who worked on memcpy previously, I note the variant
>> currently implemented in libc is pessimal for sizes > 16 bytes because
>> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>
> That’s one of the benefits of the C++ implementation.  We were able to try a
> few dozen variations in a morning.  Writing a single one of those in
> assembly would take a similar amount of time.
>
> We were inspired by the automemcpy paper, which demonstrated that you can
> write memcpy implementations tuned to specific workloads in higher-level
> languages and get better performance.
>
>> I also note CPUs are incredibly picky when it comes to starting point
>> of the routine. The officialy recommended alignment of 16 bytes is
>> basically a tradeoff between total binary size and performance. Should
>> you move the routine at different 16 bytes intervals you will easily
>> find big variation in performance, depending on how big of an
>> alignment you ended up with.
>
> Yes, that’s what our test does.  It does a large number of different copies
> with different sizes and alignments.
>

I'm not talking about placement of buffers passed to memcpy, I'm
talking about placement of *code implementing memcpy*.

>> I'm trying to say that if you end up with different funcs depending on
>> bounds checking and you only align them to 16 bytes, the benchmark is
>> most likely inaccurate if only for this reason.
>
> I’m not sure I follow this, sorry.
>

If the address of the *func itself* is divisible by 16 but not 32, it
may end up with a different performance profile depending on uarch.
And the code may randomly move up and down as you change the source,
thus stabilizing it at some value (say 32) would be the best thing to
do.

>>> The fastest on x86 is roughly:
>>>
>>> - A jump table of power for small sizes that do power-of-two-sized small
>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>
>> Last I checked this was not true. The last slow thing to do was to
>> branch on few sizes and handle them with overlapping stores. Roughly
>> what memcpy in libc is doing now.
>>
>> Jump table aside, you got all sizes "spelled out", that is just
>> avoidable impact on icache. While overlapping stores do come with some
>> penalty, it is cheaper than the above combined.
>
> They’re not all spelled out, because a lot of them are subsets of the
> others.  The compiler does a *very* good job of optimising this.  The
> generated assembly was a lot more complex than anything I could write by
> hand.
>

I said they are spelled out in the code which ends up being generated,
it looks like this:
[snip]
mov    (%rsi),%rax
mov    %rax,(%rdi)
pop    %rbp
ret
mov    (%rsi),%rax
mov    %rax,(%rdi)
movzbl 0x8(%rsi),%eax
mov    %al,0x8(%rdi)
pop    %rbp
ret
mov    (%rsi),%rax
mov    %rax,(%rdi)
movzwl 0x8(%rsi),%eax
mov    %ax,0x8(%rdi)
pop    %rbp
ret
[/snip]

That section is *massive* and it is pessimal, once more the thing to
do is to do a few branches to land in sections handling stuff with
overlapping stores, akin to what bionic is doing here:
https://android.googlesource.com/platform/bionic/+/refs/heads/master/libc/arch-x86_64/string/sse2-memmove-slm.S

>> I don't have numbers nor bench code handy, but if you insist on
>> contesting the above I'll be glad to provide them.
>>
>>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
>>
>> Depends on what you mean by medium and which simd instructions, but
>> yes, they should be used here.
>
> This could probably do with more tuning, but all of this is configurable
> with a couple of template parameters.  If it’s useful to have more variants
> for different microarchitectures then it’s a trivial tweak to generate
> them.
>
>>> - rep movsb for large copies.
>>
>> There are still old cpus here and there which don't support ERMS. They
>> are very negatively impacted the above and should roll with rep stosq
>> instead.
>
> We tested on some microarchitectures without FRMS but didn’t compare with
> rep stosq as an alternative.  The rep movsb is inline assembly
> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
> so it would be quite easy to compare.  Very happy to take patches that make
> things faster!
>

> Each tuned version is a dozen lines of code (plus a lot of comments to
> explain *why* it does things a particular way), so it’s very easy to add
> different versions with different tuning.
>
>> I see the code takes some care to align the target buffer. That's
>> good, but not necessary on CPUs with FSRM.
>
> It doesn’t hurt though, on the sizes where it’s actually beneficial to use
> the rep sequence the cost of alignment is negligible.
>
>> All that said, I have hard time believing the impact of bounds
>> checking on short copies is around 20% or so. The code to do it looks
>> super slow (not that I know to do better).
>
> The last time I looked at the disassembly, the code for the bounds check
> touched three cache lines and has two branches.  It fits well in a
> superscalar pipeline so adds a handful of cycles.  This is basically in the
> noise for large copies but is double-digit percentage overhead for small
> ones.
>

I'll be blunt here: that's misleading at best and intellectually
dishonest at worst.

By your own admission very small copies are most common and the very
paper you refer to (automemcpy) gives a figure of 96% real-world
copies being 128 bytes of less. These are also the copies which are
most affected.

As I tried to explain super short copies have very little to do, and
the impact stated above can't be an accurate estimation on top of an
~optimal routine.

Your benchmark code artificially pessimizes comparison to whatever is
in libc because calls go through PLT, compared to calls to your own
routine which don't. The difference does matter for short copies.

I ran a quick check against glibc as shipped with Ubuntu 22. Your
routine was massively slower for sizes < 12 and then started evening
up with glibc, after that it was going back and forth due to what's
likely just measurement error.

> My original version extended the FreeBSD assembly memcpy with a call to a
> function that did the bounds checks, but that had *much* worse performance.
> We could insert assembly to do the bounds checks, but that’s hard to update
> if we change the metadata layout in malloc.
>

I'm not saying the variant as shipped now is any good or should be
used, I'm saying overlapping stores need to be used as opposed to a
jump table. The current code should be whacked.

Even then of course it had to get slower -- you added a func call in there.

> We could modify the ifunc resolvers to use the current memcpy if you don’t
> care about bounds checks.  I did a tiny amount of testing (mostly clang and
> lld) in this configuration and it was slower than using the snmalloc memcpy,
> but I didn’t do enough benchmarking to be confident of that.  Hence the
> CFT.
>

Once more see previous remarks of the func currently implemented not being fast.

The thing *to do* in the area is to implement sensible memcpy, it may
be the automemcpy paper would be good basis to do it.

-- 
Mateusz Guzik <mjguzik gmail.com>