Re: CFT: snmalloc as libc malloc

From: David Chisnall <theraven_at_FreeBSD.org>
Date: Sun, 05 Mar 2023 17:27:16 UTC
Hi,

Thanks for the comments.  

The FlagLock is used in two places:

 - Early init, before anything in libc can be guaranteed to exist.
 - Some *very* slow paths, where contention is incredibly improbable.

We could replace them with _umtx_op / futex locks, but I doubt you’d see a difference.  Do you have any workloads where locks introduce contention?  The locks are all abstracted away in a single place, so it would be fairly simple to add a couple of lock functions to the PAL.  Probably a 10-line change in total, but since it hasn’t shown up in any benchmarking that we’ve done yet it hasn’t been a priority.
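
For reference, such a lock could look roughly like the following on FreeBSD (an untested sketch built directly on _umtx_op; the type and its name are mine, not snmalloc's actual PAL interface):

  #include <sys/types.h>
  #include <sys/umtx.h>

  #include <atomic>
  #include <cstdint>

  // Futex-style lock: 0 = unlocked, 1 = locked, 2 = locked with waiters.
  // unlock() only makes a syscall when someone may be sleeping.
  struct UmtxLock {
    std::atomic<uint32_t> word{0};

    void lock() {
      uint32_t c = 0;
      // Fast path: uncontended 0 -> 1.
      if (word.compare_exchange_strong(c, 1, std::memory_order_acquire))
        return;
      // Slow path: mark the lock contended, then sleep until we take it.
      if (c != 2)
        c = word.exchange(2, std::memory_order_acquire);
      while (c != 0) {
        // Sleeps only while the word still reads 2, so wakeups are not lost.
        _umtx_op(&word, UMTX_OP_WAIT_UINT_PRIVATE, 2, nullptr, nullptr);
        c = word.exchange(2, std::memory_order_acquire);
      }
    }

    void unlock() {
      // Only pay for the wake syscall if a waiter may exist.
      if (word.exchange(0, std::memory_order_release) == 2)
        _umtx_op(&word, UMTX_OP_WAKE_PRIVATE, 1, nullptr, nullptr);
    }
  };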

With respect to NUMA, if a thread is pinned to a core and the kernel is handing it local memory, then you should see good performance initially.  All allocations are satisfied from a pool that is local to the core.  If objects are freed on other threads, they are batched and returned to the allocating thread.  It is possible for chunks that were originally created on one core to be recycled and used on another core.  This happens at >page granularity and so is better handled in the kernel.
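
To illustrate the batching in the previous paragraph, conceptually it is something like this (purely an illustration of the idea, not snmalloc's actual data structures or naming):

  #include <atomic>
  #include <cstddef>

  // Frees from another thread go into a small thread-local batch; only when
  // the batch fills up is it pushed, with a single CAS, onto the owning
  // allocator's queue, which the owner drains on its next slow path.
  struct RemoteQueue {
    std::atomic<void*> head{nullptr};
  };

  struct FreeBatch {
    void* objects[64];
    std::size_t count = 0;
  };

  inline void remote_free(RemoteQueue& owner, FreeBatch& batch, void* p) {
    batch.objects[batch.count++] = p;
    if (batch.count < 64)
      return;                             // common case: stays thread-local
    // Link the batch through the freed objects themselves...
    for (std::size_t i = 0; i + 1 < 64; i++)
      *static_cast<void**>(batch.objects[i]) = batch.objects[i + 1];
    // ...then publish the whole list to the owner at once.
    void* old = owner.head.load(std::memory_order_relaxed);
    do {
      *static_cast<void**>(batch.objects[63]) = old;
    } while (!owner.head.compare_exchange_weak(old, batch.objects[0],
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
    batch.count = 0;
  }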

If you have a benchmark for memcpy that shows a big slowdown with the bounds checking, I would like to see it so that we can use it to optimise the implementation.  12-byte memcpys are common, but they also fit well in the store queues and overlap with the bounds checks.  I’d be *very* interested in workloads where this is not the case.
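
If it helps, the shape of harness I would want to see is roughly the following (a sketch; the interesting part is feeding it sizes and alignments drawn from the real workload rather than the fixed ones here):

  #include <chrono>
  #include <cstddef>
  #include <cstdio>
  #include <cstring>

  // Calls memcpy through a volatile function pointer so the compiler can
  // neither inline the copy nor hoist it out of the loop, then reports
  // nanoseconds per copy for a handful of small sizes.
  int main() {
    static unsigned char src[4096], dst[4096];
    void* (*volatile copy)(void*, const void*, std::size_t) = std::memcpy;
    const std::size_t sizes[] = {8, 12, 16, 32, 64, 128};
    const long iterations = 20000000;
    for (std::size_t size : sizes) {
      auto start = std::chrono::steady_clock::now();
      for (long i = 0; i < iterations; i++)
        copy(dst + (i & 7), src + (i & 15), size);  // vary alignment a bit
      auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    std::chrono::steady_clock::now() - start).count();
      std::printf("%zu bytes: %.2f ns/copy\n", size, (double)ns / iterations);
    }
    return 0;
  }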

I broke my shoulder a week ago and so am typing very slowly with one hand, but patches are very welcome.  There is a PR under review at the moment to add memmove, which could benefit from some more tuning before it’s merged.

David


> On 5 Mar 2023, at 16:47, Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
> Apologies for significant delay, got other stuff and this fell through
> the cracks.
> 
> I'm going to start with snmalloc remarks, memcpy-related response is down below.
> 
> I find it quite weird that there is absolutely no mention of NUMA in a
> general-purpose allocator written in recent years. While nothing can
> be done for an application which does not use any affinity, what *can*
> be done is to avoid negatively affecting programs which *do* use it,
> most notably by making sure that allocations backed by one page don't
> land in threads from distinct domains. It is unclear to me what, if
> anything, the allocator does on this front, but I also don't know
> whether this would be a regression vs jemalloc.
> 
> One major issue I found is the use of hand-rolled spinlocks -- while
> they may be fine for specialized uses, they are a non-starter for
> general-purpose code. This is the age-old priority problem where you
> can end up yield()ing all you want while the lock owner is nowhere
> near a CPU, or the scheduler does not think you should give way to it.
> The least bad thing to do would be to add umtx support to remedy this,
> but that's a lot of work and it is probably not warranted. The
> fallback then is pthread locks -- while pessimal, they don't suffer
> from the problem. The lock operation not being on the fast path does
> not make this problem go away.
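> 
> To spell out the pattern I mean (a sketch of the general shape, not the
> actual snmalloc code):
> 
>   #include <atomic>
>   #include <sched.h>
> 
>   std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
> 
>   void spin_lock() {
>     // If the owner is preempted or deprioritized, this can spin for a
>     // very long time; sched_yield() gives the scheduler no hint about
>     // which thread has to run for progress to be made.
>     while (lock_flag.test_and_set(std::memory_order_acquire))
>       sched_yield();
>   }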
> 
> I tried to do a sensible benchmark of the thing, but ran into snags to
> sort out first. The idea was to 'tinderbox' it, except:
> 1. make's use of poll is a massive scalability bottleneck, addressed
> here in an experimental manner:
> https://github.com/macdice/freebsd/tree/bmake-shmem
> 2. with that out of the way there is still massive contention in the
> vm subsystem. I had a patch for most of it here:
> https://reviews.freebsd.org/D36038 . It rebases cleanly but no longer
> works; I don't know why yet.
> 
> There are other issues.
> 
> Ultimately, if snmalloc is to ship, it would be nice to have it for 14,
> and we are nearing the time when it will be too late, so I'll try to
> sort it all out next week.
> 
> Meanwhile there is the lock issue to sort out.
> 
> On to memcpy
> 
> On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
>> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> 
>>> First and foremost, perhaps I should clear up that I have no opinion
>>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>>> thing.
>>> 
>>> Inspecting assembly is what I intended to do, but got the compilation
>>> problem.
>> 
>> If you just want to look at the memcpy, you might do better using the
>> upstream (CMake) build system, which builds a shim library that you can
>> disassemble and look at.
>> 
>>> So, as someone who worked on memcpy previously, I note the variant
>>> currently implemented in libc is pessimal for sizes > 16 bytes because
>>> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>> 
>> That’s one of the benefits of the C++ implementation.  We were able to try a
>> few dozen variations in a morning.  Writing a single one of those in
>> assembly would take a similar amount of time.
>> 
>> We were inspired by the automemcpy paper, which demonstrated that you can
>> write memcpy implementations tuned to specific workloads in higher-level
>> languages and get better performance.
>> 
>>> I also note CPUs are incredibly picky when it comes to the starting
>>> point of the routine. The officially recommended alignment of 16
>>> bytes is basically a tradeoff between total binary size and
>>> performance. Should you move the routine around at different 16-byte
>>> intervals, you will easily find big variation in performance,
>>> depending on how big an alignment you ended up with.
>> 
>> Yes, that’s what our test does.  It does a large number of different copies
>> with different sizes and alignments.
>> 
> 
> I'm not talking about placement of buffers passed to memcpy, I'm
> talking about placement of *code implementing memcpy*.
> 
>>> I'm trying to say that if you end up with different funcs depending on
>>> bounds checking and you only align them to 16 bytes, the benchmark is
>>> most likely inaccurate if only for this reason.
>> 
>> I’m not sure I follow this, sorry.
>> 
> 
> If the address of the *func itself* is divisible by 16 but not 32, it
> may end up with a different performance profile depending on uarch.
> And the code may randomly move up and down as you change the source,
> so stabilizing it at some value (say 32) would be the best thing to
> do.
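> 
> For example (a sketch; the attribute spelling is GCC/Clang and the
> function name is just a placeholder):
> 
>   #include <stddef.h>
> 
>   // Pin the entry point so that code churn elsewhere cannot change its
>   // alignment; alternatively, build the file with -falign-functions=32.
>   __attribute__((aligned(32)))
>   void *my_memcpy(void *dst, const void *src, size_t n);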
> 
>>>> The fastest on x86 is roughly:
>>>> 
>>>> - A jump table for small sizes that does power-of-two-sized small
>>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>> 
>>> Last I checked this was not true. The least slow thing to do was to
>>> branch on a few sizes and handle them with overlapping stores --
>>> roughly what memcpy in libc is doing now.
>>> 
>>> Jump table aside, you get all sizes "spelled out", which is just
>>> avoidable icache impact. While overlapping stores do come with some
>>> penalty, they are cheaper than the above combined.
>> 
>> They’re not all spelled out, because a lot of them are subsets of the
>> others.  The compiler does a *very* good job of optimising this.  The
>> generated assembly was a lot more complex than anything I could write by
>> hand.
>> 
> 
> I said they are spelled out in the code which ends up being generated;
> it looks like this:
> [snip]
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> pop    %rbp
> ret
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> movzbl 0x8(%rsi),%eax
> mov    %al,0x8(%rdi)
> pop    %rbp
> ret
> mov    (%rsi),%rax
> mov    %rax,(%rdi)
> movzwl 0x8(%rsi),%eax
> mov    %ax,0x8(%rdi)
> pop    %rbp
> ret
> [/snip]
> 
> That section is *massive* and it is pessimal. Once more, the thing to
> do is a few branches that land in sections handling the copy with
> overlapping stores, akin to what bionic does here:
> https://android.googlesource.com/platform/bionic/+/refs/heads/master/libc/arch-x86_64/string/sse2-memmove-slm.S
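> 
> The shape of the technique is roughly this (a sketch of the idea, not
> bionic's actual code): a copy of 8..16 bytes becomes two possibly
> overlapping 8-byte loads and stores, so one size-class branch replaces
> a jump-table entry per size.
> 
>   #include <stddef.h>
>   #include <stdint.h>
>   #include <string.h>
> 
>   // memcpy() with a constant 8-byte size compiles down to single
>   // unaligned loads and stores, so this is branch-free apart from the
>   // size-class check that routes here.
>   static void copy_8_to_16(void *dst, const void *src, size_t n) {
>     uint64_t first, last;
>     memcpy(&first, src, 8);
>     memcpy(&last, (const char *)src + n - 8, 8);
>     memcpy(dst, &first, 8);
>     memcpy((char *)dst + n - 8, &last, 8);
>   }
> 
> The same pattern with 16- and 32-byte vectors covers everything up to a
> couple hundred bytes with only a handful of branches.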
> 
>>> I don't have numbers or bench code handy, but if you insist on
>>> contesting the above I'll be glad to provide them.
>>> 
>>>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
>>> 
>>> Depends on what you mean by medium and which simd instructions, but
>>> yes, they should be used here.
>> 
>> This could probably do with more tuning, but all of this is configurable
>> with a couple of template parameters.  If it’s useful to have more variants
>> for different microarchitectures then it’s a trivial tweak to generate
>> them.
>> 
>>>> - rep movsb for large copies.
>>> 
>>> There are still old CPUs here and there which don't support ERMS.
>>> They are very negatively impacted by the above and should roll with
>>> rep movsq instead.
>> 
>> We tested on some microarchitectures without ERMS but didn’t compare
>> with rep movsq as an alternative.  The rep movsb is inline assembly
>> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
>> so it would be quite easy to compare.  Very happy to take patches that make
>> things faster!
>> 
> 
>> Each tuned version is a dozen lines of code (plus a lot of comments to
>> explain *why* it does things a particular way), so it’s very easy to add
>> different versions with different tuning.
>> 
>>> I see the code takes some care to align the target buffer. That's
>>> good, but not necessary on CPUs with FSRM.
>> 
>> It doesn’t hurt, though: on the sizes where it’s actually beneficial to
>> use the rep sequence, the cost of alignment is negligible.
>> 
>>> All that said, I have a hard time believing the impact of bounds
>>> checking on short copies is around 20% or so. The code to do it looks
>>> super slow (not that I know how to do better).
>> 
>> The last time I looked at the disassembly, the code for the bounds check
>> touched three cache lines and had two branches.  It fits well in a
>> superscalar pipeline, so it adds a handful of cycles.  This is basically
>> in the noise for large copies but is double-digit percentage overhead for
>> small ones.
>> 
> 
> I'll be blunt here: that's misleading at best and intellectually
> dishonest at worst.
> 
> By your own admission very small copies are the most common, and the
> very paper you refer to (automemcpy) gives a figure of 96% of
> real-world copies being 128 bytes or less. These are also the copies
> which are most affected.
> 
> As I tried to explain, super short copies have very little work to do,
> so the impact stated above can't be an accurate estimate on top of a
> ~optimal routine.
> 
> Your benchmark code artificially pessimizes the comparison to whatever
> is in libc, because calls to it go through the PLT while calls to your
> own routine don't. The difference does matter for short copies.
> 
> I ran a quick check against glibc as shipped with Ubuntu 22. Your
> routine was massively slower for sizes < 12 and then started evening
> out with glibc; after that it went back and forth, likely just
> measurement error.
> 
>> My original version extended the FreeBSD assembly memcpy with a call to a
>> function that did the bounds checks, but that had *much* worse performance.
>> We could insert assembly to do the bounds checks, but that’s hard to update
>> if we change the metadata layout in malloc.
>> 
> 
> I'm not saying the variant as shipped now is any good or should be
> used; I'm saying overlapping stores need to be used as opposed to a
> jump table. The current code should be whacked.
> 
> Even then, of course, it had to get slower -- you added a function
> call in there.
> 
>> We could modify the ifunc resolvers to use the current memcpy if you don’t
>> care about bounds checks.  I did a tiny amount of testing (mostly clang and
>> lld) in this configuration and it was slower than using the snmalloc memcpy,
>> but I didn’t do enough benchmarking to be confident of that.  Hence the
>> CFT.
>> 
> 
> Once more, see the previous remarks about the currently implemented
> function not being fast.
> 
> The thing *to do* in this area is to implement a sensible memcpy; the
> automemcpy paper may be a good basis for it.
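> 
> To make the overall shape concrete (a sketch only; copy_8_to_16 is the
> overlapping-store helper from earlier in this mail, same includes, and
> a real routine needs more size classes plus a runtime ERMS/pre-ERMS
> switch):
> 
>   void *memcpy_sketch(void *dst, const void *src, size_t n) {
>     if (n >= 8 && n <= 16) {
>       copy_8_to_16(dst, src, n);        // overlapping scalar stores
>       return dst;
>     }
>     // ... analogous overlapping 16/32-byte vector cases go here ...
>     if (n > 256) {
>       unsigned char *d = (unsigned char *)dst;
>       const unsigned char *s = (const unsigned char *)src;
>       // Large copies: let the CPU do it (rep movsq on pre-ERMS parts).
>       __asm__ volatile("rep movsb" : "+D"(d), "+S"(s), "+c"(n) :: "memory");
>       return dst;
>     }
>     return memcpy(dst, src, n);         // cases not sketched out
>   }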
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>