Subject: Re: CFT: snmalloc as libc malloc
From: David Chisnall <theraven@FreeBSD.org>
Date: Sun, 5 Mar 2023 17:27:16 +0000
Cc: freebsd-hackers
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
To: Mateusz Guzik

Hi,

Thanks for the comments.

The FlagLock is used in two places:

- Early init, before anything in libc can be guaranteed to exist.
- Some *very* slow paths, where contention is incredibly improbable.

We could replace them with _umtx_op / futex locks, but I doubt you'd see a
difference. Do you have any workloads where the locks introduce contention?
The locks are all abstracted away in a single place, so it would be fairly
simple to add a couple of lock functions to the PAL. Probably a 10-line
change in total, but since it hasn't shown up in any benchmarking that
we've done yet, it hasn't been a priority.

With respect to NUMA: if a thread is pinned to a core and the kernel is
handing it local memory, then you should see good performance initially.
All allocations are satisfied from a pool that is local to the core. If
they are freed on other threads, they are batched and returned to the
allocating thread. It is possible for chunks that were originally created
on one core to be recycled and used on another core. This happens at
greater-than-page granularity and so is better handled in the kernel.

If you have a benchmark for memcpy that shows a big slowdown with the
bounds checking, I would like to see it so that we can use it to optimise
the implementation. 12-byte memcpys are common, but they also fit well in
the store queues, overlapped with the bounds checks. I'd be *very*
interested in workloads where this is not the case.

I broke my shoulder a week ago and so am typing very slowly with one hand,
but patches are very welcome. There is a PR under review at the moment to
add memmove, which could benefit from some more tuning before it's merged.
David

> On 5 Mar 2023, at 16:47, Mateusz Guzik wrote:
> 
> Apologies for the significant delay; other stuff came up and this fell
> through the cracks.
> 
> I'm going to start with the snmalloc remarks; the memcpy-related
> response is down below.
> 
> I find it quite weird that there are absolutely no mentions of NUMA in
> a general-purpose allocator written in recent years. While nothing can
> be done for an application which does not use any affinity, what *can*
> be done is to avoid negatively affecting programs which *do* use it.
> Most notably, making sure that allocations backed by one page don't
> land in threads from distinct domains. It is unclear to me what, if
> anything, the allocator is doing on this front, but I also don't know
> whether this would be a regression vs jemalloc.
> 
> One major issue I found is the use of hand-rolled spinlocks -- while
> they may be fine for specialized uses, they are a non-starter for
> general-purpose use. This is the age-old class of priority problems,
> where you can end up yield()ing all you want while the lock owner is
> nowhere near a CPU, or the scheduler does not think you should give way
> to it. The least bad thing to do would be to add umtx support to
> remedy this problem, but that's a lot of work and it is probably not
> warranted. The fallback, then, is pthread locks -- while pessimal, they
> don't suffer from the problem. The lock operation not being on the fast
> path is not a factor for this problem.
> 
> I tried to do a sensible benchmark of the thing, but ran into snags to
> sort out first. The idea was to 'tinderbox' it, except:
> 1. make's use of poll is a massive scalability bottleneck, addressed
> here in an experimental manner:
> https://github.com/macdice/freebsd/tree/bmake-shmem
> 2. with that out of the way there is still massive contention in the
> vm subsystem. I had a patch for most of it here:
> https://reviews.freebsd.org/D36038 . It rebases cleanly but no longer
> works; I don't know why yet.
> 
> There are other issues.
> 
> Ultimately, if snmalloc is to ship, it would be nice to have it in 14,
> and we are nearing the time when it will be too late, so I'll try to
> sort it all out next week.
> 
> Meanwhile there is the lock issue to sort out.
> 
> On to memcpy.
> 
> On 2/9/23, David Chisnall wrote:
>> On 9 Feb 2023, at 20:53, Mateusz Guzik wrote:
>>> 
>>> First and foremost, perhaps I should clear up that I have no opinion
>>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>>> thing.
>>> 
>>> Inspecting assembly is what I intended to do, but I ran into the
>>> compilation problem.
>> 
>> If you just want to look at the memcpy, you might do better using the
>> upstream (CMake) build system, which builds a shim library that you
>> can disassemble and look at.
>> 
>>> So, as someone who worked on memcpy previously, I note the variant
>>> currently implemented in libc is pessimal for sizes > 16 bytes
>>> because it does not use SIMD. I do have plans to rectify that, but
>>> ENOTIME.
>> 
>> That's one of the benefits of the C++ implementation. We were able to
>> try a few dozen variations in a morning. Writing a single one of those
>> in assembly would take a similar amount of time.
>> 
>> We were inspired by the automemcpy paper, which demonstrated that you
>> can write memcpy implementations tuned to specific workloads in
>> higher-level languages and get better performance.
>> 
>>> I also note CPUs are incredibly picky when it comes to the starting
>>> point of the routine. The officially recommended alignment of 16
>>> bytes is basically a tradeoff between total binary size and
>>> performance. Should you move the routine between different 16-byte
>>> intervals you will easily find big variation in performance,
>>> depending on how big an alignment you ended up with.
>> 
>> Yes, that's what our test does. It does a large number of different
>> copies with different sizes and alignments.
>> 
> 
> I'm not talking about placement of the buffers passed to memcpy, I'm
> talking about placement of the *code implementing memcpy*.
> 
>>> I'm trying to say that if you end up with different funcs depending
>>> on bounds checking and you only align them to 16 bytes, the benchmark
>>> is most likely inaccurate, if only for this reason.
>> 
>> I'm not sure I follow this, sorry.
>> 
> 
> If the address of the *func itself* is divisible by 16 but not 32, it
> may end up with a different performance profile depending on the uarch.
> And the code may randomly move up and down as you change the source,
> so stabilizing it at some value (say 32) would be the best thing to
> do.
> 
>>>> The fastest on x86 is roughly:
>>>> 
>>>> - A jump table for small sizes that does power-of-two-sized small
>>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>> 
>>> Last I checked this was not true. The least slow thing to do was to
>>> branch on a few sizes and handle them with overlapping stores --
>>> roughly what the memcpy in libc is doing now.
>>> 
>>> Jump table aside, you get all sizes "spelled out", which is just
>>> avoidable impact on the icache. While overlapping stores do come with
>>> some penalty, it is cheaper than the above combined.
>> 
>> They're not all spelled out, because a lot of them are subsets of the
>> others. The compiler does a *very* good job of optimising this. The
>> generated assembly was a lot more complex than anything I could write
>> by hand.
>> 
> 
> I said they are spelled out in the code which ends up being generated;
> it looks like this:
> [snip]
> mov (%rsi),%rax
> mov %rax,(%rdi)
> pop %rbp
> ret
> mov (%rsi),%rax
> mov %rax,(%rdi)
> movzbl 0x8(%rsi),%eax
> mov %al,0x8(%rdi)
> pop %rbp
> ret
> mov (%rsi),%rax
> mov %rax,(%rdi)
> movzwl 0x8(%rsi),%eax
> mov %ax,0x8(%rdi)
> pop %rbp
> ret
> [/snip]
> 
> That section is *massive* and it is pessimal; once more, the thing to
> do is a few branches to land in sections handling stuff with
> overlapping stores, akin to what bionic is doing here:
> https://android.googlesource.com/platform/bionic/+/refs/heads/master/libc/arch-x86_64/string/sse2-memmove-slm.S
> 
>>> I don't have numbers nor bench code handy, but if you insist on
>>> contesting the above I'll be glad to provide them.
>>> 
>>>> - A vectorised copy for medium-sized copies using a loop of SSE
>>>> copies.
>>> 
>>> Depends on what you mean by medium and which SIMD instructions, but
>>> yes, they should be used here.
>> 
>> This could probably do with more tuning, but all of this is
>> configurable with a couple of template parameters. If it's useful to
>> have more variants for different microarchitectures then it's a
>> trivial tweak to generate them.
>> 
>>>> - rep movsb for large copies.
>>> 
>>> There are still old CPUs here and there which don't support ERMS.
>>> They are very negatively impacted by the above and should roll with
>>> rep movsq instead.
>> 
>> We tested on some microarchitectures without FSRM but didn't compare
>> with rep movsq as an alternative. The rep movsb is inline assembly
>> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
>> so it would be quite easy to compare. Very happy to take patches that
>> make things faster!
>> 
> 
>> Each tuned version is a dozen lines of code (plus a lot of comments to
>> explain *why* it does things a particular way), so it's very easy to
>> add different versions with different tuning.
>> 
>>> I see the code takes some care to align the target buffer. That's
>>> good, but not necessary on CPUs with FSRM.
>> 
>> It doesn't hurt, though; at the sizes where it's actually beneficial
>> to use the rep sequence, the cost of alignment is negligible.
>> 
>>> All that said, I have a hard time believing the impact of bounds
>>> checking on short copies is around 20% or so. The code to do it looks
>>> super slow (not that I know how to do better).
>> 
>> The last time I looked at the disassembly, the code for the bounds
>> check touched three cache lines and had two branches. It fits well in
>> a superscalar pipeline, so it adds a handful of cycles. This is
>> basically in the noise for large copies but is a double-digit
>> percentage overhead for small ones.
>> 
> 
> I'll be blunt here: that's misleading at best and intellectually
> dishonest at worst.
> 
> By your own admission very small copies are the most common, and the
> very paper you refer to (automemcpy) gives a figure of 96% of
> real-world copies being 128 bytes or less. These are also the copies
> which are most affected.
> 
> As I tried to explain, super short copies have very little to do, so
> the impact stated above can't be an accurate estimate on top of an
> ~optimal routine.
> 
> Your benchmark code artificially pessimizes the comparison to whatever
> is in libc, because calls to libc go through the PLT while calls to
> your own routine don't. The difference does matter for short copies.
> 
> I ran a quick check against glibc as shipped with Ubuntu 22. Your
> routine was massively slower for sizes < 12 and then started evening
> up with glibc; after that it was going back and forth due to what's
> likely just measurement error.
> 
>> My original version extended the FreeBSD assembly memcpy with a call
>> to a function that did the bounds checks, but that had *much* worse
>> performance. We could insert assembly to do the bounds checks, but
>> that's hard to update if we change the metadata layout in malloc.
>> 
> 
> I'm not saying the variant as shipped now is any good or should be
> used; I'm saying overlapping stores need to be used as opposed to a
> jump table. The current code should be whacked.
> 
> Even then, of course it had to get slower -- you added a function call
> in there.
> 
>> We could modify the ifunc resolvers to use the current memcpy if you
>> don't care about bounds checks. I did a tiny amount of testing (mostly
>> clang and lld) in this configuration and it was slower than using the
>> snmalloc memcpy, but I didn't do enough benchmarking to be confident
>> of that. Hence the CFT.
>> 
> 
> Once more, see the previous remarks about the currently implemented
> func not being fast.
> 
> The thing *to do* in the area is to implement a sensible memcpy; the
> automemcpy paper may well be a good basis for it.
> 
> -- 
> Mateusz Guzik