List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
Received: by 2002:ac9:6c92:0:b0:4b3:d953:974c with HTTP; Thu, 9 Feb 2023
 13:34:38 -0800 (PST)
In-Reply-To: <CAGudoHFeFm1K+JSBXaxt2SNTv-tGFgT0onyErOCmBdVjmHaxUg@mail.gmail.com>
References: <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org>
 <CAGudoHFYMLk6EDrSxLiWFNBoYyTKXfHLAUhZC+RF4eUE-rip8Q@mail.gmail.com>
 <E140A3A2-5C4A-4458-B365-AD693AB853E8@FreeBSD.org> <CAGudoHFeFm1K+JSBXaxt2SNTv-tGFgT0onyErOCmBdVjmHaxUg@mail.gmail.com>
From: Mateusz Guzik <mjguzik@gmail.com>
Date: Thu, 9 Feb 2023 22:34:38 +0100
Message-ID: <CAGudoHEvtNFs0+Voog4YcKoTQ026fj0MJdh2i5ZguzvGq1nWcQ@mail.gmail.com>
Subject: Re: CFT: snmalloc as libc malloc
To: David Chisnall <theraven@freebsd.org>
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

The memcpy debacle aside, I can confirm that single-threaded the new
malloc does appear faster in naive tests from will-it-scale:

$ cpuset -l 10,80,82 -- ./malloc2_threads -n -t 2
testcase:malloc/free of 1kB

before:
min:97812514 max:97849385 total:195661899
min:97819901 max:97857131 total:195677032
min:97789741 max:97833562 total:195623303

after:
min:115613762 max:124855002 total:240468764
min:115636562 max:124807148 total:240443710
min:115778776 max:124784220 total:240562996

That said, if anyone is to perform a serious test, the stock memcpy
needs to be used to rule it out as a factor. The one shipped with
snmalloc will happen to be faster for certain sizes and that may skew
any evaluation -- that speed increase (and in fact a higher one) is
achievable without snmalloc.
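
For reference, the totals above work out to a gain of about 23% in this
one test. A minimal sketch of that arithmetic follows; the array and
function names are made up for illustration, and only the numbers come
from the runs quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* "total" values copied from the three before/after runs above. */
static const unsigned long before_total[] = {
	195661899UL, 195677032UL, 195623303UL
};
static const unsigned long after_total[] = {
	240468764UL, 240443710UL, 240562996UL
};

/* Arithmetic mean of n run totals. */
double
mean_total(const unsigned long *runs, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += (double)runs[i];
	return (sum / (double)n);
}

/* Ratio of mean "after" throughput to mean "before" throughput. */
double
malloc2_speedup(void)
{
	return (mean_total(after_total, 3) / mean_total(before_total, 3));
}
```

This says nothing about where the gain comes from, which is the point
of the memcpy caveat.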

On 2/9/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
>> On 9 Feb 2023, at 19:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> it fails to build for me:
>>>
>>> /usr/src/lib/libc/stdlib/snmalloc/malloc.cc:35:10: fatal error:
>>> 'override/jemalloc_compat.cc' file not found
>>> #include "override/jemalloc_compat.cc"
>>>         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 1 error generated.
>>> --- malloc.o ---
>>> *** [malloc.o] Error code 1
>>>
>>> make[4]: stopped in /usr/src/lib/libc
>>> /usr/src/lib/libc/stdlib/snmalloc/memcpy.cc:25:10: fatal error:
>>> 'global/memcpy.h' file not found
>>> #include <global/memcpy.h>
>>>         ^~~~~~~~~~~~~~~~~
>>> 1 error generated.
>>> --- memcpy.o ---
>>> *** [memcpy.o] Error code 1
>>
>> This looks as if you haven’t got the submodule?  Is there anything in
>> contrib/snmalloc?
>>
>
> indeed, a pilot error
>
>>> anyway, I wanted to say I find the memcpy thing incredibly suspicious.
>>> I found one article in
>>> https://github.com/microsoft/snmalloc/blob/main/docs/security/GuardedMemcpy.md
>>> which benches it and that made it even more suspicious. What did the
>>> benched memcpy look like inside?
>>
>> Perhaps you could share what you are suspicious about?  I don’t really
>> know how to respond to something so vague.  The document you linked to
>> has the benchmark that we used (though the graphs in it appear to be
>> based on an older version of the memcpy).  The PR that added PowerPC
>> tuning has some additional graphs of measurements.
>>
>> If you compile the memcpy file, you can see the assembly.  The C++
>> provides a set of building blocks for producing efficient memcpy
>> implementations.
>
> First and foremost, perhaps I should clear up that I have no opinion
> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
> thing.
>
> Inspecting assembly is what I intended to do, but got the compilation
> problem.
>
> So, as someone who worked on memcpy previously, I note the variant
> currently implemented in libc is pessimal for sizes > 16 bytes because
> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>
> I also note CPUs are incredibly picky when it comes to the starting
> point of the routine. The officially recommended alignment of 16 bytes
> is basically a tradeoff between total binary size and performance.
> Should you move the routine to different 16-byte offsets, you will
> easily find big variation in performance, depending on how large an
> alignment you ended up with.
>
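
One way a benchmark could take entry-point alignment out of the picture
is to pin every variant to the same, larger boundary. A rough sketch,
assuming a GCC or Clang toolchain (the aligned function attribute is a
compiler extension); both function names are hypothetical and the byte
loop is only a stand-in for a real routine:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in copy routine whose first instruction is placed on a 64-byte
 * boundary instead of the default function alignment.
 */
__attribute__((aligned(64), noinline))
void *
copy_bytes_aligned64(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(len == 0 || (dst != NULL && src != NULL));
	while (len-- > 0)
		*d++ = *s++;
	return (dst);
}

/* Offset of a routine's entry point within a 64-byte block. */
unsigned
entry_offset_in_line(void *(*fn)(void *, const void *, size_t))
{
	return ((unsigned)((uintptr_t)fn % 64));
}
```

Checking entry_offset_in_line() for each variant under test would at
least show whether two routines being compared start at comparable
offsets.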
> I tried to compile the benchmark but got bested by c++. I don't know
> the lang and I don't want to fight it.
>
> $ pwd
> /usr/src/contrib/snmalloc/src
> $ c++ -I. test/perf/memcpy/memcpy.cc
> [snip]
> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
> error: no template named 'is_integral_v' in namespace 'std'; did you
> mean 'is_integral'?
>       static_assert(std::is_integral_v<T>, "Type must be integral");
>                     ~~~~~^~~~~~~~~~~~~
>                          is_integral
>
> and tons of other errors. I did buildworld + installworld.
>
> I'm trying to say that if you end up with different funcs depending on
> bounds checking and you only align them to 16 bytes, the benchmark is
> most likely inaccurate if only for this reason.
>
>> The fastest on x86 is roughly:
>>
>>  - A jump table of power for small sizes that do power-of-two-sized small
>> copies (double-word, word, half-word, and byte) to perform the copy.
>
> Last I checked this was not true. The least slow thing to do was to
> branch on a few sizes and handle them with overlapping stores. Roughly
> what memcpy in libc is doing now.
>
> Jump table aside, you got all sizes "spelled out", that is just
> avoidable impact on icache. While overlapping stores do come with some
> penalty, it is cheaper than the above combined.
>
> I don't have numbers nor bench code handy, but if you insist on
> contesting the above I'll be glad to provide them.
>
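
To make "branch on a few sizes and handle them with overlapping stores"
concrete, here is a rough C sketch for copies of up to 16 bytes. It is
not the libc routine (that one is written in assembly); small_copy is a
made-up name, and the fixed-size memcpy calls are only a portable way
to spell unaligned loads and stores, which compilers normally turn into
plain moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len <= 16 bytes using one branch per size class. Each class
 * loads the first and last N bytes and stores both; for lengths that
 * are not exactly 2N the two stores overlap in the middle.
 */
void
small_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(len <= 16);
	if (len >= 8) {			/* 8..16 bytes */
		uint64_t head, tail;

		memcpy(&head, s, 8);
		memcpy(&tail, s + len - 8, 8);
		memcpy(d, &head, 8);
		memcpy(d + len - 8, &tail, 8);
	} else if (len >= 4) {		/* 4..7 bytes */
		uint32_t head, tail;

		memcpy(&head, s, 4);
		memcpy(&tail, s + len - 4, 4);
		memcpy(d, &head, 4);
		memcpy(d + len - 4, &tail, 4);
	} else if (len >= 2) {		/* 2..3 bytes */
		uint16_t head, tail;

		memcpy(&head, s, 2);
		memcpy(&tail, s + len - 2, 2);
		memcpy(d, &head, 2);
		memcpy(d + len - 2, &tail, 2);
	} else if (len == 1) {
		d[0] = s[0];
	}
}
```

Four branches cover seventeen sizes here, which is the icache argument:
no per-size code path, at the cost of sometimes writing a few bytes
twice.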
>>  - A vectorised copy for medium-sized copies using a loop of SSE copies.
>
> Depends on what you mean by medium and which simd instructions, but
> yes, they should be used here.
>
>>  - rep movsb for large copies.
>
> There are still old cpus here and there which don't support ERMS. They
> are very negatively impacted by the above and should roll with rep movsq
> instead.
>
> I see the code takes some care to align the target buffer. That's
> good, but not necessary on CPUs with FSRM.
>
> All that said, I have a hard time believing the impact of bounds
> checking on short copies is around 20% or so. The code to do it looks
> super slow (not that I know how to do better).
>
> People like to claim short sizes are inlined by the compiler, but
> that's only true if the size is known at compilation time, which it
> often is not. Then they claim they are rare.
>
> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@                                        19758
>                1 |@@@@@@@@                                 218420
>                2 |@@                                       67670
>                4 |@@@@                                     103914
>                8 |@@@@@@@@@@@                              301157
>               16 |@@@@@@@@@@                               293812
>               32 |@@                                       57954
>               64 |@                                        23818
>              128 |                                         13344
>              256 |@                                        18507
>              512 |                                         6342
>             1024 |                                         1710
>             2048 |                                         627
>             4096 |                                         398
>             8192 |                                         34
>            16384 |                                         10
>            32768 |                                         6
>            65536 |                                         7
>           131072 |                                         4
>           262144 |                                         1
>           524288 |                                         1
>          1048576 |                                         0
>
> obtained with this bad boy:
> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
> -target x86_64-unknown-freebsd14.0
> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
> -fno-strict-aliasing  -g -nostdinc  -I. -I/usr/src/sys
> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
> -fno-common    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
> -MD  -MF.depend.vfs_cache.o -MTvfs_cache.o
> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
> -mno-red-zone -mno-mmx -mno-sse -msoft-float
> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
> -Wno-error=tautological-compare -Wno-error=empty-body
> -Wno-error=parentheses-equality -Wno-error=unused-function
> -Wno-error=pointer-sign -Wno-error=shift-negative-value
> -Wno-address-of-packed-member -Wno-error=array-parameter
> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
> -Wno-error=unused-but-set-variable -Wno-format-zero-length   -mno-aes
> -mno-avx  -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"
>
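
Adding up the rows of that histogram: each quantize() row counts calls
whose size is at least the row's value and below the next row's, so the
rows up to and including 16 cover every call that copied fewer than 32
bytes. That comes to roughly 89% of all memcpy calls in this compile,
and about 63% copied fewer than 16 bytes. A small sketch of the sum,
with the counts copied from the output above and all identifiers made
up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* One quantize() row: sizes in [lower, next row's lower) and the count. */
struct bucket {
	unsigned long lower;
	unsigned long count;
};

static const struct bucket memcpy_sizes[] = {
	{ 0, 19758 }, { 1, 218420 }, { 2, 67670 }, { 4, 103914 },
	{ 8, 301157 }, { 16, 293812 }, { 32, 57954 }, { 64, 23818 },
	{ 128, 13344 }, { 256, 18507 }, { 512, 6342 }, { 1024, 1710 },
	{ 2048, 627 }, { 4096, 398 }, { 8192, 34 }, { 16384, 10 },
	{ 32768, 6 }, { 65536, 7 }, { 131072, 4 }, { 262144, 1 },
	{ 524288, 1 },
};
#define	NBUCKETS	(sizeof(memcpy_sizes) / sizeof(memcpy_sizes[0]))

/* Total number of memcpy calls recorded. */
unsigned long
calls_total(void)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < NBUCKETS; i++)
		sum += memcpy_sizes[i].count;
	return (sum);
}

/* Calls in rows that lie entirely below limit bytes. */
unsigned long
calls_below(unsigned long limit)
{
	unsigned long sum = 0, upper;
	size_t i;

	assert(limit > 0);
	for (i = 0; i < NBUCKETS; i++) {
		/* Row 0 holds only size 0; row L holds L .. 2L-1. */
		upper = memcpy_sizes[i].lower == 0 ?
		    1 : 2 * memcpy_sizes[i].lower;
		if (upper <= limit)
			sum += memcpy_sizes[i].count;
	}
	return (sum);
}
```

Dividing calls_below(32) by calls_total() gives the 89% figure.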
> tl;dr if this goes in, the fate of the memcpy thing will need to be
> handled separately
> --
> Mateusz Guzik <mjguzik gmail.com>
>


-- 
Mateusz Guzik <mjguzik gmail.com>