From: Mateusz Guzik
Date: Thu, 9 Feb 2023 21:53:34 +0100
Subject: Re: CFT: snmalloc as libc malloc
To: David Chisnall
Cc: freebsd-hackers
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On 2/9/23, David Chisnall wrote:
> On 9 Feb 2023, at 19:15, Mateusz Guzik wrote:
>>
>> it fails to build for me:
>>
>> /usr/src/lib/libc/stdlib/snmalloc/malloc.cc:35:10: fatal error:
>> 'override/jemalloc_compat.cc' file not found
>> #include "override/jemalloc_compat.cc"
>>          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> 1 error generated.
>> --- malloc.o ---
>> *** [malloc.o] Error code 1
>>
>> make[4]: stopped in /usr/src/lib/libc
>> /usr/src/lib/libc/stdlib/snmalloc/memcpy.cc:25:10: fatal error:
>> 'global/memcpy.h' file not found
>> #include
>>          ^~~~~~~~~~~~~~~~~
>> 1 error generated.
>> --- memcpy.o ---
>> *** [memcpy.o] Error code 1
>
> This looks as if you haven’t got the submodule?  Is there anything in
> contrib/snmalloc?
>

indeed, a pilot error

>> anyway, I wanted to say I find the memcpy thing incredibly suspicious.
>> I found one article in
>> https://github.com/microsoft/snmalloc/blob/main/docs/security/GuardedMemcpy.md
>> which benches it and that made it even more suspicious. How did the
>> benched memcpy look like inside?
>
> Perhaps you could share what you are suspicious about?  I don’t really know
> how to respond to something so vague. The document you linked to has the
> benchmark that we used (though the graphs in it appear to be based on an
> older version of the memcpy).  The PR that added PowerPC tuning has some
> additional graphs of measurements.
>
> If you compile the memcpy file, you can see the assembly. The C++ provides
> a set of building blocks for producing efficient memcpy implementations.

First and foremost, perhaps I should clear up that I have no opinion
on snmalloc or on it replacing jemalloc. I'm here only about the memcpy
thing.

Inspecting the assembly is what I intended to do, but I ran into the
compilation problem.

So, as someone who worked on memcpy previously, I note the variant
currently implemented in libc is pessimal for sizes > 16 bytes because
it does not use SIMD. I do have plans to rectify that, but ENOTIME.

I also note CPUs are incredibly picky when it comes to the starting point
of the routine. The officially recommended alignment of 16 bytes is
basically a tradeoff between total binary size and performance.
If you move the routine to different 16-byte offsets, you will
easily find big variation in performance, depending on how big of an
alignment you ended up with.

I tried to compile the benchmark but got bested by c++. I don't know
the lang and I don't want to fight it.

$ pwd
/usr/src/contrib/snmalloc/src
$ c++ -I. test/perf/memcpy/memcpy.cc
[snip]
./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
error: no template named 'is_integral_v' in namespace 'std'; did you
mean 'is_integral'?
      static_assert(std::is_integral_v, "Type must be integral");
                    ~~~~~^~~~~~~~~~~~~
                         is_integral

and tons of other errors.

I did buildworld + installworld.

I'm trying to say that if you end up with different funcs depending on
bounds checking and you only align them to 16 bytes, the benchmark is
most likely inaccurate if only for this reason.

> The fastest on x86 is roughly:
>
> - A jump table of power for small sizes that do power-of-two-sized small
> copies (double-word, word, half-word, and byte) to perform the copy.

Last I checked this was not true. The least slow thing to do was to
branch on a few sizes and handle them with overlapping stores. Roughly
what memcpy in libc is doing now.

Jump table aside, you got all sizes "spelled out", which is just
avoidable impact on icache. While overlapping stores do come with some
penalty, it is cheaper than the above combined.

I don't have numbers or bench code handy, but if you insist on
contesting the above I'll be glad to provide them.

> - A vectorised copy for medium-sized copies using a loop of SSE copies.

Depends on what you mean by medium and which simd instructions, but
yes, they should be used here.

> - rep movsb for large copies.

There are still old cpus here and there which don't support ERMS. They
are very negatively impacted by the above and should roll with rep movsq
instead.

I see the code takes some care to align the target buffer. That's
good, but not necessary on CPUs with FSRM.
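
To make the "branch on a few sizes and use overlapping stores" idea
concrete, here is a minimal C sketch for copies of up to 16 bytes. It is
an illustration only, not the actual libc or snmalloc code: copy_small
is a hypothetical name, the 16-byte cap is an assumption picked for
brevity, and the fixed-size memcpy calls are just the portable way to
spell unaligned loads and stores (compilers typically lower them to
single moves when optimizing, but that is not guaranteed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from libc or snmalloc): copy n <= 16 bytes
 * between non-overlapping buffers. Each size class loads the first and
 * the last K bytes and stores both; for sizes that are not exactly 2*K
 * the two stores overlap in the middle, which replaces a per-size
 * jump table with three branches.
 */
static void
copy_small(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(n <= 16);
	if (n >= 8) {
		uint64_t head, tail;

		memcpy(&head, s, 8);		/* bytes 0..7 */
		memcpy(&tail, s + n - 8, 8);	/* bytes n-8..n-1 */
		memcpy(d, &head, 8);
		memcpy(d + n - 8, &tail, 8);
	} else if (n >= 4) {
		uint32_t head, tail;

		memcpy(&head, s, 4);
		memcpy(&tail, s + n - 4, 4);
		memcpy(d, &head, 4);
		memcpy(d + n - 4, &tail, 4);
	} else if (n >= 2) {
		uint16_t head, tail;

		memcpy(&head, s, 2);
		memcpy(&tail, s + n - 2, 2);
		memcpy(d, &head, 2);
		memcpy(d + n - 2, &tail, 2);
	} else if (n == 1) {
		d[0] = s[0];
	}
	/* n == 0: nothing to do */
}
```

A 13-byte copy, for example, takes the first branch and issues two
8-byte stores covering bytes 0..7 and 5..12; the three bytes written
twice are the "penalty" mentioned above.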
All that said, I have a hard time believing the impact of bounds
checking on short copies is around 20% or so. The code to do it looks
super slow (not that I know how to do better).

People like to claim short sizes are inlined by the compiler, but
that's only true if the size is known at compilation time, which it
often is not. Then they claim they are rare.

To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@                                        19758
               1 |@@@@@@@@                                 218420
               2 |@@                                       67670
               4 |@@@@                                     103914
               8 |@@@@@@@@@@@                              301157
              16 |@@@@@@@@@@                               293812
              32 |@@                                       57954
              64 |@                                        23818
             128 |                                         13344
             256 |@                                        18507
             512 |                                         6342
            1024 |                                         1710
            2048 |                                         627
            4096 |                                         398
            8192 |                                         34
           16384 |                                         10
           32768 |                                         6
           65536 |                                         7
          131072 |                                         4
          262144 |                                         1
          524288 |                                         1
         1048576 |                                         0

obtained with this bad boy:
dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
-target x86_64-unknown-freebsd14.0
--sysroot=/usr/obj/usr/src/amd64.amd64/tmp
-B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
-fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys
-I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
-D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
-fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
-MD -MF.depend.vfs_cache.o -MTvfs_cache.o
-fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
-fdebug-prefix-map=./x86=/usr/src/sys/x86/include
-fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
-mno-red-zone -mno-mmx -mno-sse -msoft-float
-fno-asynchronous-unwind-tables -ffreestanding -fwrapv
-fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
-Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
-Wno-pointer-sign -D__printf__=__freebsd_kprintf__
-Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
-Wno-error=tautological-compare -Wno-error=empty-body
-Wno-error=parentheses-equality -Wno-error=unused-function
-Wno-error=pointer-sign -Wno-error=shift-negative-value
-Wno-address-of-packed-member -Wno-error=array-parameter
-Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
-Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes
-mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"

tl;dr if this goes in, the fate of the memcpy thing will need to be
handled separately

-- 
Mateusz Guzik