svn commit: r323329 - head/sys/sys

Sat Sep 9 09:40:58 UTC 2017

On Sat, 9 Sep 2017, David Chisnall wrote:

> On 8 Sep 2017, at 21:09, Mateusz Guzik <mjg at FreeBSD.org> wrote:
>>
>> Author: mjg
>> Date: Fri Sep  8 20:09:14 2017
>> New Revision: 323329
>> URL: https://svnweb.freebsd.org/changeset/base/323329
>>
>> Log:
>>  Allow __builtin_memset instead of bzero for small buffers of known size
>
> This change seems redundant, because modern compilers already do this optimisation.  For example:
>
> 	#include <strings.h>
>
> 	char buf[42];
>
> 	void bz(void)
> 	{
>        	bzero(buf, 42);
> 	}
>
> With clang 4.0 on x86 compiles to:
>
>        pushq   %rbp
>        movq    %rsp, %rbp
>        xorps   %xmm0, %xmm0
>        movups  %xmm0, buf+26(%rip)
>        movaps  %xmm0, buf+16(%rip)
>        movaps  %xmm0, buf(%rip)
>        popq    %rbp
>        retq

This is only a valid optimization for POSIX code.  Compiling with -std=c99
turns it off, even though the non-C99 <strings.h> is included.

Note that -std is broken in CFLAGS for the kernel.  It is c99 (actually
iso9899:1999).  Userland uses the correct standard gnu99.  -std=c99 breaks
other gnu features like unnamed struct/unions which are then unbroken
bogusly using -fms-extensions.

> Neither contains a call, both have inlined the zeroing.  This change is strictly worse, because the compiler has some carefully tuned heuristics that are set per target for when to inline the memset / bzero and when to call the function.  These are based on both the size and the alignment, including whether the target supports misaligned accesses and whether misaligned accesses are cheap.  None of this is captured by this change.

It is strictly better than plain bzero() in the kernel, and makes no
difference in userland.

> In the kernel, this optimisation is disabled by -ffreestanding, however __builtin_memset will be turned into a memset call if the size is not constant or if the memset call would be more efficient (as determined by the aforementioned heuristics).  Simply using __builtin_memset in all cases should give better code, and is more likely to be forward compatible with future ISAs where the arbitrary constant picked in this patch may or may not be optimal.

Compilers don't really understand this.  In the kernel, extern memset()
is strictly worse than extern bzero(), since memset() is just a wrapper
around bzero().  Thus simply using __builtin_memset() gives worse
strictly code in all cases that are not inlined.  Not a large amount
worse since the wrapper only does an extra branch and function call
when it reduces to bzero().  memset() is just slow in other cases.

The constant is not arbitrary, but is hard-coded to the x86 value which is
more arbitrary.  x86 compilers know that they don't really understand memory
so they hard-code a conservative value of 64.  It is only possible do better
by compiling for a specific arch and knowing how good or bad the extern
function is.  64 is about the smallest value above which the extern function
could possibly do significantly better by knowing or doing more.  Usually it
does slightly worse near 64 and also higher since it knows less, especially
if the compilation was for a specific arch (clang seems to have even more
methods than my benchmarks for this -- hundreds instead of tens --, while
x86 kernels have only one method optimized for 25-year old arches and now
OK again on new arches with fast string instructions (above a few hundred
bytes).

Bruce