svn commit: r340472 - in head: lib/libc/amd64/string sys/amd64/amd64

Bruce Evans brde at optusnet.com.au
Fri Nov 16 09:50:50 UTC 2018


On Fri, 16 Nov 2018, Mateusz Guzik wrote:

> Log:
>  amd64: handle small memset buffers with overlapping stores
>
>  Instead of jumping to locations which store the exact number of bytes,
>  use displacement to move the destination.
>
>  In particular the following clears an area between 8-16 (inclusive)
>  branch-free:
>
>  movq    %r10,(%rdi)
>  movq    %r10,-8(%rdi,%rcx)
>
>  For instance for rcx of 10 the second line is rdi + 10 - 8 = rdi + 2.
>  Writing 8 bytes starting at that offset overlaps with 6 bytes written
>  previously and writes 2 new, giving 10 in total.
>
>  Provides a nice win for smaller stores. Other ones are erratic depending
>  on the microarchitecture.
>
>  General idea taken from NetBSD (restricted use of the trick) and bionic
>  string functions (use for various ranges like in this patch).
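
(Written as C rather than asm, the quoted trick amounts to the following
sketch; the name set8to16 is made up, the length is assumed to already be
known to be in the range 8..16, and the fill byte has already been
broadcast to 64 bits.)

    #include <stdint.h>
    #include <string.h>

    /*
     * Sketch: fill len bytes, 8 <= len <= 16, with two possibly
     * overlapping 8-byte stores.  memcpy() is only used to express an
     * unaligned 8-byte store in portable C.
     */
    static void
    set8to16(void *dst, uint64_t fill, size_t len)
    {
            char *p = dst;

            memcpy(p, &fill, 8);            /* bytes [0, 8) */
            memcpy(p + len - 8, &fill, 8);  /* bytes [len - 8, len) */
    }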

Why not take such ideas from FreeBSD (or at least from FreeBSD committers)
where this one was used between 1996 and 2010 for the i586(npx)-optimized
bzero?

Testing showed that it wasn't a very good idea, so I didn't use it anywhere
else and didn't complain much when it was backed out.  It is not very good
since it pessimizes the usual case where everything is aligned.  Now it is
an even larger pessimization for the ERMS case, at least in theory, since
"rep movsb" should be able to handle alignment stuff.

Here it is for the version in FreeBSD-5:

XX i586_bz2:
XX 	fldz
XX 
XX 	/*
XX 	 * Align to an 8 byte boundary (misalignment in the main loop would
XX 	 * cost a factor of >= 2).  Avoid jumps (at little cost if it is
XX 	 * already aligned) by always zeroing 8 bytes and using the part up
XX 	 * to the _next_ alignment position.
XX 	 */
XX 	fstl	0(%edx)
XX 	addl	%edx,%ecx		/* part of %ecx -= new_%edx - %edx */
XX 	addl	$8,%edx
XX 	andl	$~7,%edx
XX 	subl	%edx,%ecx
XX 
XX 	/*
XX 	 * Similarly align `len' to a multiple of 8.
XX 	 */
XX 	fstl	-8(%edx,%ecx)
XX 	decl	%ecx
XX 	andl	$~7,%ecx

This even has comments.

The "little" cost mentioned in the comments is just the instruction fetch
cost plus an extra fstl in cases where everthing is aligned.  The first
fstl in the above doesn't have much extra cost since it replaces an fstl
in the loop later.  The second one is only needed when the alignment stuff
is needed (e.g., to write 14 bytes as 8+8 with an overlap of too), but it
is always done to reduces branches.
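
The same head/tail trick can be sketched in C (an illustration only, not
the old npx code, and assuming len >= 16 so that the unconditional head
and tail stores stay inside the buffer):

    #include <stdint.h>
    #include <string.h>

    static void
    bzero_align8(void *dst, size_t len)
    {
            static const uint64_t zero = 0;
            char *p = dst;
            char *end = p + len;
            char *ap;               /* aligned start for the main loop */
            size_t alen;            /* bytes to clear in the main loop */

            memcpy(p, &zero, 8);    /* unconditional 8-byte head store */
            ap = (char *)(((uintptr_t)p + 8) & ~(uintptr_t)7);
            alen = (size_t)(end - ap);

            memcpy(end - 8, &zero, 8);      /* unconditional 8-byte tail store */
            alen = (alen - 1) & ~(size_t)7; /* round down; tail covers the rest */

            while (alen != 0) {             /* aligned 8-byte main loop */
                    memcpy(ap, &zero, 8);
                    ap += 8;
                    alen -= 8;
            }
    }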

With too many instructions to fetch, code like this becomes almost as slow
as "rep movs".  IIRC, on Haswell, "rep movs[bwlq]" takes 25 cycles to start
up (most x86's take about that long to start up string instructions, and
ERMS doesn't improve this on at least Haswell), and all cases have a
throughput of 32 bytes/cycle.  So in 25 cycles 800 bytes can be copied,
and for data smaller than about this size it is best not to use string
instructions, provided you don't use methods that take too many cycles to
start up.
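
The break-even reasoning above amounts to a dispatch like the following
sketch (the 800-byte cutoff is only the back-of-the-envelope Haswell
number, not a tuned constant, and the names are made up):

    #include <stddef.h>

    #define REP_MOVSB_CUTOFF        800     /* ~25 cycles * 32 bytes/cycle */

    static void
    copy_dispatch(void *dst, const void *src, size_t len)
    {
            if (len < REP_MOVSB_CUTOFF) {
                    /* Small copies: avoid the string-instruction startup cost. */
                    const char *s = src;
                    char *d = dst;

                    while (len-- != 0)
                            *d++ = *s++;
            } else {
                    /* Large copies: the "rep movsb" startup cost amortizes
                     * (gcc/clang inline asm). */
                    __asm__ volatile("rep movsb"
                        : "+D" (dst), "+S" (src), "+c" (len)
                        :
                        : "memory");
            }
    }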

Alignment stuff tends to take too many cycles to start up especially if it
has lots of branches which trash the branch target caches.  I think the
above takes about 3 cycles on Haswell, except for the fstls which are quite
slow.  IIRC, they have a throughput of about 1 every 2 cycles and a latency
of 4 or 8 cycles.

The above code was optimized for Pentium-1's where the times in cycles for
fstl were not much different from on newer CPUs, but everything else is
either wider or faster so going through npx registers is a pessimization.
Pentium-1's can barely saturate their slow L1 cache using npx instructions.

I don't like the way ERMS is used in -current on amd64:
- pagezero_erms: this is useless on at least Haswell since it has the same
   speed as pagezero_std.  ERMS makes "rep stosq" equally fast to "rep stosb"
   and everything is aligned so there is no difference in the setup overhead.
   (There might be a difference dividing the count by the access width, but
   this is done at compile time in pagezero_std, and the overhead for doing
   this at runtime is in the noise.)
- memmove_erms: with a large size, this should just use "rep movsb" with
   almost no setup.  Instead, it uses the MEMMOVE macro to do lots of setup
   and to obfuscate its pessimizations.  Disassembly shows that memmove_erms
   ends up with 152 instructions while memmove_std ends up with 164
   instructions.  There is little difference except that memmove_std has an
   extra main loop doing "rep movsq".  This is implemented by testing the erms
   arg of MEMMOVE in just 2 places.  The non-erms case of course has to divide
   by the access width, and then has to arrange to copy any remaining bytes.
- memmove_std: this still has the pessimization of using "rep movsb" to
   finish up in some cases.  This costs ~25 cycles of setup overhead when it
   is reached.  erms as currently used gives almost no optimizations except
   by bypassing this pessimization.

   The finishing should be done using integer registers.  Only 1 store
   is needed using overlapping stores (see the sketch after this list).
   The macro makes it more difficult to remove this pessimization.
- memmove_erms again: with a small size, it is best not to use any string
   instructions.  I think the macro does this OK except it is hard to read.
   But optimizing for small sizes is especially unimportant.  Small sizes
   are inlined by the compiler in many cases, and when memmove* is actually
   called, it runs fast, at least if it doesn't use string instructions or
   have too much setup overhead to optimize the even more unimportant
   misaligned case.
- ifuncs: I don't like ifuncs.  The one for pagezero_erms is just an
   obfuscation.  The one for memmove_erms is just to save 1 whole predictable
   branch near the start of a general memmove function.  memmove_erms now has
   27 branches.  Most of them are not executed for most calls, but I doubt
   that they cost less than the 1 branch avoided using ifuncs.  The pre-erms
   version of bcopy() has just 1 branch (for the overlap case).

   Benchmarking branch costs is difficult and micro-benchmarks do it
   especially badly.  The size of the branch target cache on modern x86's
   is a few K.  That is enough to hold all branches for most micro-benchmarks,
   but kernels and applications are so bloated that it is not large enough
   to hold all branches for a few real applications.  memmove() might be
   called enough to keep its main branches cached, but that keeps the cache
   depleted for other uses.
- memcpy*: like memmove*, except it is simpler, but this is obfuscated using
   the MEMMOVE macro for memcpy too.
- memset*: similar, but even simpler since the operation is inherently
   simpler and the MEMSET macro is only used to obfuscate 2 functions.
- bzero: this better API is pessimized by going through memset(), so that
   its fill word has to be constructed at runtime.
- copyout*, copyin*: these use MEMMOVE for the erms control.  This requires
   further obfuscations in the macro, and duplicates the ~27 branches in
   memmove_erms many times.
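
To illustrate the overlapping-store finish mentioned for memmove_std above,
here is a sketch for the non-overlapping forward case (with overlapping
buffers the 8 tail bytes would have to be loaded before the main loop):

    #include <stdint.h>
    #include <string.h>

    /*
     * Sketch: after "rep movsq" (or an 8-byte loop) has copied len & ~7
     * bytes, the 0-7 remaining bytes can be finished with a single 8-byte
     * store that overlaps the already-copied area, instead of paying the
     * startup cost of "rep movsb".  Assumes len >= 8 and non-overlapping
     * buffers.
     */
    static void
    copy_tail(char *dst, const char *src, size_t len)
    {
            uint64_t last;

            if ((len & 7) != 0) {
                    memcpy(&last, src + len - 8, 8);  /* last 8 source bytes */
                    memcpy(dst + len - 8, &last, 8);  /* overlaps copied part */
            }
    }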

I only did efficiency tests of this using the makeworld macro-benchmark.
This showed no significant differences on Haswell.

I did some micro-benchmarks related to this in userland.  clang turns
simple C memcpy() (a loop with *dst++ = *src++) into sort of much better
code than all old and new "optimized" asm variants.  E.g., if -march
indicates that the CPU supports AVX*, then clang generates unportable
AVX* code, which the asm variants can't reasonably do, especially in the
kernel.  This is only sort of better since, apart from its unportability,
clang generates its usual excessive unrolling and too much setup code.
In practice, I couldn't get clang's AVX code to run faster than "rep movsb"
on freefall, but on my local Haswell system my benchmarks using AVX in asm
run faster than "rep movsb".  gcc-4.2 doesn't support AVX* and mostly
generates worse non-AVX code than clang, but sometimes it generates better
code.  Both compilers generated worse code for i386 with -m32.  So for
compilers as well as for asm authors, the best code is very MD and is
rarely produced or written.
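
The simple C loop referred to above is essentially (function name made up):

    #include <stddef.h>

    /*
     * With -march allowing AVX*, clang typically vectorizes this loop (or
     * turns it into a call to the library memcpy unless that is disabled).
     */
    void *
    simple_memcpy(void *dst, const void *src, size_t len)
    {
            char *d = dst;
            const char *s = src;

            while (len-- != 0)
                    *d++ = *s++;
            return (dst);
    }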

Bruce

