svn commit: r333240 - in head/sys: powerpc/powerpc sys

Bruce Evans brde at optusnet.com.au
Sat May 5 00:38:40 UTC 2018


On Fri, 4 May 2018, Conrad Meyer wrote:

> On Fri, May 4, 2018 at 8:53 AM, Brooks Davis <brooks at freebsd.org> wrote:
>> On Fri, May 04, 2018 at 04:00:48AM +0000, Mateusz Guzik wrote:
>>> Author: mjg
>>> Date: Fri May  4 04:00:48 2018
>>> New Revision: 333240
>>> URL: https://svnweb.freebsd.org/changeset/base/333240
>>>
>>> Log:
>>>   Allow __builtin_memmove instead of bcopy for small buffers of known size
>>
>> What is the justification for forcing a size rather than using the
>> compiler's built-in fallback to memmove?

It is much the same as for using 'register' -- the programmer knows more than
the compiler (for 1% of the targets supported by the 1% of compilers known to
the programmer).

>> Is the kernel compilation
>> environment disabling that?
>
> Yes.  I think the compiler can't use its builtin bcopy() in standalone mode.

Not quite.  Builtin bcopy() doesn't exist even for clang (5.0.1) in
hosted mode.  But the question was about using __builtin_memmove().
This exists for both gcc and clang, but doesn't work right:

- for gcc-4.2.1, __builtin_memmove() never works right.  It always falls back
   to memmove().  This is broken in the -ffreestanding case, since memmove()
   can be anything then.  Actually, in the kernel a function named memmove(9)
   exists and has the same semantics as memmove(3), but is slightly slower.

   So the effect of the change for gcc-4.2.1 is to pessimize exactly the case
   that was supposed to be optimized: fixed-size moves of size <= 64 now use
   memmove() instead of bcopy(), while other args still use bcopy().

- for clang-5.0.1, on x86 for most -march settings, SSE is used even with
   -ffreestanding.  The kernel uses various -m flags to prevent use of SSE
   and the FPU.  This forces the compiler to copy through integer registers.
   This reduces the number of cases that must be analyzed to see how broken
   the hard-coded 64 is:
   - with AVX present and not killed, inlining occurs up to size 256 (8
     pairs of vmovups through 256-bit registers)
   - else with SSE present and not killed, inlining occurs up to size 128
     (8 pairs of movaps through 128-bit registers).  This also depends on
     other CPU capabilities.  E.g., this uses movaps instead of movups
     since movups is slower on the target CPU (core2) even in the aligned
     case, while vmovups doesn't have this problem.
   - the above cases don't apply in the kernel
   - on 64-bit arches, inlining occurs up to size 64 (8 pairs of movq through
     64-bit integer registers)
   - on 32-bit arches, inlining occurs only up to size 32 (8 pairs of movl
     through 32-bit integer registers); these thresholds are easy to check
     with a small test function like the one sketched after this list.

   So the effect of the change for clang is to pessimize half of the
   cases that were supposed to be optimized on i386 (and probably
   similarly on other 32-bit arches): fixed-size moves of size > 32
   and <= 64 now use memmove() instead of bcopy().
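
For concreteness, here is a minimal test that can be compiled outside the
tree to see what a given compiler does with a fixed-size __builtin_memmove()
under kernel-like flags.  The names copy64 and struct blob are hypothetical,
not anything in the commit; build with e.g. -O2 -ffreestanding -mno-sse
-mno-avx and read the assembly to see whether the copy is inlined or turned
into a call to memmove():

    struct blob {
            unsigned char b[64];    /* vary this to probe the threshold */
    };

    void
    copy64(struct blob *dst, const struct blob *src)
    {
            /* A fixed-size copy: either inlined or a call to memmove(). */
            __builtin_memmove(dst, src, sizeof(*src));
    }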

Summary: this change wouldn't have passed my review.  I have used similar
changes for 15-20 years but never made them production quality since there
are too many variations to consider and testing showed insignificant
improvements except for micro-benchmarks.

This change should be harmless if done right.  E.g., it could spell 64
as (8 * sizeof(register_t)), as in the sketch after this paragraph, so
as not to pessimize i386 (and probably other 32-bit arches) in the same
way.  It is still too chummy with the
implementations.  x86 compilers used to get this wrong too, and probably
still do except when they can use AVX.  gcc got it wrong 20-30 years
ago on i386 by inlining to "rep movsl" up to sizes much larger than 32.
Eventually it learned that it didn't understand memory and left this
to the library.  Now on x86 CPUs that have "fast-strings", inlined
"rep movsb" is better than most libraries.  x86 kernel "libraries" still
reduce to "rep movs*" plus lots of extra instructions that more than
double the overhead.  With fast-strings, the problem with using "rep
movsb" is almost entirely in its overhead.  On Haswell, IIRC, it takes
25 cycles to start up.  Then it copies 32 bytes per cycle if both the
source and the target are in the L1 cache.  So the copy size must be
25 * 32 = 800 bytes just for the copy proper to take as long as the startup,
i.e., the startup still accounts for half of the total cost.  Even
copying through 32-bit integer registers is much faster than that.
There are often cache misses so that all methods are slow, but
micro-benchmarks rarely cover the slow cases.
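
To make the (8 * sizeof(register_t)) spelling concrete, here is a sketch of
the kind of kernel-side wrapper involved.  The names INLINE_COPY_MAX and
bcopy_small are hypothetical, not the committed code, and len may be
evaluated more than once, which is fine for a sketch:

    /*
     * Use the builtin only for small constant sizes, scaled to the register
     * width (64 bytes on 64-bit arches, 32 on i386 and other 32-bit arches),
     * and fall back to bcopy() otherwise.  In the kernel, register_t comes
     * from <sys/types.h> and bcopy() is declared in <sys/systm.h>.
     */
    #define INLINE_COPY_MAX (8 * sizeof(register_t))

    #define bcopy_small(from, to, len) do {                             \
            if (__builtin_constant_p(len) && (len) <= INLINE_COPY_MAX)  \
                    __builtin_memmove((to), (from), (len));             \
            else                                                        \
                    bcopy((from), (to), (len));                         \
    } while (0)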

I don't believe the claimed speedup from using the optimized bcopy()
in cpu_fetch_syscall_args() on amd64 (r333241).  This changes the copy
size from a variable to 6*8 bytes.  It is claimed to speed up getppid()
from 7.31M/sec to 10.65M/sec on Skylake.  But here it and other changes
in the last week give a small pessimization for getpid() on Haswell
at 4.08GHz: last week, 100M getpid()'s took 8.27 seconds (12.09M/sec).
Now they take 8.30 seconds (12.05M/sec).  (This is with very old
libraries, so there is no possibility of getpid() being done in
userland.)  0.03 seconds is 122.4M cycles, so the speed difference
is 1.224 cycles per call.  Here the timing has a resolution of only 0.01
seconds,
so most digits in this 1.224 are noise, but the difference is apparently
a single cycle.  I would have expected more like the "rep movs*" setup
overhead of 25 cycles.
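
For reference, the measurement style is roughly the following hypothetical
userland micro-benchmark (not the harness actually used).  Note that a libc
that caches the pid would have to be defeated, e.g. with syscall(SYS_getpid),
to force a real syscall on every iteration:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(void)
    {
            struct timespec t0, t1;
            long i, n = 100000000L;         /* 100M calls, as above */
            double secs;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < n; i++)
                    (void)getpid();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            secs = (t1.tv_sec - t0.tv_sec) +
                (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%.2f seconds, %.2fM calls/sec\n",
                secs, n / secs / 1e6);
            return (0);
    }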

Another way to look at this is: 12.09M/sec syscalls means 82.71 nsec
or 337 cycles each.  Not too bad.  The cache latency for this system
according to lmbench-2.0 is 1 nsec for L1, 4.5 nsec for L2, and 16.3
nsec for main memory (old lmbench doesn't understand L3...).  66 cycles
for main memory.  I don't quite believe that.  100-200 cycles is more
common.  I think Skylake is slightly slower.  Anyway, there can't be
many cache misses in 337 cycles.  We can hope for improvements of ~25
cycles each by avoiding "rep movs*" instructions.  bcopy() actually
has 2 of these (it ends with a single rep movsb, which is a null
operation except in misaligned cases; this started being pessimal with
the i486 and is relatively worse with fast-strings).  However, fixes in
this area
rarely make as much difference as hoped for, except possibly on in-order
arches.  On Haswell and Skylake, everything is pipelined, so extra
cycles in bcopy() don't matter provided that nothing depends on the
result very soon, that nearby code is equally inefficient, and that the
resources used by the bcopy() don't contend too much with the resources
used by the nearby code.  In practice the resources used by bcopy() do
contend a lot, but not 100%.

Bruce

