svn commit: r333240 - in head/sys: powerpc/powerpc sys

Bruce Evans brde at optusnet.com.au
Sat May 5 01:46:58 UTC 2018


On Sat, 5 May 2018, Mateusz Guzik wrote:

> On Fri, May 4, 2018 at 5:53 PM, Brooks Davis <brooks at freebsd.org> wrote:
>
>> On Fri, May 04, 2018 at 04:00:48AM +0000, Mateusz Guzik wrote:
>>> Author: mjg
>>> Date: Fri May  4 04:00:48 2018
>>> New Revision: 333240
>>> URL: https://svnweb.freebsd.org/changeset/base/333240
>>>
>>> Log:
>>>   Allow __builtin_memmove instead of bcopy for small buffers of known
>> size
>>
>> What is the justification for forcing a size rather than using the
>> compiler's built-in fallback to memmove?  Is the kernel compilation
>> environment disabling that?
>>
> It will fallback to memmove which is super pessimized as being wrapped
> around bcopy.

Actually, the pessimization is tiny.  A quick test in userland on freefall
gives:
- 22.81 seconds for 10**9 bcopy()s of 128 bytes (75 cycles each; bandwidth
   5.61G/sec)
- 23.43 seconds for 10**9 mymemmove()s of 128 bytes where mymemmove() ==
   kernel wrapper of bcopy() (77.3 cycles each; bandwidth 5.46G/sec)
but that was only for the first run.  On another run, the bcopy()s took
23.11 seconds and the mymemmove()s took 22.62 seconds.  So the errors
in the measurement are much larger than the 2-cycle difference.  Nevertheless,
I expect the difference to be about 2 cycles or 3%.
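
For reference, a minimal sketch of this kind of test looks like the
following (the names, sizes and timing details are illustrative, not the
exact test above).  mymemmove() imitates the kernel's memmove(), which
just wraps bcopy(); compile with -fno-builtin so the calls really go to
libc and the loop is not optimized away:

#include <stddef.h>
#include <stdio.h>
#include <strings.h>
#include <time.h>

static char src[128], dst[128];

/* Kernel-style memmove(): just a wrapper around bcopy(). */
static void *
mymemmove(void *d, const void *s, size_t n)
{

	bcopy(s, d, n);
	return (d);
}

int
main(void)
{
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 1000000000L; i++)
		mymemmove(dst, src, sizeof(dst));
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.2f seconds\n", (t1.tv_sec - t0.tv_sec) +
	    (t1.tv_nsec - t0.tv_nsec) * 1e-9);
	return (0);
}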

Most of the bandwidth is wasted by both methods.  gcc inlines memcpy()
to 16 movq pairs and this takes 4.38 to 5.16 seconds (it tends to get
faster with every run, which I suspect is due to SCHED_ULE affinity not
understanding HTT but getting better with training -- I have been fixing
this in SCHED_4BSD).
- 4.38 seconds for 10**9 memcpy()s of 128 bytes (14.5 cycles each;
   bandwidth 29.22G/sec)
Even without fancy -march, clang inlines memcpy() to movaps on vanilla
amd64 since SSE is available:
- 2.84-2.89 seconds for 10**9 memcpy()s of 128 bytes (9.4 cycles each;
   bandwidth 45.07G/sec).
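
To see the inlining, a throwaway function like this is enough (copy128()
is just an illustrative name); with optimization, gcc expands the
constant-size memcpy() to the movq pairs and clang to movaps moves, which
"cc -O2 -S" shows directly:

#include <string.h>

void
copy128(void *d, const void *s)
{

	/* Constant size, so the compiler expands the copy inline. */
	memcpy(d, s, 128);
}
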
clang does a weird optimization for my counting loop unless its counter is
declared volatile -- it doesn't optimize the loop to a single copy, but
unrolls it so that the count is reduced by 10 each iteration, instead of
the expected 1 or the loop disappearing entirely (a sketch of the loop
follows the next result).  Semi-semi-finally, with -march=native to get
AVX instead of SSE:
- 1.65-1.88 seconds for 10**9 memcpy()s of 128 bytes (5.5 cycles each;
   bandwidth 77.57G/sec).
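
The counting loop in question looked something like this (the buffers and
the body are illustrative, not the exact code):

#include <string.h>

static char src[128], dst[128];

void
run_test(void)
{
	/*
	 * Without the volatile, clang partially unrolls the loop and
	 * advances the counter by 10 per iteration as described above.
	 */
	volatile long i;

	for (i = 0; i < 1000000000L; i++)
		memcpy(dst, src, sizeof(dst));	/* inlined by the compiler */
}
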
That AVX figure is still far short of the 128 (non-disk-manufacturers')
G/sec that is reached by Haswell at 4GHz using simple "rep movsb".  The
size is too small to
amortize the overhead.  Semi-finally, with AVX and the size doubled to
256 bytes:
- 2.98-3.06 seconds for 10**9 memcpy()s of 256 bytes (9.8 cycles each;
   bandwidth 85.90G/sec).
Finally, with the size increased to 4K and the count reduced to 10M:
- 0.62-0.66 seconds for 10**7 memcpy()s of 4K bytes (204.6 cycles each;
   bandwidth 66.02G/sec).
Oops, that wasn't final.  It shows that freefall's Xeon behaves much like
Haswell.  The user library is clueless about all this and just uses
"rep movsq" and even 4K is too small to amortize the startup overhead of
not-so-fast-strings.  With the size increased to 8K and the count kept at
10M:
- 1.12-1.16 seconds for 10**7 memcpy()s of 8K bytes (369.6 cycles each;
   bandwidth 73.14G/sec).
This almost reaches the AVX bandwidth.  But the bandwidth is only this high
with both the source and the target in the L1 cache.  Doubling the size
again runs out of L1.  So there is no size large enough to amortize the
~25 cycle startup overhead of not-so-fast-strings.
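
For concreteness, the "rep movsq" style of copy is roughly the following
(illustrative inline asm, not the actual libc bcopy.S); the string
instructions pay that ~25 cycle startup cost no matter how small the
copy is:

#include <stddef.h>

void
repmovsq_copy(void *d, const void *s, size_t n)
{
	size_t nquad = n >> 3;	/* assumes n is a multiple of 8 */

	__asm__ __volatile__("rep movsq"
	    : "+D" (d), "+S" (s), "+c" (nquad)
	    :
	    : "memory");
}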

> These limits will get lifted after the fallback routines get sorted out.

They should be left out to begin with.  Any hard-coded limits are sure to
be wrong for many cases.  E.g., 32 is wrong on i386.  Using
__builtin_memmove() at all is a pessimization for gcc-4.2.1, but that can
be fixed (reduced to a null change) by sorting out memmove().
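
For reference, the kind of size-limited dispatch being discussed looks
roughly like this (the cutoff and exact form are illustrative, not the
committed macro):

/*
 * Sketch only: any hard-coded cutoff (64 here is arbitrary) is wrong
 * for some target, which is the point above.
 */
#define	bcopy(from, to, len) __extension__ ({			\
	if (__builtin_constant_p(len) && (len) <= 64)		\
		__builtin_memmove((to), (from), (len));		\
	else							\
		(bcopy)((from), (to), (len));			\
})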

Bruce

