svn commit: r333324 - in head/sys: amd64/amd64 conf

Bruce Evans brde at optusnet.com.au
Mon May 7 23:35:29 UTC 2018


On Mon, 7 May 2018, Mateusz Guzik wrote:

> Log:
>  amd64: replace libkern's memset and memmove with assembly variants
>
>  memmove is repurposed bcopy (arguments swapped, return value added)
>  The libkern variant is a wrapper around bcopy, so this is a big
>  improvement.

This is a tiny improvement, perhaps negative.

As previously explained, the wrapper only costs a couple of cycles, and
bcopy() is not as good as it could be.
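
For reference, the wrapper in question amounts to something like this
(a sketch from memory, not the exact libkern file):

#include <stddef.h>
#include <strings.h>    /* bcopy() */

/*
 * What libkern's memmove() boiled down to: swap the argument order
 * for bcopy() and return the destination.  The name here is
 * illustrative.
 */
void *
memmove_wrapper(void *dst, const void *src, size_t len)
{

        bcopy(src, dst, len);
        return (dst);
}

The only overhead relative to calling bcopy() directly is the extra
call and the saved return value, i.e., the couple of cycles just
mentioned.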

But the simple C versions are much faster than the asm versions.  The
asm versions still use 2 "rep movs*", so they take about 50 cycles to
start up.  The simple C loop can do about 50 loads and stores in 50
cycles, so even with byte accesses it handles about 50 bytes before
the asm version does anything.  So this change is a small pessimization
if the size is usually small.
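
For comparison, the "simple C version" is nothing more than a byte
loop (a sketch, not the libkern code):

#include <stddef.h>

/*
 * Byte-at-a-time copy that handles overlap.  For sizes well below
 * ~50 bytes this finishes before a "rep movs" based version has even
 * paid its startup cost.
 */
static void *
memmove_simple(void *dst, const void *src, size_t len)
{
        char *d = dst;
        const char *s = src;

        if (d < s) {
                while (len--)
                        *d++ = *s++;
        } else {
                d += len;
                s += len;
                while (len--)
                        *--d = *--s;
        }
        return (dst);
}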

>  memset is repurposed memcpy. The librkern variant is doing fishy stuff,
>  including branching on 0 and calling bzero.

You mean the old libkern variant.  It just duplicates the inline
implementation in a stupid way.  Another style bug is that memset()
(and also index() and rindex()) is implemented as an inline in
<sys/libkern.h>, while the newer mem*() functions are implemented as
macros in <sys/systm.h>.  The inlines are harder to use and don't even
support the hack of special-casing a constant, small size.
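
Something along these lines would do it, using GCC/Clang builtins
(illustrative only, not the actual <sys/systm.h> code):

#include <string.h>

/*
 * Hypothetical macro-style wrapper: a constant, small size is
 * special-cased so the compiler can expand the whole store inline;
 * anything else falls through to the extern memset().
 */
#define memset_small(dst, c, len)                                       \
        (__builtin_constant_p(len) && (len) <= 16 ?                     \
            __builtin_memset((dst), (c), (len)) :                       \
            memset((dst), (c), (len)))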

libkern/memcmp.c is almost never used, and neither is your asm version,
so it shouldn't be optimized, your asm version shouldn't exist, and
the bugs in the asm version rarely matter.  It is just a fallback for
cases where inlining doesn't work and for use in function pointers.  I
don't see how the inlining can ever not work, but there is an ifdef
tangle that makes this non-obvious.

libkern/memset.c gets tangled up with the complications.  It can't just
use the inline version because of namespace problems.  With a macro
version, it would define itself as (memset) and then just invoke the
macro version as plain memset.  With better names, the inline version
would be named something like __memset(), memset() would be defined as
__memset(), and the extern version would just call __memset().  Instead,
libkern/memset.c defines LIBKERN_INLINE to tell <sys/libkern.h> to get
out of the way.  The macro is misnamed since it kills inlines instead of
enabling them.  Without it, all the inlines in libkern.h are
'static __inline'.  Since they are static, they are always used, so the
extern version is never used except possibly for function pointers.
(Inlining may fail, but then a static function is used.  This happens
mainly with -O0.)  The final details for memset() are that when
LIBKERN_INLINE is defined, it is used as part of the function prototype;
defining it also prevents the definition of LIBKERN_BODY, and that is
what kills production of the inline version.  So libkern/memset.c gets
only a prototype for memset().  Similarly, but bogusly, for other files
that define LIBKERN_INLINE.  But normal files get the inline version.
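
For reference, the plumbing being described looks roughly like this
(paraphrased from memory, not a verbatim copy of the header; size_t
and bzero() come from the surrounding kernel headers):

#ifndef LIBKERN_INLINE
#define LIBKERN_INLINE  static __inline
#define LIBKERN_BODY
#endif

LIBKERN_INLINE void     *memset(void *b, int c, size_t len);

#ifdef LIBKERN_BODY
LIBKERN_INLINE void *
memset(void *b, int c, size_t len)
{
        char *bb;

        if (c == 0)
                bzero(b, len);
        else
                for (bb = (char *)b; len--; )
                        *bb++ = c;
        return (b);
}
#endif

A file that defines LIBKERN_INLINE before including the header gets
only the prototype and no LIBKERN_BODY, so it can supply the extern
definition; everything else gets the 'static __inline' body.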

libc also has many silly "optimizations", even in its MI parts, so copying
it to create libkern is not so good.  E.g., on amd64, strcpy() has been
pessimized by turning it into a wrapper around stpcpy().  amd64 stpcpy()
is in asm and uses a masking trick to reduce to word-sized accesses.  This
is slower for short strings and probably should be MI if it is done at all.
OTOH, i386 strcpy() is not a wrapper, but is still the ~25-30 year old
version that does manual unrolling in asm.  At least it doesn't use
any x86 string instructions.  i386 used to have strlen() in asm, but this
was intentionally not done for amd64, and now both use the MI strlen()
which has the fancy masking optimization.  Compilers often inline strlen()
so it is unclear how often this is used.
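
The masking trick is the usual detect-a-zero-byte-in-a-word hack; a
simplified sketch (not the actual MI code, and ignoring the unaligned
head that a real implementation has to handle):

#include <stddef.h>
#include <stdint.h>

/*
 * Word-at-a-time strlen sketch: scan 8 bytes per iteration, stop at
 * the first word containing a zero byte, then locate the NUL within
 * it.  Assumes s is suitably aligned for 64-bit loads.
 */
static size_t
strlen_sketch(const char *s)
{
        const uint64_t ones = 0x0101010101010101ULL;
        const uint64_t highs = 0x8080808080808080ULL;
        const uint64_t *wp;
        const char *p;

        for (wp = (const uint64_t *)s;; wp++) {
                /* Nonzero iff some byte of *wp is zero. */
                if (((*wp - ones) & ~*wp & highs) != 0)
                        break;
        }
        for (p = (const char *)wp; *p != '\0'; p++)
                continue;
        return ((size_t)(p - s));
}

The word test is exact: that expression is nonzero iff some byte of
the word is zero, so the loop never stops early or late.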

>  Both functions are rather crude and subject to partial depessimization.

Indeed, they are fully pessimized using 2 "rep movs".

>  This is a soft prerequisite to adding variants utilizing the
>  'Enhanced REP MOVSB/STOSB' bit and let the kernel patch at runtime.

Ugh.  Dynamic changes also complicate debugging and profiling.  Stack
traces fill up with <function renamed> <value optimized out>.

I checked the effect of previous micro-optimizations for makeworld.
Makeworld for a non-bloated (~5.2) world does about 120 million calls
to copying functions (I added counters and zapped the "rep movsb" for
all such functions in amd64/support.S).  Previous optimizations for
inlining, or just zapping "rep movsb", saves about 50 cycles/call.
That is 6 billion cycles altogether, which take about 1.5 seconds at
4GHz, but HTT makes times about 50% longer, so scale this up to 2.25
seconds.  Divide by the number of CPUs (8) to get the improvement in
real time.  It is 0.28 seconds.  This is in the noise.  (The total
sys time is about 100 seconds and the total real time is about 150
seconds, and the noise is mostly in the idle time (standard deviation
about 1% for real time but about 4% for idle time).  The estimated
0.28 seconds is about 0.2% of the real time and could be confirmed
by careful measurements, but I only did a quick test that the sys
time was reduced by a few seconds.)

It is hard to get excited about optimizations of 0.2%, but that is
exactly what I am doing for my current project of scheduler
optimizations :-).

Bruce

