svn commit: r333324 - in head/sys: amd64/amd64 conf

Bruce Evans brde at optusnet.com.au
Tue May 8 01:37:44 UTC 2018


On Tue, 8 May 2018, Bruce Evans wrote:

> On Mon, 7 May 2018, Mateusz Guzik wrote:
>
>> Log:
>>  amd64: replace libkern's memset and memmove with assembly variants
>> ...
>>  memset is repurposed memcpy. The librkern variant is doing fishy stuff,
>>  including branching on 0 and calling bzero.
>
> You mean the old libkern variant.  It just duplicates the inline
> implementation in a stupid way.  Another style bug in this is that the
> memset() (and also index() and rindex()) is implemented as an inline in
> <sys/libkern.h>, when newer mem*() are implemented as macros in
> <sys/systm.h>.  The inlines are harder to use and don't even support
> the hack of a special case if the size is constant and small.
>
> libkern/memcmp.c is almost never used, and neither is your asm version,
> so it shouldn't be optimized and your asm version shouldn't exist, and
> the bugs in the asm version rarely matter.  It is just a fallback for
> cases where inlining doesn't work and for use in function pointers.  I
> don't see how the inlining can ever not work, but there is an ifdef
> tangle that makes this non-obvious.

I checked in an old i386 kernel that the extern memset() is not called even
once during a makeworld.

But several object files reference it.  For a newer amd64 kernel, these
files are:
- huf_decompress.o
- iflib.o
- kern_event.o
- memset.o
- nfs_clvsops.o
- ppp_msq.o
- vga.o
- zstd_opt.o (this uses a private static version)

Bah, my test for this was broken because the naming error of having a private
copy of memset broke debugging -- ddb put the breakpoint at the version in
zstd_opt.o.  Repeating the makeworld with breakpoints at both versions still
showed no calls.

Debugging this in the preprocessed sources for vga.c shows that memset()
is never called there.  However, bzero() is called a lot, and it expands
to __builtin_memset() for the case that is supposed to be inlining
(constant size <= 64), but the compiler (gcc) generates code that calls
memset() anyway, and the extern memset() is needed for this fallback.
But apparently, the inlining is not that bad, so the memset() is never
actually called.

Something is wrong, since the calls to memset() are all with size 384
Oops.  The calls are actually to initialize local variables.  Some
code has the good pessimization of large local variables with
auto-initiazation to 0 on every entry to the function.  At least gcc
apparently uses memset() to initialize such variables, although this
is invalid in freestanding mode.  So the extern memset() is not just
a fallback for direct pessimizations.  The limited size of the kernel
stack makes this pessimization relatively rare.

Further analysis of calls to memset() in a full amd64 kernel:
- only 20 calls total
- the sizes are variable(perhaps for a VLA), 0x10, 0x10, 0x10, 0x10,
   0x2004, 0x4004, 0x4004, 0x2018, 0x60, 0xc0, 0x2a8, 0xf8, 0x100,
   0x180, 0x180, 0x180, 0x180, 0x180, 0x180 (the last 6 all for vga.c
   where the pessimization is in the slow path).
So almost no problems with large zero-initialized variables on the
stack either.

clang does excessive inlining, so of course it inlines most of these
initializations.  It actually inlines all except 6 (ones with size
0x2004, 0x2004, 0x4004, 0x4004, 0xf8, 0x100).  For the size 0x180 in
vga.c, clang generates only memcpy() where gcc generates memset() and
then load-stores and/or memcpy().  Reducing the slowness of large
auto-initialized local variables is an interesting problem.  memcpy()
from static data is simplest, but for sparse data memset() followed
by a few loads and stores is probably better.  I would have expected
clang to do the fancier optimization with memset().  I use -Os to
limit such optimizations (they are especially useless in the initialization
code in vga.c) but forget to use it for clang in this test.  So it is
even more surprising that gcc apparently does the fancier optimization
with memset().  memcpy() from static data would give smaller code though
larger data, and I think -Os is mostly to optimized code for space.

Bruce


More information about the svn-src-head mailing list