svn commit: r333240 - in head/sys: powerpc/powerpc sys

Mateusz Guzik mjguzik at gmail.com
Sat May 5 16:08:16 UTC 2018


On Sat, May 5, 2018 at 2:38 AM, Bruce Evans <brde at optusnet.com.au> wrote:

> I don't believe the claimed speeding of using the optimized bcopy()
> in cpu_fetch_syscall_args() on amd64 (r333241).  This changes the copy
> size from a variable to 6*8 bytes.  It is claimed to speed up getppid()
> from 7.31M/sec to 10.65M/sec on Skylake.  But here it and other changes
> in the last week give a small pessimization for getpid() on Haswell
> at 4.08GHz: last week, 100M getpid()'s took 8.27 seconds (12.09M/sec).
> Now they take 8.30 seconds (12.05M/sec).  (This is with very old
> libraries, so there is no possibility of getpid() being done in
> userland.)  0.03 seconds is 122.4M cycles.  So the speed difference
> is 1.224 cycles.  Here the timing has a resolution of only 0.01 seconds,
> so most digits in this 1.224 are noise, but the difference is apparently
> a single cycle.  I would have expected more like the "rep movs*" setup
> overhead of 25 cycles.
>
>
This mail only deals with the performance claims on amd64. I'll get to
gcc 4.2.1 vs 32-bit archs and the other claims later.

It is unclear to me whether you actually benchmarked syscall performance
before and after the change, or how you measured a slowdown in getpid.

This mail outlines what was tested and how.

If you want, you can mail me off list and we can arrange for you to get
root access to the test box so that you can measure things yourself,
boot your own kernel and whatnot.

My preferred way of measurement is with this suite:
https://github.com/antonblanchard/will-it-scale

Unfortunately it requires a little bit of patching to set up.

The upside is cheap processing: a dedicated process/thread collects
results once a second; other than that, the processes/threads running
the test only perform the test and bump the iteration counter.
getppid looks like this:
        while (1) {
                getppid();
                (*iterations)++;
        }
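
For reference, below is a minimal C sketch of that measurement scheme
(this is not the actual will-it-scale source; only a single worker is
shown and the output format is merely mimicked):

#include <sys/types.h>
#include <sys/mman.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        volatile unsigned long *iterations;
        unsigned long prev, cur;
        pid_t pid;

        /* Counter shared between the worker and the collector. */
        iterations = mmap(NULL, sizeof(*iterations),
            PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0);
        if (iterations == MAP_FAILED)
                err(1, "mmap");

        pid = fork();
        if (pid == -1)
                err(1, "fork");
        if (pid == 0) {
                /* Worker: only the syscall and the counter bump. */
                for (;;) {
                        getppid();
                        (*iterations)++;
                }
        }

        /* Collector: sample the counter once a second. */
        prev = 0;
        for (;;) {
                sleep(1);
                cur = *iterations;
                printf("min:%lu max:%lu total:%lu\n", cur - prev,
                    cur - prev, cur - prev);
                prev = cur;
        }
}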

If you are interested in trying it out yourself without getting access
to the box in question, I'm happy to provide a bundle which should be
easy to compile.

Perhaps you are using the tool which can be found here:
tools/tools/syscall_timing

It reports significantly lower numbers (even 20% lower) because the
test loop itself simply has more overhead.

For the first testing method the results are as I wrote in the commit
message, with one caveat: I disabled frequency scaling and the numbers
went down a little (i.e. NOT a boosted frequency; having boost disabled
makes things a tad slower, while *boosted* numbers are significantly
higher but also not reliable).

The syscall path has a long-standing pessimization of using a function
pointer to get to the arguments. This fact was utilized to provide
different implementations switchable at runtime (thanks to kgdb -w and
the set command).

The variants include:
1. cpu_fetch_syscall_args

This is the routine as present in head, i.e. with an inlined memcpy.

2. cpu_fetch_syscall_args_bcopy

The state prior to my changes, i.e. a bcopy call with a variable size.

3. cpu_fetch_syscall_args_oolmemcpy

A memcpy call with a constant size, forced to not be inlined; the code
itself is the same as for the regular memcpy.

4. cpu_fetch_syscall_args_oolmemcpy_es

A memcpy call with a constant size, forced to not be inlined; the code
was modified to utilize the 'Enhanced REP MOVSB/STOSB' (ERMS) feature
present on Intel CPUs made after roughly 2011.

The code can be found here: https://people.freebsd.org/~mjg/copyhacks.diff
(note: declarations were not added, so WERROR= or similar is needed to
get this to compile)
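
For illustration only, here is a heavily simplified C sketch of what
the difference between variants 2 and 1 boils down to (the function
names and MAXARGS are made up; this is not the code from the diff, and
the real routine has more work to do than just the copy):

#include <string.h>
#include <strings.h>

#define MAXARGS 6

/* Variant 2: variable size, ends up calling the bcopy routine. */
void
fetch_args_bcopy(long *dst, const long *src, int narg)
{
        bcopy(src, dst, narg * sizeof(*src));
}

/*
 * Variant 1: constant 6*8 bytes, inlined by the compiler into plain
 * moves with no call and no rep prefix setup.
 */
void
fetch_args_inline(long *dst, const long *src)
{
        memcpy(dst, src, MAXARGS * sizeof(*src));
}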

Variants were switched at runtime like this:
# kgdb -w
(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy

The frequency is fixed with:

# sysctl dev.cpu.0.freq=2100

PTI was disabled (vm.pmap.pti=0 in loader.conf).

Now, quick numbers from will-it-scale:
(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy

min:7017152 max:7017152 total:7017152
min:7023115 max:7023115 total:7023115
min:7018879 max:7018879 total:7018879

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args

min:9914511 max:9914511 total:9914511
min:9915234 max:9915234 total:9915234
min:9914583 max:9914583 total:9914583

But perhaps you don't trust this tool and prefer the in-base one. Note
the higher overhead of the test infrastructure, and thus the lower
numbers.

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_bcopy

getppid 20      1.061986919     6251542 0.000000169

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_oolmemcpy

getppid 79      1.062522431     6245666 0.000000170

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args_oolmemcpy_es

getppid 107     1.059988384     7538473 0.000000140

(kgdb) set elf64_freebsd_sysvec.sv_fetch_syscall_args=cpu_fetch_syscall_args

getppid 130     1.059987532     8057928 0.000000131

As you can see, the original code (doing bcopy) is way slower than the
inlined variant.

Not taking advantage of the ERMS bit can now be fixed at runtime thanks
to the recently landed ifunc support.
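
As a sketch of the idea (userland-flavored: it assumes a GCC/Clang
toolchain with the ifunc attribute and __get_cpuid_count; the kernel
has its own ifunc machinery, and the function names are made up), the
runtime selection can look like this:

#include <cpuid.h>
#include <stddef.h>
#include <string.h>

typedef void copy_args_t(void *, const void *, size_t);

static void
copy_args_plain(void *dst, const void *src, size_t len)
{
        memcpy(dst, src, len);
}

static void
copy_args_erms(void *dst, const void *src, size_t len)
{
        /* On ERMS-capable CPUs rep movsb is fast even for small sizes. */
        __asm__ __volatile__("rep movsb"
            : "+D" (dst), "+S" (src), "+c" (len)
            :
            : "memory");
}

/* Runs once at bind time: CPUID leaf 7, EBX bit 9 is ERMS. */
static copy_args_t *
copy_args_resolver(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
            (ebx & (1u << 9)))
                return (copy_args_erms);
        return (copy_args_plain);
}

copy_args_t copy_args __attribute__((ifunc("copy_args_resolver")));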

The bcopy code can be further depessimized:

ENTRY(bcopy)
        PUSH_FRAME_POINTER
        xchgq   %rsi,%rdi

xchg is known to be slow; the preferred way is to swap the registers
"by hand". The fact that the function always has to do this is a bug on
its own. Interestingly, memmove *does not have* to do it, so this
provides even more reason to just remove this function in the first
place.

        movq    %rdi,%rax
        subq    %rsi,%rax
        cmpq    %rcx,%rax                       /* overlapping && src < dst? */
        jb      1f

This branch is avoidable (for memcpy-compatible callers), although it
is statically predicted as not taken (memcpy-friendly).

        shrq    $3,%rcx                         /* copy by 64-bit words */
        rep
        movsq
        movq    %rdx,%rcx
        andq    $7,%rcx                         /* any bytes left? */
        rep
        movsb

The old string copy. Note that reordering andq prior to rep movsb can help
some microarchitectures.

The movsq -> movsb combo induces significant stalls in the CPU frontend.

If the target size is a multiple of 8 and is known at compilation time, we
can get away with calling a variant which only deals with this case and
thus avoid the extra penalty.
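
For example, a C-level sketch of such a dedicated variant (made-up
name, not the kernel code) could be:

#include <stddef.h>
#include <stdint.h>

/*
 * Copy a size that is known to be a multiple of 8 bytes: there is no
 * byte tail, hence no movsq -> movsb switch and no setup cost for a
 * second rep prefix.
 */
static inline void
copy_words(uint64_t *dst, const uint64_t *src, size_t nwords)
{
        while (nwords-- > 0)
                *dst++ = *src++;
}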

Either way, the win *is here* and is definitely not in the range of 3%.

-- 
Mateusz Guzik <mjguzik gmail.com>

