svn commit: r343985 - head/sys/kern

Mon Feb 11 07:37:02 UTC 2019

On Sun, 10 Feb 2019, Conrad Meyer wrote:

> ...
>> This is the slowest correct fix in the PR followup.  kib predicted
>> that I wouldn't like it.  It does 2 64-bit divmods (after optimization)
>> and many multiplications per call.  Times 2 calls.  clang will probably
>> inline this, giving only 3 64-bit divmods instead of 4.
>
> Did you measure any of this, or is this speculation?  I plugged both
> versions into Godbolt just for amusement: https://godbolt.org/z/KE_FF8
> (GCC 8.2), https://godbolt.org/z/WSepYg (Clang 7.0.0).

This was unreadable using lynx.

I tested only later versions.  Counting divmods is adequate for seeing
how slow various methods are.  divmod is so slow that branches don't matter.

> Andrey's version has no branches; yours has two conditional branches
> as well as a large NOP to align the branch target (GCC); Clang manages
> only a single branch and doesn't pad the branch target.

I doubt that you tested gcc-4.2.1 or 32-bit arches in Godbolt, or the
flags that I use.  64-bit division is a libcall on i386.  This has many
more than 2 branches, and large code just to pass args.  I use

     gcc-4.2 [-m32 -mtune=i386] -Os -fno-inline-functions
 	-fno-inline-functions-called-once,

and this avoids all nops for alignment.  -Os is also an optimization for
time except compared very agressive optimizations for time that I don't
want since they are harder to debug using ddb.  For makeworld, all of
clang's optimizations for time at the cost of space and debugability
only gain about 10% sys time or 1% real time relative to my negative
optimizations for time.

The alignment might be the right thing for optimization.  I don't
like the __predict*() ugliness much, but here it serves to document
the optimization.  It says that the compiler should move the 'false'
case far away.  This might need nops that pessimize it further.

> Andrey's version has five divs at gcc8.2 -O2, and six imuls.

Not too good.  My -fno-inline-functions should prevent auto-inlining
of the function.  In practice, this flag is essential to unbreak
gcc-4.2 -Os, and unusable with clang since it breaks inlining of all
functions that are explicitly declared as inline, e.g., curthread.

Not inlining prevents reuse of common expressions.  5 divs is still
a lot.  Statically, the committed version has 3 muls, 3 divs and 3
mods per call.

I also considered imuls vs muls.  Muls are much faster than divs and
mods, so this doesn't matter a lot.  The code is pessimized by using
unsigned types.  This makes it hard to use imul instead of mul on
x86.  x86 has a few more imul instructions than mul instructions.
Some of the extras use less registers so should be preferred.

> In the happy case, your version has two cmp+ja, two divs, and two
> imuls.  In the unhappy case, your version has two cmp+ja, three div,
> and four imul.  Just eyeballing it, your code might be marginally
> larger, but it's fairly similar.

The unhappy case is about 0.01% of cases except for watching all threads
in the system where at least the idle threads have large runtimes and are
a much larger proportion of all threads than 0.01%.  Watching many threads
is inherently slow so I don't care much about it.

3 divs is already much faster than 5.  div takes about 20 cycles on x86
(slightky slower with 64 bits) and is hard to optimize, so other arches
(including x86 floating point) can't be much faster without using too much
hardware.

Look at the i386 code.  It is much larger for other reasons.  Mostly for
the unhappy case.

> Does it matter?  I doubt it.  Modern CPUs are crazy superscalar OOO

Not a lot.  i386 has enormous pessimizations and it is impossible to
recover using micro-optimizations like this.  Times for getrusage()
on Haswell i386:

- FreeBSD-~5.2           Athlon-XP: 2444 cycles
- FreeBSD-~11            Haswell:   1970 cycles
- FreeBSD-cur pae_mode=0 Haswell:   3260 cycles
- FreeBSD-cur pae_mode=1 Haswell:   5040 cycles (all anti-spectre, etc. off)

The last 2 have my fix with its small optimization.  All my other
optimizations reduce the 5040 cycles by 500.

amd64 in other benchmarks is slightly slower than in FreeBSD-11.

> magic and as long as there aren't bad data dependencies, it can cruise
> along.

It is already using that to run i386 64-bit code reasonably fast.
Dependencies are further apart because there is more to do in between.
But there are more branches (2 instead of 1 just to compare 64-bit
values), and divs are a larger problem.  I haven't noticed any x86 with
more than one ALU handling div.  Pipelines tend to stall waiting for div.
And values that actually need 64 bits need more than 1 div.

> All values reside in registers and imul isn't much slower than
> add.

I think most non-old x86 have 2 or 3 ALUs for mul, with a throughput of
1 per ALU per cycle and only the latency a bit worse than for add.

> div is a bit slower, but probably cheaper than an L1 miss.  Feel

Much slower than add.  More like 1 per 20 cycles throughput and latency.
But cache misses are hundreds of cycles.

Micro-benchmarks tend not to see cache misses.  The above benchmarks
just call getrusage() in a loop.  Everything should stay cached.  I
think lots of the large 5040 cycle time is for TLB misses.  FreeBSD-11
doesn't have those.  pae_mode=1 apparently makes them much larger.

> free to measure and demonstrate a difference if you feel it is
> important.  I don't care, as long as it's correct (which it was not
> for the past 14 years).

Easy using micro-benchmarks but not in normal use.  In normal operation,
calcru1() is only called for accumulating times in exit().

It has been incorrect for more like 25 years (since BSD-4.4Lite1 and
FreeBSD-2).  14 years is only the age of the first (?) PR about it.

Bruce