svn commit: r312600 - head/sys/kern

Sun Jan 22 12:41:21 UTC 2017

On Sun, 22 Jan 2017, Mateusz Guzik wrote:

> On Sun, Jan 22, 2017 at 12:07:56PM +1100, Bruce Evans wrote:
> ...
>> Later commits further unimproved style by adding __predict_ugly() and
>> long lines from blind substitution of that.   __predict_ugly() is almost
>> as useful as 'register', but more invasive.
>
> There were no __predict changes to this function.
>
> I did add some to vn_closefile in r312602.
>
> Indeed, one line is now 82 chars and arguably could be modified to fit
> the limit.
>
> I have to disagree about the usefulness remark. If you check generated
> assembly for amd64 (included below), you will see the uncommon code is
> moved out of the way and in particular there are no forward jumps in the
> common case.

Check benchmarks.  A few cycles here and there are in the noise.  Kernel
code has very few possibilities for useful optimizations since it doesn't
have many inner loops.

> With __predict_false:
>
> [snip prologue]
>   0xffffffff8084ecaf <+31>:	mov    0x24(%rbx),%eax
>   0xffffffff8084ecb2 <+34>:	test   $0x40,%ah
>   0xffffffff8084ecb5 <+37>:	jne    0xffffffff8084ece2 <vn_closefile+82>

All that this does is as you say -- invert the condition to jump to the
uncommon code.  This made more of a difference on old CPUs (possiblly
still on low end/non-x86).  Now branch predictors are too good for the
slow case to be much slower.

I think x86 branch predictors initially predict forward branches as not
taken and backward branches as taken.  So this branch was initially
mispredicted, and the change fixes this.  But the first branch doesn't
really matter.  Starting up takes hundreds or thousands of cycles for
cache misses.

Moving the uncommon code out of the way reduces Icache pressure a bit.
I've never been able to measure a significant difference from this.

However, I usually compile with -Os to reduce Icache pressure.  This
is almost as good for speed as -O (typically at most 5% slower than
-O2 for kernels). The other day, I tried compiling with -O2
-march=highest-supported and found that on a network microbenchmark
which was supposed to benefit from this, it was significantly slower.
It was with a very old version of gcc.  I think this is because -O2
rearranges the code a lot, and since at least the old version of gcc
doesn't understand caches at all, the arrangement accidentally ended
up much worse and caused more cache misses.  With another compiler and
NIC, -O2 gave the expected speeding.

>   0xffffffff8084ecb7 <+39>:	mov    0x24(%rbx),%esi
>   0xffffffff8084ecba <+42>:	mov    0x10(%rbx),%rdx
>   0xffffffff8084ecbe <+46>:	mov    %r14,%rdi
>   0xffffffff8084ecc1 <+49>:	mov    %r15,%rcx
>   0xffffffff8084ecc4 <+52>:	callq  0xffffffff80850000 <vn_close>
>   0xffffffff8084ecc9 <+57>:	mov    %eax,%r15d
>   0xffffffff8084eccc <+60>:	mov    0x24(%rbx),%eax
>   0xffffffff8084eccf <+63>:	test   $0x40,%ah
>   0xffffffff8084ecd2 <+66>:	jne    0xffffffff8084ecf5 <vn_closefile+101>
> [snip cleanup]
>   0xffffffff8084ece1 <+81>:	retq

The code for he uncommon case is now here (not shown).  It will be no
smaller (actually slightly larger for a branch back) unless it can be
shared.

>
> Without __predict_false:
> [snip prologue]
>   0xffffffff8084ecaf <+31>:	mov    0x24(%rbx),%eax
>   0xffffffff8084ecb2 <+34>:	test   $0x40,%ah
>   0xffffffff8084ecb5 <+37>:	je     0xffffffff8084ecc8 <vn_closefile+56>

The branch is not very far away, so perhaps the optimization from the move
is null.

>   0xffffffff8084ecb7 <+39>:	movzwl 0x20(%rbx),%eax
>   0xffffffff8084ecbb <+43>:	cmp    $0x1,%eax
>   0xffffffff8084ecbe <+46>:	jne    0xffffffff8084ecc8 <vn_closefile+56>
>   0xffffffff8084ecc0 <+48>:	mov    %r14,%rdi
>   0xffffffff8084ecc3 <+51>:	callq  0xffffffff8083e270 <vrefact>

The branch was to here <+56>.  The function is not very well aligned.  It
has address 0x90 mod 0x100, so starts 0x10 into a 64-bit cache line (is that
still the line size for all modern x86?).  So 48 bytes are in a line, but
this +56 is in the next line.

But aligning functions and branches to cache line boundaries works poorly
in general.  It wastes Icache in another way.  -march affects this a lot,
and -march=higher is often worse since the compiler doesn't really understand
this and does more or less alignment than is good.  Here it didn't bother
aligning even 1 of the branch targets.

>   0xffffffff8084ecc8 <+56>:	mov    0x24(%rbx),%esi
>   0xffffffff8084eccb <+59>:	mov    0x10(%rbx),%rdx
>   0xffffffff8084eccf <+63>:	mov    %r14,%rdi
>   0xffffffff8084ecd2 <+66>:	mov    %r15,%rcx

The main code ends up just +17 later at <+69>.  Fitting it all below <+64>
would be better but is not necessary if the instructions are slow enough
to all the instruction fetch to run ahead.  This depends on good prediction
to not start too many instructions in paths that won't be followed.

>   0xffffffff8084ecd5 <+69>:	callq  0xffffffff80850000 <vn_close>
>   0xffffffff8084ecda <+74>:	mov    %eax,%r15d
>   0xffffffff8084ecdd <+77>:	mov    0x24(%rbx),%eax
>   0xffffffff8084ece0 <+80>:	test   $0x40,%ah
>   0xffffffff8084ece3 <+83>:	je     0xffffffff8084ed45 <vn_closefile+181>
>   0xffffffff8084ece5 <+85>:	movzwl 0x20(%rbx),%eax
>   0xffffffff8084ece9 <+89>:	cmp    $0x1,%eax
>   0xffffffff8084ecec <+92>:	jne    0xffffffff8084ed45 <vn_closefile+181>
>   0xffffffff8084ecee <+94>:	movw   $0x0,-0x22(%rbp)
> [skip some parts, +181 marks the beginning of cleanup]
>   0xffffffff8084ed52 <+194>:	retq

Bruce