svn commit: r312600 - head/sys/kern

Sun Jan 22 12:57:22 UTC 2017

On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
> 
> > On Sun, Jan 22, 2017 at 12:07:56PM +1100, Bruce Evans wrote:
> > ...
> >> Later commits further unimproved style by adding __predict_ugly() and
> >> long lines from blind substitution of that.   __predict_ugly() is almost
> >> as useful as 'register', but more invasive.
> >
> > There were no __predict changes to this function.
> >
> > I did add some to vn_closefile in r312602.
> >
> > Indeed, one line is now 82 chars and arguably could be modified to fit
> > the limit.
> >
> > I have to disagree about the usefulness remark. If you check generated
> > assembly for amd64 (included below), you will see the uncommon code is
> > moved out of the way and in particular there are no forward jumps in the
> > common case.
> 
> Check benchmarks.  A few cycles here and there are in the noise.  Kernel
> code has very few possibilities for useful optimizations since it doesn't
> have many inner loops.
> 
> > With __predict_false:
> >
> > [snip prologue]
> >   0xffffffff8084ecaf <+31>:	mov    0x24(%rbx),%eax
> >   0xffffffff8084ecb2 <+34>:	test   $0x40,%ah
> >   0xffffffff8084ecb5 <+37>:	jne    0xffffffff8084ece2 <vn_closefile+82>
> 
> All that this does is as you say -- invert the condition to jump to the
> uncommon code.  This made more of a difference on old CPUs (possiblly
> still on low end/non-x86).  Now branch predictors are too good for the
> slow case to be much slower.
> 
> I think x86 branch predictors initially predict forward branches as not
> taken and backward branches as taken.  So this branch was initially
> mispredicted, and the change fixes this.  But the first branch doesn't
> really matter.  Starting up takes hundreds or thousands of cycles for
> cache misses.
This is only true if branch predictor memory is large enough to keep the
state for the given branch between exercising it.  Even if the predictor
state could be attached to every byte in the icache, or more likely,
every line in the icache or uop cache, it still probably too small to
survive between user->kernel transitions for syscalls.  Might be there is
performance counter which shows branch predictor mis-predictions.

In other words, I suspect that there almost all cases might be
mis-predictions without manual hint, and mis-predictions together with
the full pipeline flush on VFS-intensive load very well might give tens
percents of the total cycles on the modern cores.

Just speculation.
> 
> Moving the uncommon code out of the way reduces Icache pressure a bit.
> I've never been able to measure a significant difference from this.
> 
> However, I usually compile with -Os to reduce Icache pressure.  This
> is almost as good for speed as -O (typically at most 5% slower than
> -O2 for kernels). The other day, I tried compiling with -O2
> -march=highest-supported and found that on a network microbenchmark
> which was supposed to benefit from this, it was significantly slower.
> It was with a very old version of gcc.  I think this is because -O2
> rearranges the code a lot, and since at least the old version of gcc
> doesn't understand caches at all, the arrangement accidentally ended
> up much worse and caused more cache misses.  With another compiler and
> NIC, -O2 gave the expected speeding.
> 
> >   0xffffffff8084ecb7 <+39>:	mov    0x24(%rbx),%esi
> >   0xffffffff8084ecba <+42>:	mov    0x10(%rbx),%rdx
> >   0xffffffff8084ecbe <+46>:	mov    %r14,%rdi
> >   0xffffffff8084ecc1 <+49>:	mov    %r15,%rcx
> >   0xffffffff8084ecc4 <+52>:	callq  0xffffffff80850000 <vn_close>
> >   0xffffffff8084ecc9 <+57>:	mov    %eax,%r15d
> >   0xffffffff8084eccc <+60>:	mov    0x24(%rbx),%eax
> >   0xffffffff8084eccf <+63>:	test   $0x40,%ah
> >   0xffffffff8084ecd2 <+66>:	jne    0xffffffff8084ecf5 <vn_closefile+101>
> > [snip cleanup]
> >   0xffffffff8084ece1 <+81>:	retq
> 
> The code for he uncommon case is now here (not shown).  It will be no
> smaller (actually slightly larger for a branch back) unless it can be
> shared.
> 
> >
> > Without __predict_false:
> > [snip prologue]
> >   0xffffffff8084ecaf <+31>:	mov    0x24(%rbx),%eax
> >   0xffffffff8084ecb2 <+34>:	test   $0x40,%ah
> >   0xffffffff8084ecb5 <+37>:	je     0xffffffff8084ecc8 <vn_closefile+56>
> 
> The branch is not very far away, so perhaps the optimization from the move
> is null.
> 
> >   0xffffffff8084ecb7 <+39>:	movzwl 0x20(%rbx),%eax
> >   0xffffffff8084ecbb <+43>:	cmp    $0x1,%eax
> >   0xffffffff8084ecbe <+46>:	jne    0xffffffff8084ecc8 <vn_closefile+56>
> >   0xffffffff8084ecc0 <+48>:	mov    %r14,%rdi
> >   0xffffffff8084ecc3 <+51>:	callq  0xffffffff8083e270 <vrefact>
> 
> The branch was to here <+56>.  The function is not very well aligned.  It
> has address 0x90 mod 0x100, so starts 0x10 into a 64-bit cache line (is that
> still the line size for all modern x86?).  So 48 bytes are in a line, but
> this +56 is in the next line.
> 
> But aligning functions and branches to cache line boundaries works poorly
> in general.  It wastes Icache in another way.  -march affects this a lot,
> and -march=higher is often worse since the compiler doesn't really understand
> this and does more or less alignment than is good.  Here it didn't bother
> aligning even 1 of the branch targets.
> 
> >   0xffffffff8084ecc8 <+56>:	mov    0x24(%rbx),%esi
> >   0xffffffff8084eccb <+59>:	mov    0x10(%rbx),%rdx
> >   0xffffffff8084eccf <+63>:	mov    %r14,%rdi
> >   0xffffffff8084ecd2 <+66>:	mov    %r15,%rcx
> 
> The main code ends up just +17 later at <+69>.  Fitting it all below <+64>
> would be better but is not necessary if the instructions are slow enough
> to all the instruction fetch to run ahead.  This depends on good prediction
> to not start too many instructions in paths that won't be followed.
> 
> >   0xffffffff8084ecd5 <+69>:	callq  0xffffffff80850000 <vn_close>
> >   0xffffffff8084ecda <+74>:	mov    %eax,%r15d
> >   0xffffffff8084ecdd <+77>:	mov    0x24(%rbx),%eax
> >   0xffffffff8084ece0 <+80>:	test   $0x40,%ah
> >   0xffffffff8084ece3 <+83>:	je     0xffffffff8084ed45 <vn_closefile+181>
> >   0xffffffff8084ece5 <+85>:	movzwl 0x20(%rbx),%eax
> >   0xffffffff8084ece9 <+89>:	cmp    $0x1,%eax
> >   0xffffffff8084ecec <+92>:	jne    0xffffffff8084ed45 <vn_closefile+181>
> >   0xffffffff8084ecee <+94>:	movw   $0x0,-0x22(%rbp)
> > [skip some parts, +181 marks the beginning of cleanup]
> >   0xffffffff8084ed52 <+194>:	retq
> 
> Bruce