svn commit: r312600 - head/sys/kern
Konstantin Belousov
kostikbel at gmail.com
Sun Jan 22 17:10:33 UTC 2017
On Mon, Jan 23, 2017 at 04:00:19AM +1100, Bruce Evans wrote:
> On Sun, 22 Jan 2017, Konstantin Belousov wrote:
>
> > On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
> >> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
> >>> ...
> >>> I have to disagree about the usefulness remark. If you check generated
> >>> assembly for amd64 (included below), you will see the uncommon code is
> >>> moved out of the way and in particular there are no forward jumps in the
> >>> common case.
> >>
> >> Check benchmarks. A few cycles here and there are in the noise. Kernel
> >> code has very few possibilities for useful optimizations since it doesn't
> >> have many inner loops.
> >>
> >>> With __predict_false:
> >>>
> >>> [snip prologue]
> >>> 0xffffffff8084ecaf <+31>: mov 0x24(%rbx),%eax
> >>> 0xffffffff8084ecb2 <+34>: test $0x40,%ah
> >>> 0xffffffff8084ecb5 <+37>: jne 0xffffffff8084ece2 <vn_closefile+82>
> >>
> >> All that this does is as you say -- invert the condition to jump to the
> >> uncommon code. This made more of a difference on old CPUs (possibly
> >> still on low end/non-x86). Now branch predictors are too good for the
> >> slow case to be much slower.
> >>
> >> I think x86 branch predictors initially predict forward branches as not
> >> taken and backward branches as taken. So this branch was initially
> >> mispredicted, and the change fixes this. But the first branch doesn't
> >> really matter. Starting up takes hundreds or thousands of cycles for
> >> cache misses.
> > This is only true if branch predictor memory is large enough to keep the
> > state for the given branch between exercising it. Even if the predictor
> > state could be attached to every byte in the icache, or more likely,
> > every line in the icache or uop cache, it is still probably too small to
> > survive between user->kernel transitions for syscalls. There might be a
> > performance counter which shows branch-predictor mispredictions.
> >
> > In other words, I suspect that almost all cases might be mispredictions
> > without a manual hint, and mispredictions together with the full pipeline
> > flush on a VFS-intensive load might very well account for tens of percent
> > of the total cycles on modern cores.
> >
> > Just speculation.
>
> Check benchmarks.
>
> I looked at the mis-prediction counts mainly for a networking micro-benchmark
> almost 10 years ago. They seemed to be among the least of the performance
> problems (the main ones were general bloat and cache misses). I think the
> branch-predictor caches on even 10-year old x86 are quite large, enough to
> hold tens or hundreds of syscalls. Otherwise performance would be lower
> than it is.
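Such a counter does exist: on FreeBSD, hwpmc(4) exposes branch events that
pmcstat(8) can count per process. A sketch of the kind of measurement Bruce
describes (the event aliases "branches" and "branch-mispredicts" and their
availability depend on the CPU; the dd workload is just an illustrative
syscall-heavy command):

```shell
# Count retired branches and mispredictions for a syscall-heavy workload.
# Requires the hwpmc kernel module; see pmc(3) for per-CPU event names.
pmcstat -p branches -p branch-mispredicts \
    dd if=/dev/zero of=/dev/null count=100000
```

Comparing the two counts before and after adding __predict_false() hints
would settle the speculation above with data rather than guesswork.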
>
> Testing shows that the cache size is about 2048 on Athlon-XP. I might be
> measuring just the size of the L1 Icache interacting with the branch
> predictor:
>
> The program is for i386 and needs some editing:
>
> X int
> X main(void)
> X {
> X asm(" \n\
> X pushal \n\
> X movl $192105,%edi \n\
>
> Set this to $(sysctl -n machdep.tsc_freq) / 10000 to count cycles easily.
>
> X 1: \n\
> X # beware of clobbering in padding \n\
> X pushal \n\
> X xorl %eax,%eax \n\
> X # repeat next many times, e.g., 2047 times on Athlon-xp \n\
> X jz 2f; .p2align 3; 2: \n\
>
> With up to 2048 branches, each branch takes 2 cycles on Athlon-XP.
> After that, each branch takes 10.8 cycles.
>
> I don't understand why the alignment is needed, but without it each branch
> takes 9 cycles instead of 2 starting with just 2 jz's.
My guess is that this is the predictor graining issue I alluded to earlier.
E.g., Agner Fog's manuals state that for K8/K10,
==============================
The branch prediction mechanism allows no more than three taken branches
for every aligned 16-byte block of code.
==============================
The benchmark does not check the cost of misprediction: since the test
consists only of the chain of branches, there is no speculative state
to unwind.
>
> "jmp" branches are not predicted any better than the always-taken "jz"
> branches. Alignment is needed similarly.
>
> Change "jz" to "jnz" to see the speed with branches never taken. This
> takes 2 cycles for any number of branches up to 8K when the L1 Icache
> runs out. Now the default prediction of not-taken is correct, so there
> are no mispredictions.
>
> The alignment costs 0.5 cycles with a small number of jnz's and 0.03
> cycles with a large number of jz's or jmp's. It helps with a large
> number of jnz's.
>
> X popal \n\
> X decl %edi \n\
> X jne 1b \n\
> X popal \n\
> X ");
> X return (0);
> X }
>
> Timing on Haswell:
> - Haswell only benefits slightly from the alignment and reaches full
> speed with ".p2align 2"
> - 1 cycle instead of 2 for branch-not-taken
> - 2.1 cycles instead of 2 minimum for branch-taken
> - predictor cache size 4K instead of 2K
> - 12 cycles instead of 10.8 for branches mispredicted by the default for
> more than 4K jz's.
>
> Bruce
More information about the svn-src-head
mailing list