svn commit: r312600 - head/sys/kern

Bruce Evans brde at optusnet.com.au
Sun Jan 22 17:00:28 UTC 2017


On Sun, 22 Jan 2017, Konstantin Belousov wrote:

> On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
>> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
>>> ...
>>> I have to disagree about the usefulness remark. If you check generated
>>> assembly for amd64 (included below), you will see the uncommon code is
>>> moved out of the way and in particular there are no forward jumps in the
>>> common case.
>>
>> Check benchmarks.  A few cycles here and there are in the noise.  Kernel
>> code has very few possibilities for useful optimizations since it doesn't
>> have many inner loops.
>>
>>> With __predict_false:
>>>
>>> [snip prologue]
>>>   0xffffffff8084ecaf <+31>:	mov    0x24(%rbx),%eax
>>>   0xffffffff8084ecb2 <+34>:	test   $0x40,%ah
>>>   0xffffffff8084ecb5 <+37>:	jne    0xffffffff8084ece2 <vn_closefile+82>
>>
>> All that this does is as you say -- invert the condition to jump to the
>> uncommon code.  This made more of a difference on old CPUs (possibly
>> still on low end/non-x86).  Now branch predictors are too good for the
>> slow case to be much slower.
>>
>> I think x86 branch predictors initially predict forward branches as not
>> taken and backward branches as taken.  So this branch was initially
>> mispredicted, and the change fixes this.  But the first branch doesn't
>> really matter.  Starting up takes hundreds or thousands of cycles for
>> cache misses.
> This is only true if the branch-predictor memory is large enough to keep
> the state for the given branch between exercises of it.  Even if the
> predictor state could be attached to every byte in the icache, or more
> likely, every line in the icache or uop cache, it is still probably too
> small to survive between user->kernel transitions for syscalls.  There
> might be a performance counter which shows branch-predictor
> mis-predictions.
>
> In other words, I suspect that almost all cases might be mis-predictions
> without the manual hint, and mis-predictions together with the full
> pipeline flush on a VFS-intensive load might very well account for tens
> of percent of the total cycles on modern cores.
>
> Just speculation.

Check benchmarks.

I looked at the mis-prediction counts mainly for a networking micro-benchmark
almost 10 years ago.  They seemed to be among the least of the performance
problems (the main ones were general bloat and cache misses).  I think the
branch-predictor caches on even 10-year-old x86 CPUs are quite large, enough
to hold state for tens or hundreds of syscalls.  Otherwise performance would
be lower than it is.

Testing shows that the cache size is about 2048 on Athlon-XP.  I might be
measuring just the size of the L1 Icache interacting with the branch
predictor:

The program is for i386 and needs some editing:

X int
X main(void)
X {
X 	asm("				\n\
X 	pushal				\n\
X 	movl	$192105,%edi		\n\

Set this to $(sysctl -n machdep.tsc_freq) / 10000 to count cycles easily.

X 1:					\n\
X 	# beware of clobbering in padding	\n\
X 	pushal				\n\
X 	xorl	%eax,%eax		\n\
X 	# repeat the next line many times, e.g., 2047 times on Athlon-XP	\n\
X 	jz	2f; .p2align 3; 2:	\n\

With up to 2048 branches, each branch takes 2 cycles on Athlon-XP.
After that, each branch takes 10.8 cycles.

I don't understand why the alignment is needed, but without it each branch
takes 9 cycles instead of 2, starting with just 2 jz's.

"jmp" branches are not predicted any better than the always-taken "jz"
branches.  Alignment is needed similarly.

Change "jz" to "jnz" to see the speed with branches never taken.  This
takes 2 cycles for any number of branches up to 8K when the L1 Icache
runs out.  Now the default prediction of not-taken is correct, so there
are no mispredictions.

The alignment costs 0.5 cycles with a small number of jnz's and 0.03
cycles with a large number of jz's or jmp's.  It helps with a large
number of jnz's.

X 	popal				\n\
X 	decl	%edi			\n\
X 	jne	1b			\n\
X 	popal				\n\
X 	");
X 	return (0);
X }

Timing on Haswell:
- Haswell only benefits slightly from the alignment and reaches full
   speed with ".p2align 2"
- 1 cycle instead of 2 for branch-not-taken
- 2.1 cycles instead of 2 minimum for branch-taken
- predictor cache size 4K instead of 2K
- 12 cycles instead of 10.8 for branches mispredicted by the default for
   more than 4K jz's.

Bruce

