svn commit: r312600 - head/sys/kern
Konstantin Belousov
kostikbel at gmail.com
Sun Jan 22 17:10:33 UTC 2017
On Mon, Jan 23, 2017 at 04:00:19AM +1100, Bruce Evans wrote:
> On Sun, 22 Jan 2017, Konstantin Belousov wrote:
>
> > On Sun, Jan 22, 2017 at 11:41:09PM +1100, Bruce Evans wrote:
> >> On Sun, 22 Jan 2017, Mateusz Guzik wrote:
> >>> ...
> >>> I have to disagree about the usefulness remark. If you check generated
> >>> assembly for amd64 (included below), you will see the uncommon code is
> >>> moved out of the way and in particular there are no forward jumps in the
> >>> common case.
> >>
> >> Check benchmarks. A few cycles here and there are in the noise. Kernel
> >> code has very few possibilities for useful optimizations since it doesn't
> >> have many inner loops.
> >>
> >>> With __predict_false:
> >>>
> >>> [snip prologue]
> >>> 0xffffffff8084ecaf <+31>: mov 0x24(%rbx),%eax
> >>> 0xffffffff8084ecb2 <+34>: test $0x40,%ah
> >>> 0xffffffff8084ecb5 <+37>: jne 0xffffffff8084ece2 <vn_closefile+82>
> >>
> >> All that this does is as you say -- invert the condition to jump to the
> >> uncommon code. This made more of a difference on old CPUs (possibly
> >> still on low end/non-x86). Now branch predictors are too good for the
> >> slow case to be much slower.
> >>
> >> I think x86 branch predictors initially predict forward branches as not
> >> taken and backward branches as taken. So this branch was initially
> >> mispredicted, and the change fixes this. But the first branch doesn't
> >> really matter. Starting up takes hundreds or thousands of cycles for
> >> cache misses.
> > This is only true if branch predictor memory is large enough to keep the
> > state for the given branch between exercising it. Even if the predictor
> > state could be attached to every byte in the icache, or more likely,
> > every line in the icache or uop cache, it is still probably too small to
> > survive between user->kernel transitions for syscalls. There might be a
> > performance counter which shows branch-predictor mispredictions.
> >
> > In other words, I suspect that almost all cases might be mispredictions
> > without a manual hint, and mispredictions together with the full pipeline
> > flush on a VFS-intensive load might very well account for tens of percent
> > of the total cycles on modern cores.
> >
> > Just speculation.
>
> Check benchmarks.
>
> I looked at the mis-prediction counts mainly for a networking micro-benchmark
> almost 10 years ago. They seemed to be among the least of the performance
> problems (the main ones were general bloat and cache misses). I think the
> branch-predictor caches on even 10-year old x86 are quite large, enough to
> hold tens or hundreds of syscalls. Otherwise performance would be lower
> than it is.
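Such a counter does exist: on FreeBSD, hwpmc(4) exposes branch events that
pmcstat(8) can count per process. A sketch of the kind of measurement Bruce
describes (the event aliases "branches" and "branch-mispredicts" and their
availability depend on the CPU; the dd workload is just an illustrative
syscall-heavy command):

```shell
# Count retired branches and mispredictions for a syscall-heavy workload.
# Requires the hwpmc kernel module; see pmc(3) for per-CPU event names.
pmcstat -p branches -p branch-mispredicts \
    dd if=/dev/zero of=/dev/null count=100000
```

Comparing the two counts before and after adding __predict_false() hints
would settle the speculation above with data rather than guesswork.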
>
> Testing shows that the cache size is about 2048 on Athlon-XP. I might be
> measuring just the size of the L1 Icache interacting with the branch
> predictor:
>
> The program is for i386 and needs some editing:
>
> X int
> X main(void)
> X {
> X asm(" \n\
> X pushal \n\
> X movl $192105,%edi \n\
>
> Set this to $(sysctl -n machdep.tsc_freq) / 10000 to count cycles easily.
>
> X 1: \n\
> X # beware of clobbering in padding \n\
> X pushal \n\
> X xorl %eax,%eax \n\
> X # repeat next many times, e.g., 2047 times on Athlon-xp \n\
> X jz 2f; .p2align 3; 2: \n\
>
> With up to 2048 branches, each branch takes 2 cycles on Athlon-XP.
> After that, each branch takes 10.8 cycles.
>
> I don't understand why the alignment is needed, but without it each branch
> takes 9 cycles instead of 2 starting with just 2 jz's.
My guess is that this is the predictor graining issue I alluded to earlier.
E.g., Agner Fog's manuals state that for K8/K10,
==============================
The branch prediction mechanism allows no more than three taken branches
for every aligned 16-byte block of code.
==============================
The benchmark does not check the cost of misprediction: since the test
consists only of the chain of branches, there is no speculative state
to unwind.
>
> "jmp" branches are not predicted any better than the always-taken "jz"
> branches. Alignment is needed similarly.
>
> Change "jz" to "jnz" to see the speed with branches never taken. This
> takes 2 cycles for any number of branches up to 8K when the L1 Icache
> runs out. Now the default prediction of not-taken is correct, so there
> are no mispredictions.
>
> The alignment costs 0.5 cycles with a small number of jnz's and 0.03
> cycles with a large number of jz's or jmp's. It helps with a large
> number of jnz's.
>
> X popal \n\
> X decl %edi \n\
> X jne 1b \n\
> X popal \n\
> X ");
> X return (0);
> X }
>
> Timing on Haswell:
> - Haswell only benefits slightly from the alignment and reaches full
> speed with ".p2align 2"
> - 1 cycle instead of 2 for branch-not-taken
> - 2.1 cycles instead of 2 minimum for branch-taken
> - predictor cache size 4K instead of 2K
> - 12 cycles instead of 10.8 for branches mispredicted by the default for
> more than 4K jz's.
>
> Bruce
More information about the svn-src-head
mailing list