svn commit: r238907 - projects/calloutng/sys/kern
Attilio Rao
attilio at freebsd.org
Sat Nov 24 15:43:29 UTC 2012
On Thu, Nov 8, 2012 at 5:26 PM, Bruce Evans <brde at optusnet.com.au> wrote:
> On Fri, 2 Nov 2012, Attilio Rao wrote:
>
>> On 10/29/12, Bruce Evans <brde at optusnet.com.au> wrote:
>>>
>>> On Mon, 29 Oct 2012, Attilio Rao wrote:
>>>
>>>> Now that sched_pin()/sched_unpin() are fixed I would like to introduce
>>>> this new patch, making critical_enter()/critical_exit() inline:
>>>> http://www.freebsd.org/~attilio/inline_critical.patch
>>>
>>> ...
>>>
>>> My version goes the other way and uninlines mtx_lock_spin() and
>>> mtx_unlock_spin(). Then it inlines (open codes) critical_enter() and
>>> ...
>>
>>
>> So, I thought more about this and I think that inlining
>> critical_exit() is not really going to bring any benefit here but
>> bloat.
>> This is because nested critical sections are rare rather not, which
>
>
> Rather rare !not? :-)
>
>
>> means you will end up doing the function call most of the time, and
>> in addition we have the possible pessimization introduced by the memory
>> clobber (__compiler_membar()) and, as you note, a possible deficiency
>> coming from branch misprediction*.
>
>
> This seems best.
>
> I see a point about the rareness of the branch in critical_exit().
> Not sure if it is the one you are making: since the nested case is
> rare, then the branch will normally be correctly predicted. If the
> function remains uninlined, then the branch still has a good chance
> of being correctly predicted. This depends on the nested case being
> so rare across all callers that it doesn't mess up the prediction
> by happening often in some caller. The branch predictor only
> has to maintain history for 1 branch for this. However, if the
> call is inlined and there are many callers, there might be too many
> branch sites to maintain history for them all.
Yes, that's basically the same conclusion I came up with.
It seems you are not opposed to this version of the patch.
I made some in-kernel benchmarks, but they didn't really show a
performance improvement bigger than 1-2%, which on an SMP system is
basically attributable to third-party effects.
However, we already employ the same code for sched_pin(), so I think
we can just commit this patch as-is.
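
To make the trade-off concrete, the shape is roughly the following (a
rough sketch rather than the literal patch; critical_exit_preempt() is
just an illustrative name for the uninlined slow path, not something in
the tree):

	/* Header: the unconditional fast path is small enough to inline. */
	static __inline void
	critical_enter(void)
	{

		curthread->td_critnest++;
		/* Keep the compiler from hoisting protected code above the bump. */
		__compiler_membar();
	}

	/*
	 * critical_exit() stays a real function in kern_switch.c: its
	 * common (non-nested) case still has to check td_owepreempt and
	 * possibly switch, so inlining it would duplicate that logic at
	 * every call site for no gain.
	 */
	void
	critical_exit(void)
	{
		struct thread *td;

		td = curthread;
		if (td->td_critnest == 1) {
			td->td_critnest = 0;
			if (td->td_owepreempt) {
				/*
				 * Run the preemption deferred by the section
				 * (thread_lock() + mi_switch() happen here).
				 */
				critical_exit_preempt();
			}
		} else
			td->td_critnest--;
	}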
>
>> I feel very unsure style-wise about my addition to sys/systm.h because
>> the file is already so messy that I cannot give a logical place to
>> that chunk.
>
>
> It is less bad than most places. Some of the inlines in libkern.h
> are misplaced. However, the bitcount inlines in systm.h belong
> closer to libkern (or /dev/null). The split between systm.h and
> kernel.h is bogus. But systemy things like critical* don't
> belong near libkern.
>
>
>> * In theory we could use __predict_false() or similar but I don't
>> think this will really help as x86 doesn't have explicit instructions
>> to control branch prediction
>
>
> It can help as follows on x86:
> - modern x86 has a default for when the branch predictor is cold. The
> compiler should arrange the code so that the default matches the
> usual case. Compilers can either follow the code order or guess
> which branch is usual (-fguess-branch-probability IIRC). The
> branches in critical_exit() and x86 spinlock_enter() seem to have
> a good order for the first point (the cold default), but might still
> be guessed wrong by the compiler. The branch in spinlock_exit() seems
> to be in a bad order for the first.
> This doesn't matter if the CPU's branch predictor stays warm.
> - compilers can help the CPU's branch predictor and instruction prefetcher
> by moving the unusual case far away. This helps mainly for large code.
> The code in critical_exit() is not large, except possibly if it is
> inlined.
>
> But __predict_false() never did much for me, perhaps because I test mainly
> non-large code in loops where branch predictors stay warm.
Shouldn't large code be less likely to benefit from correct branch
prediction? As pipelines are not huge, mispredicting a branch should
have little effect if the jump target is far enough away not to be
prefetched anyway.
Anyway, I have never seen __predict_false() give a measurable
performance bump on x86, nor does x86 have explicit instructions to
control branch prediction. That's what I actually wanted to say.
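
For reference, __predict_false() is purely a compile-time layout hint:
sys/cdefs.h maps it to GCC's __builtin_expect(), so the most it can do
is what you describe, i.e. let the compiler move the unlikely path out
of the straight-line code. Applied to the rare branch in the sketch
above (again only an illustration):

	/* From sys/cdefs.h: the hint emits no instruction by itself. */
	#define	__predict_true(exp)	__builtin_expect((exp), 1)
	#define	__predict_false(exp)	__builtin_expect((exp), 0)

	/*
	 * The compiler should keep the no-preemption case as the
	 * fall-through path and move the call out of line, which also
	 * matches the usual cold-predictor heuristic on modern x86
	 * (forward conditional branches predicted not taken).
	 */
	if (__predict_false(td->td_owepreempt))
		critical_exit_preempt();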
Attilio
--
Peace can only be achieved by understanding - A. Einstein