svn commit: r336025 - in head/sys: amd64/include i386/include

Fri Jul 6 17:55:49 UTC 2018

On Fri, 6 Jul 2018, John Baldwin wrote:

> On 7/6/18 8:52 AM, Rodney W. Grimes wrote:
>> ...
>> Trivial to fix this with
>> +#if defined(SMP) || !defined(_KERNEL) || defined(KLD_MODULE) || !defined(KLD_UP_MODULES)
>
> This is not worth it.  Note that we already use LOCK always in userland
> which is probably far more prevalent than the use in modules.
>
> Previously atomics in modules were _function calls_ just to avoid the LOCK.
> Having the LOCK prefix present even on UP is probably far more efficient
> than a function call.

No, the lock prefix is less efficient.

IIRC, on very old systems (~PPro), lock prefixes cost 20 cycles in the UP
case.  On AthlonXP, they cost about 19 cycles, but function calls (written
in C) only cost about 6 cycles.  This depends on pipelining, and my
test is perhaps too simple since it uses a loop where the pipelining
works especially well (it executes 2 or 3 function calls in parallel).

Actual timing on AthlonXP UP:
- asm loop: 2 cycles/iteration
- "incl mem" in asm loop: 5.85 cycles (but with less alignment, only 3.25
   cycles)
- "lock; incl mem" in asm loop: 18.9 cycles
- function call in C loop to C function doing "incl mem" in asm: 8.35 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 24.95
   cycles.

Newer CPUs have better pipelining.  On Haswell, this gives the strange
behaviour that the function call written in C is slightly faster than
inline code written in asm:

Actual timing on Haswell SMP:
- asm loop: 1.16 cycles/iteration
- "incl mem" in asm loop: 6.95 cycles
- "lock; incl mem" in asm loop: 19.00 cycles
- function call in C loop to C function doing "incl mem" in asm: 6 cycles
- function call in C loop to C function doing "lock; incl mem" in asm: 26.00
   cycles.
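
As a rough reading of the two tables above, the per-operation costs can be
estimated by subtracting the plain "incl mem" time from the other rows.  The
sketch below only restates the numbers quoted in this message; the struct
and function names are made up for illustration, and simple subtraction
ignores how pipelining overlaps the costs, so the results are estimates.

```c
/* Cycles per iteration, copied from the timings listed in the message. */
struct timing {
	double loop;          /* empty asm loop */
	double inc;           /* "incl mem" inline */
	double lock_inc;      /* "lock; incl mem" inline */
	double call_inc;      /* C function call doing "incl mem" */
	double call_lock_inc; /* C function call doing "lock; incl mem" */
};

const struct timing athlon_xp = { 2.0, 5.85, 18.9, 8.35, 24.95 };
const struct timing haswell = { 1.16, 6.95, 19.00, 6.0, 26.00 };

/* Extra cycles attributed to the lock prefix (inline case). */
double
lock_cost(const struct timing *t)
{
	return (t->lock_inc - t->inc);
}

/* Extra cycles attributed to the function call (unlocked case). */
double
call_cost(const struct timing *t)
{
	return (t->call_inc - t->inc);
}
```

By this estimate the lock prefix adds about 13 cycles on AthlonXP and about
12 on Haswell, while the call adds 2.5 cycles on AthlonXP and slightly less
than nothing (about -1) on Haswell.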

The C code with the function call executes:

loop:
 	call	incl
 	incl:
 		pushl	%ebp
 		movl	%esp,%ebp
 		[lock;] incl mem
 		leave
 		ret
 	incl	%ebx
 	cmpl	$4080000000-1,%ebx
 	jbe	loop

I didn't even compile with -fomit-frame-pointer or try clang, which would
do excessive unrolling.  Keeping the frame pointer takes 3 extra
instructions in incl, but these take no extra time.
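
The test program itself is not shown in this message.  A minimal sketch of
such a micro-benchmark, assuming a GCC-compatible compiler, might look like
the following.  All names here (bench, bench_kind, INC_PLAIN, INC_LOCKED,
call_inc_plain, call_inc_locked) are hypothetical, and the non-x86 branch
exists only so the sketch compiles elsewhere; its timings are not
comparable.

```c
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
/* The two instructions being compared, as inline asm. */
#define INC_PLAIN(p)  __asm__ __volatile__("incl %0" : "+m" (*(p)) : : "cc")
#define INC_LOCKED(p) __asm__ __volatile__("lock; incl %0" : "+m" (*(p)) : : "cc")
#else
/* Assumption: portable stand-ins for other architectures. */
#define INC_PLAIN(p)  ((void)(*(volatile uint32_t *)(p) += 1))
#define INC_LOCKED(p) ((void)__atomic_fetch_add((p), 1, __ATOMIC_SEQ_CST))
#endif

enum bench_kind { LOOP_ONLY, INLINE_PLAIN, INLINE_LOCKED, CALL_PLAIN, CALL_LOCKED };

/* Out-of-line versions; noinline keeps the call in the loop. */
__attribute__((noinline)) void
call_inc_plain(uint32_t *p)
{
	INC_PLAIN(p);
}

__attribute__((noinline)) void
call_inc_locked(uint32_t *p)
{
	INC_LOCKED(p);
}

/* Run n iterations of the chosen variant on *p; return CPU seconds used. */
double
bench(enum bench_kind kind, uint32_t *p, uint32_t n)
{
	uint32_t i;
	clock_t start = clock();

	switch (kind) {
	case LOOP_ONLY:
		for (i = 0; i < n; i++)
			__asm__ __volatile__("");	/* keep the empty loop */
		break;
	case INLINE_PLAIN:
		for (i = 0; i < n; i++)
			INC_PLAIN(p);
		break;
	case INLINE_LOCKED:
		for (i = 0; i < n; i++)
			INC_LOCKED(p);
		break;
	case CALL_PLAIN:
		for (i = 0; i < n; i++)
			call_inc_plain(p);
		break;
	case CALL_LOCKED:
		for (i = 0; i < n; i++)
			call_inc_locked(p);
		break;
	}
	return ((double)(clock() - start) / CLOCKS_PER_SEC);
}
```

For example, bench(INLINE_LOCKED, &counter, 4080000000u) uses the same
iteration count as the listing above.  Multiplying the returned seconds by
the CPU clock rate in Hz and dividing by n gives cycles per iteration; the
clock rate has to be supplied separately.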

In non-benchmark use, there would be more args for the function call and
the scheduling would be very different, so the timing might be very
different.  I expect the function call would be insignificantly slower
except in micro-benchmarks.

Bruce