svn commit: r252032 - head/sys/amd64/include

Bruce Evans brde at optusnet.com.au
Fri Jun 21 11:09:04 UTC 2013


On Fri, 21 Jun 2013, Gleb Smirnoff wrote:

> On Fri, Jun 21, 2013 at 09:04:34AM +1000, Bruce Evans wrote:
> B> >> The i386 version of the counter asm doesn't support the immediate
> B> >> constraint for technical reasons.  64 bit counters are too large and
> B> >> slow to use on i386, especially when they are implemented as they are
> B> >> without races.
> B> >
> B> > Actual testing showed that it is only about twice as slow as a direct
> B> > increment.  With the enclosed test program (a userland version hacked
> B> > on a bit to avoid pcpu), on ref10-i386 the times are:
> ...
> Yes, for a single threaded userland program using "+=" is faster than
> all the magic that counter(9) does.
>
> But when multiple threads need to access one counter "+=" fails both
> with precision and with performance.

No.  But I forgot about the performance aspect.

> Using "+=" upon a per-CPU counter is racy, since += is compiled into
> "load", "increment", "store" sequence and if we are not in a critical
> section, then this is racy. We might be removed from CPU between load
> and store.

Not if it is a 32-bit increment on 32-bit systems, as it should be.

I said to use a daemon to convert small (16 or 32 bit) counters into
larger (32 or 64 bit) ones.  It is almost as efficient to call the
accumulation code directly:

 	addl	%1,%fs:(%0)	# add increment to the per-CPU small counter
 	jns	1f		# sign bit still clear: fast path, done
 	call	counter_accum	# flush the small counter into the large one
1:

This calls the accumulation code when the counter reaches 0x80000000
(i.e., when its sign bit becomes set).
The accumulation code then uses cmpxchg8b or a critical section to
atomically transfer everything in the small counter to the large
counter.  Note that the small counter remains valid at all times unless
it is incremented too fast.  Code that reports the counters must also
use cmpxchg8b or a critical section to read them atomically before
summing them.  The transfer window between when the small counter
reaches 0x80000000 and when it overflows is very large provided large
increments are never made (2 increments of 0x80000000 would break it
immediately, but an average increment of 1 takes about a minute to
reach 0x80000000 even if called from a loop doing nothing except the
increment).  Only 32-bit increments that are not too large are
supported.  Negative increments are handled by a heuristic in the
accumulator.  E.g., an increment of -1 just works unless it causes the
small counter's value to go from 0 to 0xffffffff.  0xffffffff is
detected by the sign test.  It must be interpreted as -1 to be added
to the large counter, not 0xffffffff.  The full heuristic would be
something like:

 	if (small_counter >= 0xf0000000) {
 		/* Subtract from large counter. */
 		/* Actually do this atomically, of course. */
 		large_counter += (int32_t)small_counter;
 		small_counter = 0;
 		/*
 		 * Add here sanity checks that the large counter never goes
 		 * negative.  This can only be afforded because this code
 		 * is rarely called (and is not inline).
 		 */
 	} else if (small_counter < 0x90000000) {
 		/* Add to large counter. */
 		large_counter += small_counter;
 		small_counter = 0;
 	} else {
 		panic("transfer window too small");
 	}
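
For the reporting side, something like the following would do (a rough
sketch with made-up names and layout, not the real counter(9) API; the
atomic-read requirement is glossed over, and the small counters are
sign-extended so that the -1 case above sums correctly):

 	struct mycounter {
 		uint64_t	large;		/* accumulated 64-bit value */
 		uint32_t	small[MAXCPU];	/* per-CPU small counters */
 	};

 	static uint64_t
 	counter_fetch(struct mycounter *c)
 	{
 		uint64_t total;
 		int cpu;

 		/* Really read the large and small parts atomically. */
 		total = c->large;
 		for (cpu = 0; cpu < mp_ncpus; cpu++)
 			total += (int32_t)c->small[cpu];	/* sign-extend */
 		return (total);
 	}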

> Entering critical section means modifying curthread, which is again
> a %gs based load & store. Exiting critical section has the same cost.
> Thus, we assume that doing a direct %gs based update on the counter
> is cheaper than critical_enter(); counter += foo; critical_exit();

Critical sections actually only modify td->td_critnest, so they do
a single %fs-based load and 2 %ds-based modifications (it's %fs for
i386 and %gs for amd64).  So it's far from clear that the direct
update is faster than the generic i386 code.  Avoiding the generic
i386 code also takes an extra load, test and branch.  The generic
code reduces to something like the following instructions after inlining
the critical section:

 	movl	%fs:CURTHREAD,%ebx	# critical_enter() after inlining
 	incl	TD_CRITNEST(%ebx)
 	movl	counter_arg,%edi
 	movl	inc_lo,%eax
 	movl	inc_hi,%edx
 	addl	%eax,%fs:(%edi)		# 64-bit add to the per-CPU counter
 	adcl	%edx,%fs:4(%edi)
1:
 	decl	TD_CRITNEST(%ebx)	# critical_exit()

while the non-generic code for the best case when cmpxchg8b is supported is
something like:

 	testl	$CPUID_CX8,cpu_feature
 	je	2f
 	movl	counter_arg,%esi
 	movl	ptr_to_inc,%edi	# might need more insns to init *ptr
 	// 8 instructions here from the asm string.
 	jmp	3f		# actually rearrange to jump in other case
2:
 	// Use critical section.
3:
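
At the C level this dispatch amounts to roughly the following (a sketch
with approximate names; not necessarily the exact counter(9) source):

 	static inline void
 	counter_u64_add(counter_u64_t c, int64_t inc)
 	{

 		if ((cpu_feature & CPUID_CX8) != 0) {
 			/* The cmpxchg8b loop (the 8 instructions from the asm). */
 			counter_64_inc_8b(c, inc);
 		} else {
 			critical_enter();			/* incl TD_CRITNEST */
 			*(uint64_t *)zpcpu_get(c) += inc;	/* addl; adcl */
 			critical_exit();			/* decl TD_CRITNEST */
 		}
 	}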


So using cmpxchg8b takes at least 4 more instructions than using a
critical section.  The incl/decl instructions on td_critnest are rather
slow, however.  According to my test, they take 6 cycles each.  (I
think they can be reduced to a load; modify; store at entry plus a plain
store of the original value back at exit, which should take more like 9
cycles total.)  This probably
applies to the addl and adcl that do the work too.  12 cycles for them,
and just 2 more cycles (pipelined) for cmpxchg8b and all the other
setup instructions.

Now, how does the performance aspect work?  Is it by cmpxchg8b causing
less contention than direct addl to memory?  If so, then hopefully
plain cmpxchg does the same.

The non-cmpxchg version of the i386 code can probably be improved by
adding more instructions to it, to avoid doing the adcl to memory
in most cases.  Even if you support 64-bit increments, they will rarely
be used, so the branches for this will be predicted well:

 	// Next 4 instructions as above:
 	movl	counter_arg,%edi
 	movl	inc_lo,%eax
 	movl	inc_hi,%edx
 	addl	%eax,%fs:(%edi)
 	jnc	1f			# usually no carry
 	adcl	%edx,%fs:4(%edi)
 	jmp	3f
1:
 	testl	%edx,%edx
 	je	3f			# usually no carry and no high bits
 	addl	%edx,%fs:4(%edi)
3:
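
The same branch structure in C, just to show the intent (illustrative
only: unlike the single addl instruction, this split load/store sequence
could be torn by preemption, so the real code stays in asm or inside a
critical section):

 	static inline void
 	add64_skip_high(uint32_t *lo, uint32_t *hi, uint32_t inc_lo,
 	    uint32_t inc_hi)
 	{
 		uint32_t old_lo;

 		old_lo = *lo;
 		*lo = old_lo + inc_lo;
 		if (*lo < old_lo)		/* carry out of the low word */
 			*hi += inc_hi + 1;
 		else if (inc_hi != 0)		/* usual case: take neither branch */
 			*hi += inc_hi;
 	}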

I wonder if cmpxchg8b is smart enough to do this automatically, and that is
the reason it is faster.  It could avoid writing to the high word if the
high word doesn't change.  Modern x86 has write buffers that help it
do 64-bit write accesses in 32-bit mode no slower than in 64-bit mode, provided
the CPU can issue store instructions fast enough.  Caches do the same
thing for load instructions on not-so-modern x86.  But often, the CPU
can't issue load/store instructions fast enough.  I suspect that the
addl mem; adcl 4+mem sequence is a bad way to do a 64-bit increment, and
load 64 bits; modify; store 64 bits is better.  cmpxchg8b can also issue
a 64-bit store.
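
For reference, the two code shapes for a plain "*p += inc" on a 64-bit
value in 32-bit mode look roughly like this (the asm in the comment is
only an approximation of typical gcc/clang output; register allocation
varies):

 	/*
 	 * Read-modify-write on memory:	Through registers:
 	 *	addl	%eax,(%edi)		movl	(%edi),%ecx
 	 *	adcl	%edx,4(%edi)		movl	4(%edi),%ebx
 	 *					addl	%eax,%ecx
 	 *					adcl	%edx,%ebx
 	 *					movl	%ecx,(%edi)
 	 *					movl	%ebx,4(%edi)
 	 */
 	void
 	add64(uint64_t *p, uint64_t inc)
 	{

 		*p += inc;
 	}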

Testing gave the following results:
- jumping over adcl as above saves almost the full cost of an adcl
- load 64 bits, unconditional 64-bit add from memory to registers; store 64
   bits back is about 25% faster than addl; adcl to memory.  Testing now on
   Athlon64.  addl and adcl to memory each take about 7 cycles; 14 for
   both; only 11 for doing the 64-bit add via registers.  Gcc is not
   smart about this.  It generates addl; adcl to memory.  Your asm has
   to do the increment in registers for technical reasons, so it does
   better than gcc almost (?) accidentally.  But clang is smart about
   this.  It generates the larger code that goes through registers even
   with -mi386 where I doubt that it is better (original i386 had no
cache and slow I-fetch, so I think it strongly preferred fewer
   instructions).  Your code for the critical section case doesn't use
   asm, so it is slower than necessary with gcc.  When you found that
   simple increments were slower in userland tests, was that with gcc
   or clang?
- jumping over adcl in the second method didn't seem to work.
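
A minimal sketch of the sort of userland timing loop I mean (not the
enclosed test program) is:

 	#include <stdint.h>
 	#include <stdio.h>

 	static volatile uint64_t counter;

 	static inline uint64_t
 	rdtsc(void)
 	{
 		uint32_t lo, hi;

 		__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
 		return ((uint64_t)hi << 32 | lo);
 	}

 	int
 	main(void)
 	{
 		uint64_t start, end;
 		int i, n = 100000000;

 		start = rdtsc();
 		for (i = 0; i < n; i++)
 			counter += 1;	/* addl; adcl, or via registers (-m32) */
 		end = rdtsc();
 		printf("%.1f cycles/iteration\n", (double)(end - start) / n);
 		return (0);
 	}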

Bruce

