svn commit: r252032 - head/sys/amd64/include

Gleb Smirnoff glebius at FreeBSD.org
Fri Jun 21 13:54:37 UTC 2013


  Bruce,

On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote:
B> Not if it is a 32-bit increment on 32-bit systems, as it should be.
B> 
B> I said to use a daemon to convert small (16 or 32 bit) counters into
B> larger (32 or 64 bit) ones.  It is almost as efficient to call the
B> accumulation code directly:
B> 
B>  	addl	%1,%fs:(%0)
B>  	jns	1f
B>  	call	counter_accum
B> 1:
B> 
B> This calls the accumulation code when the counter goes above 0x80000000.
B> The accumulation code then uses cmpxchg8b or a critical section to
B> atomically transfer everything in the small counter to the large
B> counter.  Note that the small counter remains valid at all times unless
B> it is incremented too fast.  Code that reports the counters must also
B> use cmpxchg8b or a critical section to read them atomically before
B> summing them.  The transfer window between when the small counter
B> reaches 0x80000000 and when it overflows is very large provided large
B> increments are never made (2 increments of 0x80000000 would break it
B> immediately, but an average increment of 1 takes about a minute to
B> reach 0x80000000 even if called from a loop doing nothing except the
B> increment).  Only 32-bit increments that are not too large are
B> supported.  Negative increments are handled by a heuristic in the
B> accumulator.  E.g., an increment of -1 just works unless it causes the
B> small counter's value to go from 0 to 0xffffffff.  0xffffffff is
B> detected by the sign test.  It must be interpreted as -1 to be added
B> to the large counter, not 0xffffffff.  The full heuristic would be
B> something like:
B> 
B>  	if (small_counter >= 0xf0000000) {
B>  		/* Subtract from large counter. */
B>  		/* Actually do this atomically, of course. */
B>  		large_counter += (int32_t)small_counter;
B>  		small_counter = 0;
B>  		/*
B>  		 * Add here sanity checks that the large counter never goes
B>  		 * negative.  This can only be afforded because this code
B>  		 * is rarely called (and is not inline).
B>  		 */
B>  	} else if (small_counter < 0x90000000) {
B>  		/* Add to large counter. */
B>  		large_counter += small_counter;
B>  		small_counter = 0;
B>  	} else {
B>  		panic("transfer window too small");
B>  	}
B> 
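
A rough C sketch of the accumulator described above (the names, types
and locking here are assumptions for illustration, not code from any
commit):

	#include <stdint.h>
	#include <stdlib.h>

	/* Hypothetical pair of a per-CPU small counter and its 64-bit total. */
	struct split_counter {
		uint32_t	sc_small;	/* bumped by the "addl; jns" fast path */
		int64_t		sc_large;	/* accumulated 64-bit value */
	};

	/*
	 * Called when the fast path sees the small counter go negative,
	 * i.e. reach 0x80000000.  Must run atomically with respect to the
	 * fast path (cmpxchg8b or a critical section); that locking is
	 * omitted here.
	 */
	static void
	counter_accum(struct split_counter *c)
	{
		uint32_t small = c->sc_small;

		if (small >= 0xf0000000u) {
			/* A small negative increment wrapped the counter. */
			c->sc_large += (int32_t)small;	/* adds a negative value */
			c->sc_small = 0;
		} else if (small < 0x90000000u) {
			/* Normal case: fold the accumulated value in. */
			c->sc_large += small;
			c->sc_small = 0;
		} else {
			/* Huge single increments are not supported. */
			abort();	/* stands in for panic("transfer window too small") */
		}
	}
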
B> > Entering critical section means modifying curthread, which is again
B> > a %gs based load & store. Exiting critical section has the same cost.
B> > Thus, we assume that doing a direct %gs based update on the counter
B> > is cheaper than critical_enter(); counter += foo; critical_exit();
B> 
B> Critical sections actually only modify td->td_critnest, so they do
B> a single %fs-based load and 2 %ds-based modifications (it's %fs for
B> i386 and %gs for amd64).  So it's far from clear that the direct
B> update is faster than the generic i386 code.  Avoiding the generic
B> i386 code also takes an extra load, test and branch.  The generic
B> code reduces to something like the following instructions after inlining
B> the critical section:
B> 
B>  	movl	%fs:CURTHREAD,%ebx
B>  	incl	TD_CRITNEST(%ebx)
B>  	movl	counter_arg,%edi
B>  	movl	inc_lo,%eax
B>  	movl	inc_hi,%edx
B>  	addl	%eax,%fs:(%edi)
B>  	adcl	%edx,%fs:4(%edi)
B> 1:
B>  	decl	TD_CRITNEST(%ebx)
B> 
B> while the non-generic code for the best case when cmpxchg8b is supported is
B> something like:
B> 
B>  	testl	$CPUID_CX8,cpu_feature
B>  	je	2f
B>  	movl	counter_arg,%esi
B>  	movl	ptr_to_inc,%edi	# might need more insns to init *ptr
B>  	// 8 instructions here from the asm string.
B>  	jmp	3f		# actually rearrange to jump in other case
B> 2:
B>  	// Use critical section.
B> 3:
B> 
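
For what it's worth, the cmpxchg8b loop itself can also be expressed
through a compiler builtin: with -march=i586 or later, a 32-bit x86
compiler lowers the compare-and-swap below to a cmpxchg8b retry loop.
This is only an illustration of the technique, not the committed asm,
and the per-CPU %fs segment override cannot be expressed in plain C:

	#include <stdint.h>

	static inline void
	counter_add_cas(volatile uint64_t *c, uint64_t inc)
	{
		uint64_t old;

		/* Retry until the 8-byte compare-and-swap succeeds. */
		do {
			old = *c;
		} while (!__sync_bool_compare_and_swap(c, old, old + inc));
	}
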
B> 
B> So using cmpxchg8b takes at least 4 more instructions than using a
B> critical section.  The incl/decl instructions on td_critnest are rather
B> slow, however.  According to my test, they take 6 cycles each.  (I
B> think they can be reduced to load; modify; store; ... store back
B> original value which should take more like 9 cycles.)  This probably
B> applies to the addl and adcl that do the work too.  12 cycles for them,
B> and just 2 more cycles (pipelined) for cmpxchg8b and all the other
B> setup instructions.
B> 
B> Now, how does the performance aspect work?  Is it by cmpxchg8b causing
B> less contention than direct addl to memory?  If so, then hopefully
B> plain cmpxchg does the same.
B> 
B> The non-cmpxchg version of the i386 code can probably be improved by
B> adding more instructions to it, to avoid doing the adcl to memory
B> in most cases.  Even if you support 64-bit increments, they will rarely
B> be used, so the branches for this will be predicted well:
B> 
B>  	// Next 4 instructions as above:
B>  	movl	counter_arg,%edi
B>  	movl	inc_lo,%eax
B>  	movl	inc_hi,%edx
B>  	addl	%eax,%fs:(%edi)
B>  	jnc	1f			# usually no carry
B>  	adcl	%edx,%fs:4(%edi)
B>  	jmp	3f
B> 1:
B>  	testl	%edx,%edx
B>  	je	3f			# usually no carry and no high bits
B>  	addl	%edx,%fs:4(%edi)
B> 3:
B> 
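
For clarity, the branch-over-the-high-word idea above rendered in C
(an illustrative sketch only; the per-CPU segment override is again
omitted and the names are made up):

	#include <stdint.h>

	static inline void
	counter_add_split(uint32_t *lo, uint32_t *hi, uint64_t inc)
	{
		uint32_t inc_lo = (uint32_t)inc;
		uint32_t inc_hi = (uint32_t)(inc >> 32);
		uint32_t old_lo = *lo;

		*lo = old_lo + inc_lo;
		if (*lo < old_lo)		/* carry out of the low word */
			*hi += inc_hi + 1;	/* the adcl path */
		else if (inc_hi != 0)		/* no carry, but high bits present */
			*hi += inc_hi;		/* plain addl on the high word */
		/* Usual case: no carry, no high bits, high word never touched. */
	}
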
B> I wonder if cmpxchg8b is smart enough to do this automatically, and that is
B> the reason it is faster.  It could avoid writing to the high word if the
B> high word doesn't change.  Modern x86 has write buffers that help it
B> do 64-bit write accesses in 32-bit mode no slower than in 64-bit mode, provided
B> the CPU can issue store instructions fast enough.  Caches do the same
B> thing for load instructions on not-so-modern x86.  But often, the CPU
B> can't issue load/store instructions fast enough.  I suspect that the
B> addl mem; adcl 4+mem sequence is a bad way to do a 64-bit increment, and
B> load 64 bits; modify; store 64 bits is better.  cmpxchg8b can also issue
B> a 64-bit store.
B> 
B> Testing gave the following results:
B> - jumping over adcl as above saves almost the full cost of an adcl
B> - load 64 bits, unconditional 64-bit add from memory to registers; store 64
B>    bits is about 25% faster than addl; adcl to memory.  Testing now on
B>    Athlon64.  addl and adcl to memory each take about 7 cycles; 14 for
B>    both; only 11 for doing the 64-bit add via registers.  Gcc is not
B>    smart about this.  It generates addl; adcl to memory.  Your asm has
B>    to do the increment in registers for technical reasons, so it does
B>    better than gcc almost (?) accidentally.  But clang is smart about
B>    this.  It generates the larger code that goes through registers even
B>    with -mi386 where I doubt that it is better (original i386 had no
B>    cache and slow I-fetch, so I think it strongly preferred fewer
B>    instructions).  Your code for the critical section case doesn't use
B>    asm, so it is slower than necessary with gcc.  When you found that
B>    simple increments were slower in userland tests, was that with gcc
B>    or clang?
B> - jumping over adcl in the second method didn't seem to work.
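
To make the gcc/clang comparison in the results above concrete: the
statement both compilers are translating is just the plain 64-bit
increment below, and per the measurements quoted, gcc emits the
addl/adcl-to-memory form while clang goes through registers (load,
64-bit add, store):

	#include <stdint.h>

	static inline void
	counter_add_plain(uint64_t *c, uint64_t inc)
	{
		/* gcc: addl/adcl directly to memory;
		 * clang: load 64 bits, add in registers, store 64 bits. */
		*c += inc;
	}
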

Maybe you are right and an implementation with a "daemon" would be
more efficient for i386. But a daemon isn't needed at all on 64-bit
platforms, so I doubt it is worth coding it.

Frankly speaking, we no longer consider i386 a platform for building
high-performance solutions. The need for counter(9) in the TCP/IP
stack was demonstrated on modern amd64 hardware. Thus, our motivation
was to make:

  - a lossless and faster implementation for amd64
  - a lossless implementation for !amd64
  - if we have an option to avoid the critical section on some arch,
    then let's avoid it.

-- 
Totus tuus, Glebius.

