svn commit: r252032 - head/sys/amd64/include

Fri Jun 21 06:49:10 UTC 2013

  Bruce,

On Fri, Jun 21, 2013 at 09:04:34AM +1000, Bruce Evans wrote:
B> >> The i386 version of the counter asm doesn't support the immediate
B> >> constraint for technical reasons.  64 bit counters are too large and
B> >> slow to use on i386, especially when they are implemented as they are
B> >> without races.
B> >
B> > Actual testing showed that it is only about twice as slow as a direct
B> > increment.  With the enclosed test program (a userland version hacked
B> > on a bit to avoid pcpu), on ref10-i386 the times are:
B> > - loop overhead:                                        1 cycle
B> > - direct unlocked increment of a uint32_t:              6 cycles
B> > - direct unlocked increment of a uint64_t:              7 cycles
B> > - non-inline function unlocked increment of a uint64_t: 7.5 cycles
B> > - counter_u64_add():                                   14 cycles
B> > - non-inline counter_u64_add():                        18 cycles
B> > ...
B> 
B> Actually enclosing the test program:
B> 
B> % #include <stdint.h>
B> % #include <stdio.h>
B> % 
B> % static inline void
B> % counter_64_inc_8b(volatile uint64_t *p, int64_t inc)
B> % {
B> % 
B> % 	__asm __volatile(
B> % 	"movl	%%ds:(%%esi),%%eax\n\t"
B> % 	"movl	%%ds:4(%%esi),%%edx\n"
B> % "1:\n\t"
B> % 	"movl	%%eax,%%ebx\n\t"
B> % 	"movl	%%edx,%%ecx\n\t"
B> % 	"addl	(%%edi),%%ebx\n\t"
B> % 	"adcl	4(%%edi),%%ecx\n\t"
B> % 	"cmpxchg8b %%ds:(%%esi)\n\t"
B> % 	"jnz	1b"
B> % 	:
B> % 	: "S" (p), "D" (&inc)
B> % 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
B> % }
B> % 
B> % uint32_t cpu_feature = 1;
B> % 
B> % typedef volatile uint64_t *counter_u64_t;
B> % 
B> % static void
B> % #if 1
B> % inline
B> % #else
B> % __noinline
B> % #endif
B> % counter_u64_add(counter_u64_t c, int64_t inc)
B> % {
B> % 
B> % #if 1
B> % 	if ((cpu_feature & 1) == 1) {
B> % 		counter_64_inc_8b(c, inc);
B> % 	}
B> % #elif 0
B> % 	if ((cpu_feature & 1) == 1) {
B> % 		*c += inc;
B> % 	}
B> % #else
B> % 	*c += inc;
B> % #endif
B> % }
B> % 
B> % uint64_t mycounter[1];
B> % 
B> % int
B> % main(void)
B> % {
B> % 	unsigned i;
B> % 
B> % 	for (i = 0; i < 1861955704; i++)	/* sysctl -n machdep.tsc_freq */
B> % 		counter_u64_add(mycounter, 1);
B> % 	printf("%ju\n", (uintmax_t)mycounter[0]);
B> % }

Yes, for a single-threaded userland program, a plain "+=" is faster
than all the magic that counter(9) does.

But when multiple threads need to update one counter, "+=" fails on
both precision and performance.

Using "+=" on a per-CPU counter is racy, since "+=" is compiled into a
"load", "increment", "store" sequence. If we are not in a critical
section, we might be removed from the CPU between the load and the
store, and the store then overwrites whatever another thread added to
the same counter in the meantime.
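
To make the race concrete, here is a minimal userland sketch that
replays the bad interleaving by hand instead of waiting for a real
preemption. Everything in it (struct pcpu_slot, the replay_* and
check_replays functions) is made up for illustration and is not kernel
code; it only assumes that two threads on the same CPU both execute
"value += 1" on the same per-CPU slot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one per-CPU counter slot; not a kernel type. */
struct pcpu_slot {
	uint64_t value;
};

/*
 * Replay two threads on the same CPU that both run "slot.value += 1"
 * outside a critical section, with thread A preempted between its
 * load and its store.  Returns the final counter value.
 */
static uint64_t
replay_lost_update(void)
{
	struct pcpu_slot slot = { 0 };
	uint64_t tmp_a, tmp_b;

	tmp_a = slot.value;	/* thread A: load (reads 0) */
	/* thread A is preempted here, before its store */
	tmp_b = slot.value;	/* thread B: load (reads 0) */
	tmp_b += 1;		/* thread B: increment */
	slot.value = tmp_b;	/* thread B: store (writes 1) */
	/* thread A is rescheduled */
	tmp_a += 1;		/* thread A: increment of a stale value */
	slot.value = tmp_a;	/* thread A: store (writes 1 again) */

	return (slot.value);
}

/* The same two increments with no preemption inside either update. */
static uint64_t
replay_serialized(void)
{
	struct pcpu_slot slot = { 0 };

	slot.value += 1;	/* thread A completes its update */
	slot.value += 1;	/* thread B completes its update */
	return (slot.value);
}

/* Two increments were issued in both cases, but one of them is lost. */
static void
check_replays(void)
{
	assert(replay_lost_update() == 1);
	assert(replay_serialized() == 2);
}
```

Calling check_replays() from main() passes silently: the interleaved
replay ends with 1 although two increments were issued, which is the
precision loss described above.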

Entering a critical section means modifying curthread, which is again
a %gs-based load and store. Exiting the critical section has the same
cost. Thus, we assume that a direct %gs-based update of the counter
is cheaper than critical_enter(); counter += foo; critical_exit();
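
The cost argument can be sketched as a rough userland model that just
counts memory updates. It assumes that entering and leaving a critical
section each cost one update of per-thread state, and it ignores any
other work the real critical_enter() and critical_exit() do. All names
below (struct thread_model, struct access_log, the model_*, add_* and
count_* functions) are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for per-thread state; not a kernel type. */
struct thread_model {
	int critnest;		/* critical section nesting level */
};

struct access_log {
	unsigned rmw;		/* read-modify-write operations issued */
};

/* Model of entering a critical section: one update of thread state. */
static void
model_critical_enter(struct thread_model *td, struct access_log *log)
{
	td->critnest++;
	log->rmw++;
}

/* Model of leaving it: a second update of the same state. */
static void
model_critical_exit(struct thread_model *td, struct access_log *log)
{
	assert(td->critnest > 0);
	td->critnest--;
	log->rmw++;
}

/* Protected variant: enter, update the counter, exit. */
static void
add_protected(struct thread_model *td, uint64_t *counter, int64_t inc,
    struct access_log *log)
{
	model_critical_enter(td, log);
	*counter += inc;
	log->rmw++;
	model_critical_exit(td, log);
}

/* Direct variant: models a single add instruction on the counter. */
static void
add_direct(uint64_t *counter, int64_t inc, struct access_log *log)
{
	*counter += inc;
	log->rmw++;
}

/* Number of updates one protected increment issues in this model. */
static unsigned
count_protected(void)
{
	struct thread_model td = { 0 };
	struct access_log log = { 0 };
	uint64_t counter = 0;

	add_protected(&td, &counter, 1, &log);
	return (log.rmw);
}

/* Number of updates one direct increment issues in this model. */
static unsigned
count_direct(void)
{
	struct access_log log = { 0 };
	uint64_t counter = 0;

	add_direct(&counter, 1, &log);
	return (log.rmw);
}
```

In this model count_protected() returns 3 and count_direct() returns 1.
It says nothing about real cycle counts; it only shows why a single
update is expected to be the cheaper path.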

-- 
Totus tuus, Glebius.