svn commit: r252032 - head/sys/amd64/include

Bruce Evans brde at optusnet.com.au
Tue Jun 25 10:14:52 UTC 2013


On Tue, 25 Jun 2013, Konstantin Belousov wrote:

> On Tue, Jun 25, 2013 at 12:45:36PM +1000, Bruce Evans wrote:
>> On Mon, 24 Jun 2013, Konstantin Belousov wrote:
> ...
>>> The following is the prototype for the x86. The other 64bit
>>> architectures are handled exactly like amd64. For 32bit !x86 arches,
>>> i.e. arm, mips32 and powerpc32, some more investigation is needed.
>>
>> That works of course, but is too complicated, machine-dependent and slow
>> for me.
>> ...
>>     The non-optimized <machine/counter.h> avoids these problems using a
>>     critical section.  I'm not sure that works.  Anyway, counter.h shouldn't
>>     have to understand pcpu better than pcpu.h does in order to work.
> Updates to the counter cannot be done from the interrupt context.

This is fragile, however.  It prevents using counters for things like
counting interrupts.  Most interrupt counting is now done directly
and doesn't use PCPU_INC().  i386/i386 has just 1 use of PCPU_INC():
counting traps in machdep.c.  Traps include nonmaskable interrupts,
which even hard-disabling of interrupts doesn't prevent.  Otherwise,
PCPU_INC() is used mainly for the vm counters, e.g., for page faults.
A page fault trap is not an interrupt, so it shouldn't occur in the
middle of a counter update, and that PCPU_INC() could safely be
replaced by a counter, but this is not clear.
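
To make the window concrete (a sketch only, using the same racy addl/adcl
pair that is timed below; nothing here is from the patch):

% /*
%  * The racy 64-bit update is 2 instructions, so an interrupt or trap
%  * that lands between them and touches the same per-CPU counter sees
%  * the low word updated with the carry not yet applied to the high word.
%  */
% static inline void
% racy_counter_u64_add(volatile uint64_t *c, int64_t inc)
% {
% 
% 	__asm __volatile("addl %%eax,%%ds:%0;"
% 	    /* <-- window: a trap here sees or clobbers a torn value */
% 	    "adcl %%edx,4+%0"
% 	    : "=m" (*c) : "A" (inc));
% }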

> The fetch problem of sometime missing the carry bit is present on all
> 32bit arches now, I think.

Also the zeroing problem.  Since there are no atomic ops in sight, zeroing
has several races.  My transferring code uses atomic ops on only 1 CPU,
so it has half of these races.  See below.

>>     The arm implementation of atomic.h is also relevant and is especially
>>     interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
>>     This seems to only be used by arm in userland.  It is even larger and
>>     slower and more complicated than critical sections, so it doesn't seem
>>     to be useful for counters.  In the kernel, arm has several
>>     implementations depending on machine capabilities.  The ifdefs are
>>     hard to understand.  On some machine, it seems to use the equivalent
>>     of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
>>     interrupts is presumably better than a critical section, and userland
>>     has to use ARM_RAS_START for some arches because neither disabling
>>     interrupts not critical sections are available in userland.  None
>>     of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
>>     avoids using atomic ops.  None of this care helps avoid the races in
>>     counter.h or make counter.h efficient, since counter.h uses its own
>>     methods.  To reach the same quality as atomic.h, counter.h would have
>>     to copy half of the ifdefs and methods in atomic.h.
> BTW, why do you claim that disabling interrupts is faster and better method
> to disable preemption than critical section ?

For arm, it is because if the critical section method were better, then
arm should use it for atomic ops too.

For i386, it is because it reduces the window in which the value is invalid
(other CPUs can see the invalid value), and is easier to use and shorter
(2-3 bytes) and thus better suited to inline code.  My examples of inline
code mostly used it in rarely-executed cases where its speed doesn't matter
and it is just easier to use.  Actual testing shows that it is fairly
efficient, so can be used in all cases.  On Athlon64, it is faster than
cmpxchg8b but slower than an inline optimized critical section.  This
testing is in userland, where there may be an extra penalty for using
cli/sti/pushfl/popfl:

% static inline void
% counter_u64_add(counter_u64_t c, int64_t inc)
% {
% 
% #if 1
% 	/* 20 cycles on A64. */
% 	/* 31 cycles on A64 with lock prefix. */
% 	/* 35 cycles on A64 with mfence prefix. */
% 	if ((cpu_feature & 1) == 1) {
% 		counter_64_inc_8b(c, inc);
% 	}

Old tests said that this takes 18 cycles.  It now takes 20.

Adding a lock or mfence prefix to the cmpxchg8b gives the times stated
above.  This shows how bad using cmpxchg8b is.  We could use atomic ops
for everything and only be 50% slower.  But we shouldn't use either
cmpxchg8b or atomic ops.  We should use a 32-bit addl, which takes about
4 cycles.
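
For concreteness, the update I mean is a single 32-bit add on a u_int pcpu
counter, essentially the counter_u32_add() that is timed in the test program
at the end (sketch only; the naming is made up):

% /* Sketch: native-sized (u_int) pcpu counter bumped by a single addl. */
% static inline void
% counter_u32_add(volatile uint32_t *c, uint32_t inc)
% {
% 
% 	__asm __volatile("addl %1,%%ds:%0" : "=m" (*c) : "ir" (inc));
% }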

% struct thread
% {
% 	uint32_t td_critnest;
% 	uint32_t td_owepreempt;
% } td0;
% 
% volatile struct thread *td = &td0;
% 
% void
% dummy_accum(void)
% {
% }
% 
% void
% fake_mi_switch(void)
% {
% }

Added more scaffolding for a complete critical section emulation.

% static inline void
% alt_counter_u64_add(counter_u64_t c, int64_t inc)
% {
% #if 0
% 	/* 8.5 cycles on A64. */
% 	td->td_critnest++;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest++;
% #elif 0
% 	/* 7 cycles on A64. */
% 	uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest = ocritnest;

Old too-simple critical section emulation.

% #elif 0
% 	/* 11 cycles on A64. */
% 	uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	if (ocritnest == 0) {
% 		td->td_critnest = ocritnest;
% 		if (td->td_owepreempt)
% 			fake_mi_switch();
% 	} else
% 		td->td_critnest = ocritnest;

Full critical section emulation, with everything inlined and optimized.
With the -current extern critical section functions, this would be much
slower, probably somewhere between unlocked cmpxchg8b and locked cmpxchg8b.

% #elif 0
% 	/* 13 cycles on A64. */
% 	uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %%eax,%%ds:%0; adcl %%edx,4+%0"
% 	    : "=m" (*c) : "A" (inc));
% 	if (ocritnest == 0) {
% 		td->td_critnest = ocritnest;
% 		if (td->td_owepreempt)
% 			fake_mi_switch();
% 	} else
% 		td->td_critnest = ocritnest;

Same as above, except for a full 64-bit addition.  Unfortunately, the
extra adcl doesn't seem to be done in parallel, since it takes 2 cycles
longer.  The code is rather large and might be hard to fit well in
pipelines.

% #elif 0
% 	/* 9 cycles on A64. */
% 	__asm __volatile("addl %%eax,%%ds:%0; adcl %%edx,4+%0"
% 	    : "=m" (*c) : "A" (inc));

Another new test.  Do a full 64-bit add in a racy way.  This didn't
pipeline well either.  In minor variations of this on the same A64 CPU,
I saw phenomena like a single addl taking 5.5 cycles but the same addl
followed by an adcl $0 taking only 4 cycles.  4 cycles is what they
should take, and 2 instructions being faster than 1 makes some sense:
the 2 instructions fill a whole 64-bit word, so the hardware has to do
less combining of stores with the existing contents of write buffers.

Oops, I forgot one %%ds prefix here and below.

% #elif 0
% 	/* 12 cycles on A64. */
% 	__asm __volatile("cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; sti"
% 	    : "=m" (*c) : "A" (inc));

"cli/sti" is very fast on Athlon64.  This is the fastest complete 64-bit
update method not using a daemon or special cases found so far for Athlon64.
cli/sti also took 12 cycles without the adcl.

% #elif 0
% 	/* 20 cycles on A64. */
% 	__asm __volatile(
% 	    "pushfl; cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; popfl"
% 	    : "=m" (*c) : "A" (inc));

Unfortunately, cli/sti is fragile (sti unconditionally re-enables interrupts,
which is wrong if they were already disabled), so we might need a reentrant
variant like this, and pushfl/popfl is quite slow.

% #elif 1
% 	/* 16 cycles on A64. */
% 	__asm __volatile(
% 	    "pushfl; cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; popl %%ecx; testl $0x200,%%ecx; je 8f; sti; 8:;"
% 	    : "=m" (*c) : "A" (inc) : "ecx");

Not doing the full popfl is not so slow.  Hopefully this is faster in the
kernel since it involves fewer privilege checks.

This method is 20% faster than the cmpxchg8b method.  On core2, the
cmpxchg8b method only takes 14 cycles due to not having a store-to-load
mismatch penalty, so it would be faster than disabling interrupts if
disabling interrupts has the same slowness.

Full test program at the end.

>> The slowness that I care about is only in the counter update.  Anything
>> complicated there will be slow.
>>
>> My current best design:
>> - use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
>> - use native-sized or always 32-bit counters.  Counter updates are done
>>    by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
>>    i386 and use the same method as there.
>> - counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
>>    counters, when the native-size counters are in danger of overflowing
>>    or always, under the mutex.  Transferring data uses an ordinary
>>    atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.
>>    The 64-bit non-pcpu counters can easily be changed to pcpu counters
>>    if the API is extended to support pcpu data.
> Mutex cannot be used for counters, I believe this was already discussed
> to the end.  The goal of the counters design is to avoid the unneccesary
> snoop traffic in the SMP system, mutex makes the hot line bouncing.
> This was well understood and benchmarked.

Yes it can.  See my code.  The mutex is only accessed infrequently.

>> - run a daemon every few minutes to fetch all the counters, so that
>>    the native-sized counters are in no danger of overflowing on systems
>>    that don't run statistics programs often enough to fetch the counters
>>    to actually use.
> Daemon have to
> 1. track all counters in some global registry
> 2. run separate thread on each cpu.

Not very difficult.
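
For the registry half, something like this would do (a sketch with made-up
names, not existing code; counter allocation links each counter into a list
and the daemon just walks it):

% /* Sketch (hypothetical): a registry of counters for the daemon to walk. */
% #include <sys/param.h>
% #include <sys/counter.h>
% #include <sys/lock.h>
% #include <sys/mutex.h>
% #include <sys/queue.h>
% 
% struct counter_reg {
% 	SLIST_ENTRY(counter_reg) cr_link;
% 	counter_u64_t		cr_cnt;
% };
% 
% static SLIST_HEAD(, counter_reg) counter_registry =
%     SLIST_HEAD_INITIALIZER(counter_registry);
% static struct mtx counter_registry_mtx;	/* mtx_init()ed at boot */
% 
% static void
% counter_register(struct counter_reg *cr, counter_u64_t cnt)
% {
% 
% 	cr->cr_cnt = cnt;
% 	mtx_lock(&counter_registry_mtx);
% 	SLIST_INSERT_HEAD(&counter_registry, cr, cr_link);
% 	mtx_unlock(&counter_registry_mtx);
% }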

> I will add the missed MD bits to the current patch for i386, leaving
> the known issue for fetches on other 32bit arches for now.

That also leaves the known issue of zeroing being broken on all arches.
Using a daemon is a good way to fix this too: you have to run the
daemon on each CPU and wait for it to complete whenever a counter
is zeroed.
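
One way to get that effect (a sketch only; the names and the binding
approach are my own, not existing code) is to bind the zeroing thread to
each CPU in turn, so the stores cannot race with an in-progress update on
that CPU, provided updates cannot be preempted in the middle (critical
section, cli, or a single instruction):

% /* Sketch: visit each CPU and zero its half of the counter there. */
% #include <sys/param.h>
% #include <sys/systm.h>
% #include <sys/counter.h>
% #include <sys/lock.h>
% #include <sys/mutex.h>
% #include <sys/pcpu.h>
% #include <sys/proc.h>
% #include <sys/sched.h>
% #include <sys/smp.h>
% 
% static void
% counter_u64_zero_sketch(counter_u64_t c)
% {
% 	int cpu;
% 
% 	CPU_FOREACH(cpu) {
% 		thread_lock(curthread);
% 		sched_bind(curthread, cpu);
% 		thread_unlock(curthread);
% 		critical_enter();
% 		*(uint64_t *)((char *)c + sizeof(struct pcpu) * cpu) = 0;
% 		critical_exit();
% 		thread_lock(curthread);
% 		sched_unbind(curthread);
% 		thread_unlock(curthread);
% 	}
% }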

>> With my design:
>>
>>  	extern struct mtx counter_locks[];
>>  	extern uint64_t counters[];
>>  	uint64_t r;
>>  	volatile u_int *p;
>>  	u_int v;
>>  	int cnum;
>>
>>  	cnum = ctonum(c);
>>  	mtx_lock(&counter_locks[cnum]);	/* might not need 1 per counter */
>>  	r = counters[cnum];
>>  	for (i = 0; i < mp_ncpus; i++) {
>>  		p = (u_int *)((char *)c + sizeof(struct pcpu) * i);
>>  		v = *p;			/* don't care if it is stale */
>>  		if (v >= 0x80000000) {
>>  			/* Transfer only when above critical level. */
>>  			while (atomic_cmpset_rel_int(p, v, 0) == 0)
>>  				v = *p;	/* still don't care if it is stale */
>>  			counters[cnum] += v;

This transfer was supposed to avoid the need for the daemon, but it has
much the same bugs as the zeroing code.  It only uses atomic ops on 1
CPU.  The other CPUs doing updates don't use atomic ops, so their plain
load-add-store sequences can overwrite the zeroing done here, and the
transferred amount then gets counted twice.

>>  		}
>>  		r += v;
>>  	}
>>  	mtx_unlock(&counter_locks[cnum]);
>>  	return (r);
>>
>> Mutexes give some slowness in the fetching code, but fetches eare expected
>> to be relatively rare.
> As I noted above, mutexes cause line sharing and snoop traffic.

Unimportant, since this code is rarely executed.

If its efficiency were important, then this API wouldn't be used at all.
Fetching a single counter at a time using the slow sysctl() API is very
inefficient: hundreds of times slower than reading thousands of (packed)
variables in a single 4K block from kmem.  The slowness was noticeable in some
statistics programs in 1993, but no one notices it now.

>> Mutexes can be avoided in the usual case too if there is an atomic op
>> to load the 64-bit counters, but I don't want to depend on that.  i386
>> already has atomic load_acq and store_rel for 64-bit accesses, but
>> these are not available on all arches and I consider their existence
>> on i386 to be a bug.  On i386. they use cmpxch8b if available, else
>> disable interrupts.  The store_rel one is never used, and the load_acq
>> one is used twice to handle a tiny problem with TSC frequencies >=
>> 2**32 Hz.  You didn't even use them here.  It seems to be just a
>> micro-optimization to not use them (in the cmpxch8b case, it avoids
>> the lock prefix for the SMP case, and in the critical section case,
>> it allows putting the critical section around the whole loop).
> Avoiding lock prefix is also essential part of the counter design.

Not important in the fetching code, since it is rarely executed.  In
the update code, avoiding it gains very little when it is replaced
by a slow method.
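
If the update code did pay for the MI atomic op, it would look something
like this (a sketch using atomic_cmpset_int(); it is essentially the
cmpxchgl variants near the end of the test program, and it is also what
would be needed to make the transfer/zeroing above safe against concurrent
updaters):

% /* Sketch: updater that cannot undo a concurrent transfer or zeroing. */
% #include <sys/types.h>
% #include <machine/atomic.h>
% 
% static inline void
% counter_u32_add_atomic(volatile u_int *p, u_int inc)
% {
% 	u_int old;
% 
% 	do {
% 		old = *p;
% 	} while (atomic_cmpset_int(p, old, old + inc) == 0);
% }

The unlocked cmpxchgl version of this takes 8.5-9 cycles in the tests
below, against about 4 for a plain addl, and atomic_cmpset_int() adds a
lock prefix on SMP kernels, so it would be slower still.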

Current test program with ifdefs configured to time the modified
pushfl/cli;popfl method:

% #include <fcntl.h>
% #include <stdint.h>
% #include <stdio.h>
% 
% static inline void
% counter_64_inc_8b(volatile uint64_t *p, int64_t inc)
% {
% 
% 	__asm __volatile(
% 	"movl	%%ds:(%%esi),%%eax\n\t"
% 	"movl	%%ds:4(%%esi),%%edx\n"
% "1:\n\t"
% 	"movl	%%eax,%%ebx\n\t"
% 	"movl	%%edx,%%ecx\n\t"
% 	"addl	(%%edi),%%ebx\n\t"
% 	"adcl	4(%%edi),%%ecx\n\t"
% #ifdef LOCK
% 	"lock\n\t"
% #endif
% #ifdef MFENCE
% 	"mfence\n\t"
% #endif
% 	"cmpxchg8b %%ds:(%%esi)\n\t"
% 	"jnz	1b"
% 	:
% 	: "S" (p), "D" (&inc)
% 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
% }
% 
% uint32_t cpu_feature = 1;
% 
% /*
%  * These shouldn't be volatile.  Volatile just avoids breaking the simulation
%  * by preventing optimizations that aren't possible for pcpu variables.
%  */
% typedef volatile uint64_t *counter_u64_t;
% typedef volatile uint32_t *counter_u32_t;
% 
% static inline void
% counter_u64_add(counter_u64_t c, int64_t inc)
% {
% 
% #if 1
% 	/* 20 cycles on A64. */
% 	/* 31 cycles on A64 with lock prefix. */
% 	/* 35 cycles on A64 with mfence prefix. */
% 	if ((cpu_feature & 1) == 1) {
% 		counter_64_inc_8b(c, inc);
% 	}
% #elif 0
% 	/* 5-8 cycles on A64 (gcc-4.2.1 much slower than gcc-3.3.3). */
% 	if ((cpu_feature & 1) == 1) {
% 		*c += inc;
% 	}
% #else
% 	/* 4-7 cycles on A64. */
% 	*c += inc;
% #endif
% }
% 
% static inline void
% counter_u32_add(counter_u32_t c, int32_t inc)
% {
% 
% #if 1
% 	/* 5.5-7 cycles on A64. */
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% #else
% 	/* 4-7 cycles on A64. */
% 	*c += inc;
% #endif
% }
% 
% struct thread
% {
% 	uint32_t td_critnest;
% 	uint32_t td_owepreempt;
% } td0;
% 
% volatile struct thread *td = &td0;
% 
% void
% dummy_accum(void)
% {
% }
% 
% void
% fake_mi_switch(void)
% {
% }
% 
% static inline void
% alt_counter_u64_add(counter_u64_t c, int64_t inc)
% {
% #if 0
% 	/* 8.5 cycles on A64. */
% 	td->td_critnest++;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest++;
% #elif 0
% 	/* 7 cycles on A64. */
% 	uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest = ocritnest;
% #elif 0
% 	/* 11 cycles on A64. */
% 	uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	if (ocritnest == 0) {
% 		td->td_critnest = ocritnest;
% 		if (td->td_owepreempt)
% 			fake_mi_switch();
% 	} else
% 		td->td_critnest = ocritnest;
% #elif 0
% 	/* 13 cycles on A64. */
% 	uint32_t ocritnest;
% 
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %%eax,%%ds:%0; adcl %%edx,4+%0"
% 	    : "=m" (*c) : "A" (inc));
% 	if (ocritnest == 0) {
% 		td->td_critnest = ocritnest;
% 		if (td->td_owepreempt)
% 			fake_mi_switch();
% 	} else
% 		td->td_critnest = ocritnest;
% #elif 0
% 	/* 9 cycles on A64. */
% 	__asm __volatile("addl %%eax,%%ds:%0; adcl %%edx,4+%0"
% 	    : "=m" (*c) : "A" (inc));
% #elif 0
% 	/* 12 cycles on A64. */
% 	__asm __volatile("cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; sti"
% 	    : "=m" (*c) : "A" (inc));
% #elif 0
% 	/* 20 cycles on A64. */
% 	__asm __volatile(
% 	    "pushfl; cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; popfl"
% 	    : "=m" (*c) : "A" (inc));
% #elif 1
% 	/* 16 cycles on A64. */
% 	__asm __volatile(
% 	    "pushfl; cli; addl %%eax,%%ds:%0; adcl %%edx,4+%0; popl %%ecx; testl $0x200,%%ecx; je 8f; sti; 8:;"
% 	    : "=m" (*c) : "A" (inc) : "ecx");
% #elif 0
% 	/* Support only 32-bit non-negative increments. */
% 	/* 4-7 cycles on A64 (varies with gcc version, -O* and -static). */
% 	__asm __volatile("	\n\
% 	addl	%1,%%ds:%0	\n\
% 	jnc	8f		\n\
% 	call	dummy_accum	\n\
% 8: ;				\n\
% 	"
% 	: "=m" (*c) : "r" ((uint32_t)inc));
% #elif 0
% 	/* As in previous, but allow immediate operand. */
% 	/* 4-7 cycles on A64 (varies with gcc version, -O* and -static). */
% 	/*
% 	 * Using the immediate operand tends to be slower!
% 	 *
% 	 * But note that we don't use the memory clobber pessimization.
% 	 * This allows the inc operand (and the loop counter) to stay
% 	 * in registers across the call here, so the register operand
% 	 * is naturally faster.  gcc should prefer the register operand
% 	 * anyway in this case, but is not that smart.  It even prefers
% 	 * the register operand when the immediate operand is disparaged.
% 	 * Sometimes the optimization (just loop alignment?) accidentally
% 	 * compensates for making the worst choice.
% 	 */
% 	/* 5 cycles on A64. */
% 	__asm __volatile("	\n\
% 	addl	%1,%%ds:%0	\n\
% 	jnc	8f		\n\
% 	call	dummy_accum	\n\
% 8: ;				\n\
% 	"
% 	: "=m,m" (*c) : "?i,r" ((uint32_t)inc));
% #elif 0
% 	/* As in previous, but go through a register.  Neg. incr. now work. */
% 	/* 8.5-9 cycles on A64. */
% 	uint32_t exp, src;
% 
% 	__asm __volatile("	\n\
% 	movl	%%ds:%4,%1	\n\
% 1:				\n\
% 	movl	%1,%2		\n\
% 	addl	%3,%2		\n\
% 	jnc	7f		\n\
% 	call	dummy_accum	\n\
% 	jmp	8f		\n\
% 7:				\n\
% 	cmpxchgl %2,%%ds:%0	\n\
% 	jne	1b		\n\
% 8: ;				\n\
% 	"
% 	: "=m" (*c),		/* 0 */
% 	  "=a" (exp),		/* 1 */
% 	  "=r" (src)		/* 2 */
% 	: "ir" ((uint32_t)inc),	/* 3 */ /* don't try to disparage "i" */
% 	  "m" (*c)		/* 4 */
% 	);
% #else
% 	/* Full 64-bit version. */
% 	/* 8-9 cycles on A64. */
% 	uint32_t exp, src;
% 
% 	__asm __volatile("	\n\
% 	/* Runtime testing of %4 is wasteful if inc is constant. */	\n\
% 	/* Especially with the high register pressure. */		\n\
% 	testl	%4,%4		\n\
% 	je	1f		\n\
% 	call	dummy_accum	\n\
% 	jmp	8f		\n\
% 1:				\n\
% 	movl	%%ds:%5,%1	\n\
% 1:				\n\
% 	movl	%1,%2		\n\
% 	addl	%3,%2		\n\
% 	jnc	7f		\n\
% 	call	dummy_accum	\n\
% 	jmp	8f		\n\
% 7:				\n\
% 	cmpxchgl %2,%%ds:%0	\n\
% 	jne	1b		\n\
% 8: ;				\n\
% 	"
% 	: "=m" (*c),		/* 0 */
% 	  "=a" (exp),		/* 1 */
% 	  "=r" (src)		/* 2 */
% 	: "r" ((uint32_t)inc),	/* 3 */
% 	  "r" ((uint32_t)(inc >> 32)),	/* 4 */
% 	  "m" (*c)		/* 5 */
% 	);
% #endif
% }
% 
% uint64_t mycounter[1];
% uint32_t my32counter[1];
% 
% int
% main(void)
% {
% 	unsigned i;
% 
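% 	/* Opening /dev/io raises IOPL so the cli/sti variants work in userland. */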
% 	open("/dev/io", 2);
% 	for (i = 0; i < 2010168339; i++)	/* sysctl -n machdep.tsc_freq */
% #if 0
% 		counter_u64_add(mycounter, 1);
% #elif 1
% 		alt_counter_u64_add(mycounter, 1);
% #else
% 		counter_u32_add(my32counter, 1);
% #endif
% 	printf("%ju\n", (uintmax_t)mycounter[0]);
% 	return (mycounter[0] == 2010168339 ? 0 : 1);
% }
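
A note on reading the numbers: the program doesn't time itself.  The loop
count equals machdep.tsc_freq, so the real time in seconds reported by
time(1) is directly the number of cycles per iteration (about 16 seconds
on the A64 for the variant configured above).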

I got some kernel panics in FreeBSD-~5.2 from errors in the cli/sti method.
After forgetting the sti in this program, the system hung for a while and
the kernel eventually panicked for an unexpected trap with interrupts
disabled.  We have unfinished mail exchanges about fixing this trap
handling.  My kernel is apparently too aggressive about asserting that
things can't happen.

Bruce

