svn commit: r252032 - head/sys/amd64/include

Tue Jun 25 02:45:45 UTC 2013

On Mon, 24 Jun 2013, Konstantin Belousov wrote:

> On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote:
>> The case that can't be fixed by rereading the counters is when fetching
>> code runs in between the stores.  If the stores are on a another CPU
>> that is currently executing them, then we can keep checking that the
>> counters don't change for the maximum time that any pair of stores
>> take to complete.  This time is quite long and difficult to determine
>> (but it can be bounded by hard-disabling interrupts in the storer, and
>> limited by prefetching).  If the stores are on the same CPU, then we
>> have no good way to detect that another thread is in the middle of
>> them, or to switch back to that thread.  So we currently prevent this
>> case using slow methods.
>
> We are already explicit about the fetched value potentially not
> reflecting any real moment value, but indeed, having the fetch result
> off by 2^32 is not acceptable.
>
> We need atomic 64bit loads to fix this, which is provided by the same
> cmpxchg8b instruction on the 8-byte aligned elements.  Intel guarantees
> that 8-byte aligned loads and stores are atomic.
>
> The following is the prototype for the x86. The other 64bit
> architectures are handled exactly like amd64. For 32bit !x86 arches,
> i.e. arm, mips32 and powerpc32, some more investigation is needed.

That works of course, but is too complicated, machine-dependent and slow
for me.
    (I just checked all the pcpu.h implementations and found that
     all of them except amd64 and i386 still use the quick and racy
     PCPU_PTR(member) += (value) for PCPU_ADD().  This has 1 sure
     race and 1 compiler/machine-dependent race:
     - PCPU_PTR() is only usable if preemption (resulting in rescheduling
       to run on a different CPU) is prevented in the caller.  Otherwise,
       the pointer may become invalid immediately iafter it is initialized.
     - the compiler may generate a racy load-add-store operation for +=.
       Using PCPU_PTR() makes this a smaller problem than using a thread-
       local pointer.  It prevents the store going to a different CPU
       than the load.
     PCPU_GET() and PCPU_SET() have the first of these problems on all
     arches except amd64 and i386.  Even curthread wouldn't work in the
     SMP case if it has the default implementation of PCPU_GET(curthread).
     However, it has a non-default implementation on all arches except
     arm and mips.  So even curthread seems to be broken for arm and mips.)

    The non-optimized <machine/counter.h> avoids these problems using a
    critical section.  I'm not sure that works.  Anyway, counter.h shouldn't
    have to understand pcpu better than pcpu.h does in order to work.

    The arm implementation of atomic.h is also relevant and is especially
    interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
    This seems to only be used by arm in userland.  It is even larger and
    slower and more complicated than critical sections, so it doesn't seem
    to be useful for counters.  In the kernel, arm has several
    implementations depending on machine capabilities.  The ifdefs are
    hard to understand.  On some machine, it seems to use the equivalent
    of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
    interrupts is presumably better than a critical section, and userland
    has to use ARM_RAS_START for some arches because neither disabling
    interrupts not critical sections are available in userland.  None
    of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
    avoids using atomic ops.  None of this care helps avoid the races in
    counter.h or make counter.h efficient, since counter.h uses its own
    methods.  To reach the same quality as atomic.h, counter.h would have
    to copy half of the ifdefs and methods in atomic.h.

The slowness that I care about is only in the counter update.  Anything
complicated there will be slow.

My current best design:
- use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
- use native-sized or always 32-bit counters.  Counter updates are done
   by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
   i386 and use the same method as there.
- counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
   counters, when the native-size counters are in danger of overflowing
   or always, under the mutex.  Transferring data uses an ordinary
   atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.
   The 64-bit non-pcpu counters can easily be changed to pcpu counters
   if the API is extended to support pcpu data.
- run a daemon every few minutes to fetch all the counters, so that
   the native-sized counters are in no danger of overflowing on systems
   that don't run statistics programs often enough to fetch the counters
   to actually use.

> ...
> diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
> index a98ed40..3222881 100644
> --- a/sys/kern/subr_counter.c
> +++ b/sys/kern/subr_counter.c
> ...
> @@ -49,14 +51,8 @@ counter_u64_zero(counter_u64_t c)
> uint64_t
> counter_u64_fetch(counter_u64_t c)
> {
> -	uint64_t r;
> -	int i;
>
> -	r = 0;
> -	for (i = 0; i < mp_ncpus; i++)
> -		r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
> -
> -	return (r);
> +	return (counter_u64_fetch_inline(c));
> }

With my design:

 	extern struct mtx counter_locks[];
 	extern uint64_t counters[];
 	uint64_t r;
 	volatile u_int *p;
 	u_int v;
 	int cnum;

 	cnum = ctonum(c);
 	mtx_lock(&counter_locks[cnum]);	/* might not need 1 per counter */
 	r = counters[cnum];
 	for (i = 0; i < mp_ncpus; i++) {
 		p = (u_int *)((char *)c + sizeof(struct pcpu) * i);
 		v = *p;			/* don't care if it is stale */
 		if (v >= 0x80000000) {
 			/* Transfer only when above critical level. */
 			while (atomic_cmpset_rel_int(p, v, 0) == 0)
 				v = *p;	/* still don't care if it is stale */
 			counters[cnum] += v;
 		}
 		r += v;
 	}
 	mtx_unlock(&counter_locks[cnum]);
 	return (r);

Mutexes give some slowness in the fetching code, but fetches eare expected
to be relatively rare.

I added the complication to usually avoid atomic ops at the last minute.
The complication might not be worth it.

Mutexes can be avoided in the usual case too if there is an atomic op
to load the 64-bit counters, but I don't want to depend on that.  i386
already has atomic load_acq and store_rel for 64-bit accesses, but
these are not available on all arches and I consider their existence
on i386 to be a bug.  On i386. they use cmpxch8b if available, else
disable interrupts.  The store_rel one is never used, and the load_acq
one is used twice to handle a tiny problem with TSC frequencies >=
2**32 Hz.  You didn't even use them here.  It seems to be just a
micro-optimization to not use them (in the cmpxch8b case, it avoids
the lock prefix for the SMP case, and in the critical section case,
it allows putting the critical section around the whole loop).

Bruce