svn commit: r252032 - head/sys/amd64/include

Tue Jun 25 06:20:48 UTC 2013

On Tue, Jun 25, 2013 at 12:45:36PM +1000, Bruce Evans wrote:
> On Mon, 24 Jun 2013, Konstantin Belousov wrote:
> 
> > On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote:
> >> The case that can't be fixed by rereading the counters is when fetching
> >> code runs in between the stores.  If the stores are on a another CPU
> >> that is currently executing them, then we can keep checking that the
> >> counters don't change for the maximum time that any pair of stores
> >> take to complete.  This time is quite long and difficult to determine
> >> (but it can be bounded by hard-disabling interrupts in the storer, and
> >> limited by prefetching).  If the stores are on the same CPU, then we
> >> have no good way to detect that another thread is in the middle of
> >> them, or to switch back to that thread.  So we currently prevent this
> >> case using slow methods.
> >
> > We are already explicit about the fetched value potentially not
> > reflecting any real moment value, but indeed, having the fetch result
> > off by 2^32 is not acceptable.
> >
> > We need atomic 64bit loads to fix this, which is provided by the same
> > cmpxchg8b instruction on the 8-byte aligned elements.  Intel guarantees
> > that 8-byte aligned loads and stores are atomic.
> >
> > The following is the prototype for the x86. The other 64bit
> > architectures are handled exactly like amd64. For 32bit !x86 arches,
> > i.e. arm, mips32 and powerpc32, some more investigation is needed.
> 
> That works of course, but is too complicated, machine-dependent and slow
> for me.
>     (I just checked all the pcpu.h implementations and found that
>      all of them except amd64 and i386 still use the quick and racy
>      PCPU_PTR(member) += (value) for PCPU_ADD().  This has 1 sure
>      race and 1 compiler/machine-dependent race:
>      - PCPU_PTR() is only usable if preemption (resulting in rescheduling
>        to run on a different CPU) is prevented in the caller.  Otherwise,
>        the pointer may become invalid immediately iafter it is initialized.
>      - the compiler may generate a racy load-add-store operation for +=.
>        Using PCPU_PTR() makes this a smaller problem than using a thread-
>        local pointer.  It prevents the store going to a different CPU
>        than the load.
>      PCPU_GET() and PCPU_SET() have the first of these problems on all
>      arches except amd64 and i386.  Even curthread wouldn't work in the
>      SMP case if it has the default implementation of PCPU_GET(curthread).
>      However, it has a non-default implementation on all arches except
>      arm and mips.  So even curthread seems to be broken for arm and mips.)
Brokeness (or, from other PoV, a restriction on use of PCPU_GET/SET/INC)
is well-known. For ARM, I believe the in-tree kernels do not support SMP
at all, so curthread as is is not broken. For mips it seems as if the
pcpup located on the wired page which mapping is context-switched, that
should make the curthread functional.

> 
>     The non-optimized <machine/counter.h> avoids these problems using a
>     critical section.  I'm not sure that works.  Anyway, counter.h shouldn't
>     have to understand pcpu better than pcpu.h does in order to work.
Updates to the counter cannot be done from the interrupt context.
The fetch problem of sometime missing the carry bit is present on all 
32bit arches now, I think.

> 
>     The arm implementation of atomic.h is also relevant and is especially
>     interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
>     This seems to only be used by arm in userland.  It is even larger and
>     slower and more complicated than critical sections, so it doesn't seem
>     to be useful for counters.  In the kernel, arm has several
>     implementations depending on machine capabilities.  The ifdefs are
>     hard to understand.  On some machine, it seems to use the equivalent
>     of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
>     interrupts is presumably better than a critical section, and userland
>     has to use ARM_RAS_START for some arches because neither disabling
>     interrupts not critical sections are available in userland.  None
>     of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
>     avoids using atomic ops.  None of this care helps avoid the races in
>     counter.h or make counter.h efficient, since counter.h uses its own
>     methods.  To reach the same quality as atomic.h, counter.h would have
>     to copy half of the ifdefs and methods in atomic.h.
BTW, why do you claim that disabling interrupts is faster and better method
to disable preemption than critical section ?

> 
> The slowness that I care about is only in the counter update.  Anything
> complicated there will be slow.
> 
> My current best design:
> - use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
> - use native-sized or always 32-bit counters.  Counter updates are done
>    by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
>    i386 and use the same method as there.
> - counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
>    counters, when the native-size counters are in danger of overflowing
>    or always, under the mutex.  Transferring data uses an ordinary
>    atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.
>    The 64-bit non-pcpu counters can easily be changed to pcpu counters
>    if the API is extended to support pcpu data.
Mutex cannot be used for counters, I believe this was already discussed
to the end.  The goal of the counters design is to avoid the unneccesary
snoop traffic in the SMP system, mutex makes the hot line bouncing.
This was well understood and benchmarked.

> - run a daemon every few minutes to fetch all the counters, so that
>    the native-sized counters are in no danger of overflowing on systems
>    that don't run statistics programs often enough to fetch the counters
>    to actually use.
Daemon have to
1. track all counters in some global registry
2. run separate thread on each cpu.

I will add the missed MD bits to the current patch for i386, leaving
the known issue for fetches on other 32bit arches for now.

> 
> > ...
> > diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
> > index a98ed40..3222881 100644
> > --- a/sys/kern/subr_counter.c
> > +++ b/sys/kern/subr_counter.c
> > ...
> > @@ -49,14 +51,8 @@ counter_u64_zero(counter_u64_t c)
> > uint64_t
> > counter_u64_fetch(counter_u64_t c)
> > {
> > -	uint64_t r;
> > -	int i;
> >
> > -	r = 0;
> > -	for (i = 0; i < mp_ncpus; i++)
> > -		r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
> > -
> > -	return (r);
> > +	return (counter_u64_fetch_inline(c));
> > }
> 
> With my design:
> 
>  	extern struct mtx counter_locks[];
>  	extern uint64_t counters[];
>  	uint64_t r;
>  	volatile u_int *p;
>  	u_int v;
>  	int cnum;
> 
>  	cnum = ctonum(c);
>  	mtx_lock(&counter_locks[cnum]);	/* might not need 1 per counter */
>  	r = counters[cnum];
>  	for (i = 0; i < mp_ncpus; i++) {
>  		p = (u_int *)((char *)c + sizeof(struct pcpu) * i);
>  		v = *p;			/* don't care if it is stale */
>  		if (v >= 0x80000000) {
>  			/* Transfer only when above critical level. */
>  			while (atomic_cmpset_rel_int(p, v, 0) == 0)
>  				v = *p;	/* still don't care if it is stale */
>  			counters[cnum] += v;
>  		}
>  		r += v;
>  	}
>  	mtx_unlock(&counter_locks[cnum]);
>  	return (r);
> 
> Mutexes give some slowness in the fetching code, but fetches eare expected
> to be relatively rare.
As I noted above, mutexes cause line sharing and snoop traffic.

> 
> I added the complication to usually avoid atomic ops at the last minute.
> The complication might not be worth it.
> 
> Mutexes can be avoided in the usual case too if there is an atomic op
> to load the 64-bit counters, but I don't want to depend on that.  i386
> already has atomic load_acq and store_rel for 64-bit accesses, but
> these are not available on all arches and I consider their existence
> on i386 to be a bug.  On i386. they use cmpxch8b if available, else
> disable interrupts.  The store_rel one is never used, and the load_acq
> one is used twice to handle a tiny problem with TSC frequencies >=
> 2**32 Hz.  You didn't even use them here.  It seems to be just a
> micro-optimization to not use them (in the cmpxch8b case, it avoids
> the lock prefix for the SMP case, and in the critical section case,
> it allows putting the critical section around the whole loop).
Avoiding lock prefix is also essential part of the counter design.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 834 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/svn-src-all/attachments/20130625/f6bb61ab/attachment-0001.sig>