svn commit: r252032 - head/sys/amd64/include

Bruce Evans brde at optusnet.com.au
Thu Jun 20 22:45:43 UTC 2013


On Fri, 21 Jun 2013, I wrote:

> On Thu, 20 Jun 2013, Konstantin Belousov wrote:
>> ...
>> @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t
>> ...
> The i386 version of the counter asm doesn't support the immediate
> constraint for technical reasons.  64 bit counters are too large and
> slow to use on i386, especially when they are implemented as they are
> without races.

Actual testing showed that it is only about twice as slow as a direct
increment.  With the enclosed test program (a userland version hacked
on a bit to avoid pcpu), on ref10-i386 the times are:
- loop overhead:                                        1 cycle
- direct unlocked increment of a uint32_t:              6 cycles
- direct unlocked increment of a uint64_t:              7 cycles
- non-inline function unlocked increment of a uint64_t: 7.5 cycles
- counter_u64_add():                                   14 cycles
- non-inline counter_u64_add():                        18 cycles

Add many more cycles when critical_enter()/critical_exit() is needed.

I thought that a direct increment of a uint32_t took only 3 cycles.  This
is the documented time for i486.  4 cycles latency is documented for
AthlonXP/64.  The carry check for incrementing a uint64_t is pipelined
on most modern i386, so it adds very little to this.
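
To make the carry check concrete, here is a minimal userland sketch of
a 64-bit increment done as two 32-bit halves, which is roughly what a
32-bit CPU has to do.  The names (struct split64, split64_inc,
split64_value) are made up for illustration and are not from the
counter(9) code.  The two halves are not updated atomically, so this
only models the unlocked increment that was timed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: a 64-bit counter kept as two 32-bit words. */
struct split64 {
	uint32_t lo;	/* low 32 bits */
	uint32_t hi;	/* high 32 bits */
};

/*
 * Increment the low word and propagate the carry into the high word.
 * Not atomic: a reader on another CPU could see lo wrapped but hi not
 * yet incremented.
 */
static void
split64_inc(struct split64 *c)
{
	assert(c != NULL);
	c->lo++;
	if (c->lo == 0)		/* low word wrapped, so carry */
		c->hi++;
}

/* Reassemble the two halves into one 64-bit value. */
static uint64_t
split64_value(const struct split64 *c)
{
	assert(c != NULL);
	return (((uint64_t)c->hi << 32) | c->lo);
}
```

The branch on the carry is almost never taken, which fits with the
small measured difference between the uint32_t and uint64_t cases.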

Nevertheless, the correct implementation of counters, once you have the
complexity of a counter API and can't just do counter++, is to use small
counters and run a daemon to accumulate them in larger counters before
they overflow.  pcpu accesses should allow simple counter++ accesses to
work for the smaller counters (except their address is in pcpu space).
But I don't see how sysctl accesses can work without lots of context
switches to reach strictly per-CPU context.  The current accumulation
of pcpu counters in places like vcnt() doesn't do that -- it accesses
pcpu counters for other CPUs, so it has races.  The races are more serious
when accumulating counters into larger ones, because the smaller ones then
need to be copied and cleared in one atomic step.  The accumulation daemon(s)
can run per-CPU more easily than sysctls, since the daemons don't need
to run as frequently as sysctls might.
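
A rough single-threaded sketch of the small-counter scheme described
above, under stated assumptions: NCPU, struct small_counter, sc_inc(),
sc_harvest() and sc_read() are all hypothetical names, and the "CPU" is
just an array index.  In a kernel, sc_inc() and sc_harvest() for a given
slot would both have to run on the owning CPU with preemption blocked,
which is what makes the copy-and-clear step safe without atomic ops.

```c
#include <assert.h>
#include <stdint.h>

#define	NCPU	4	/* hypothetical CPU count for the sketch */

struct small_counter {
	uint32_t pcpu[NCPU];	/* small per-CPU counters (fast path) */
	uint64_t total;		/* large accumulated counter */
};

/* Fast path: a plain increment of this CPU's small counter. */
static void
sc_inc(struct small_counter *c, int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	c->pcpu[cpu]++;
}

/*
 * Daemon step for one CPU: copy the small counter and clear it, then
 * fold the copy into the large counter.  Assumed to run on 'cpu'
 * itself so that no sc_inc() can interleave between copy and clear.
 * A real version would also need to serialize updates of 'total'.
 */
static void
sc_harvest(struct small_counter *c, int cpu)
{
	uint32_t n;

	assert(cpu >= 0 && cpu < NCPU);
	n = c->pcpu[cpu];
	c->pcpu[cpu] = 0;
	c->total += n;
}

/* Reader (e.g. a sysctl): sees only what has been harvested so far. */
static uint64_t
sc_read(const struct small_counter *c)
{
	return (c->total);
}
```

The daemon only has to visit each CPU often enough that a 32-bit
counter cannot wrap between visits; readers get a slightly stale total
in exchange for never touching another CPU's counter.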

Bruce

