[CFR][CFT] counter(9): new API for faster and raceless counters
Gleb Smirnoff
glebius at FreeBSD.org
Tue Apr 2 14:17:20 UTC 2013
Luigi,
On Tue, Apr 02, 2013 at 03:37:58AM +0200, Luigi Rizzo wrote:
L> API:
L>
L> > o MI implementation of counter_u64_add() is:
L> >
L> > critical_enter();
L> > *(uint64_t *)zpcpu_get(c) += inc;
L> > critical_exit();
L>
L> - there are several places which use multiple counters
L> (e.g. packet and byte counters, global and per flow/socket),
L> so i wonder if it may help to provide a "protected" version of
L> counter_u64_add() that requires the critical_enter/exit
L> only once. Something like
L>
L> PROTECT_COUNTERS(
L> safe_counter_u64_add(c, x);
L> safe_counter_u64_add(c, x);
L> safe_counter_u64_add(c, x);
L> );
L>
L> where PROTECT_COUNTERS() would translate into the critical_enter/exit
L> where required, and nothing on other architectures.
Here is a patch for review. It adds four more primitives:
counter_enter();
counter_u64_add_prot(c, x);
counter_u64_subtract_prot(c, x);
counter_exit();
L> BENCHMARK:
L>
L> > I've got a simple benchmark. A syscall that solely updates a counter is
L> > implemented. Number of processes is spawned, equal to number of cores,
L> > each process binds itself to a dedicated CPU and calls the syscall 10^6
L> > times and exits. Parent wait(2)s for them all and I measure real time of
L>
L> - I am under the impression that these benchmarks are dominated
L> by the syscall time, and the new counters would exhibit a lot
L> better relative performance (compared to racy or atomic)
L> by doing 10..100 counter ops per syscall. Any chance to retry
L> the test in this configuration ?
Ok, as you wish.
Apparently the compiler optimises away loops like:
for (int i = 0; i < rounds; i++)
the_counter += v;
To avoid that optimisation I declared the_counter as volatile. Is the
benchmark still fair after that? Anyway, here are results for (rounds == 2000000000):
x counter_u64_add(), result == 2000000000 * ncpus
+ racy increment, result == 2022232241 (!!!)
+------------------------------------------------------------------------------+
| x + |
| x + |
| x ++ |
| x ++ |
| x x +++ +|
||_MA__| |_MA_| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 5 5.44 5.01 5.053 0.13605963
+ 10 8.16 8.53 8.2 8.235 0.10721215
Difference at 95.0% confidence
3.182 +/- 0.115089
62.9725% +/- 2.27764%
(Student's t, pooled s = 0.122488)
So a 63% speedup, not to mention the fact that in such a tight loop about 98% of
parallel updates are lost on the racy counter :)
A tight loop with atomic_add() is 22 times (twenty-two times) slower than the
new counter. I didn't bother to run ministat :)
--
Totus tuus, Glebius.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: counter_API_extend.diff
Type: text/x-diff
Size: 7762 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20130402/17e518b3/attachment.diff>