[CFR][CFT] counter(9): new API for faster and raceless counters

Gleb Smirnoff glebius at FreeBSD.org
Tue Apr 2 14:17:20 UTC 2013


  Luigi,

On Tue, Apr 02, 2013 at 03:37:58AM +0200, Luigi Rizzo wrote:
L> API:
L> 
L> > o MI implementation of counter_u64_add() is:
L> > 
L> >      critical_enter();
L> >      *(uint64_t *)zpcpu_get(c) += inc;
L> >      critical_exit();
L> 
L> - there are several places which use multiple counters
L>   (e.g. packet and byte counters, global and per flow/socket),
L>   so I wonder if it may help to provide a "protected" version of
L>   counter_u64_add() that requires the critical_enter/exit
L>   only once. Something like
L> 
L> 	PROTECT_COUNTERS(
L> 		safe_counter_u64_add(c, x);
L> 		safe_counter_u64_add(c, x);
L> 		safe_counter_u64_add(c, x);
L> 	);
L> 
L>   where PROTECT_COUNTERS() would translate into the critical_enter/exit
L>   where required, and nothing on other architectures.

Here is a patch for review. It adds four more primitives:

counter_enter();
counter_u64_add_prot(c, x);
counter_u64_subtract_prot(c, x);
counter_exit();
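
For illustration, here is a hedged sketch of the batched usage this
enables, along the lines of your packet-and-byte example; the two
counters and the mbuf pointer are hypothetical:

	/*
	 * Sketch: bump hypothetical per-interface packet and byte
	 * counters under a single protection section.
	 */
	counter_enter();
	counter_u64_add_prot(ifp_pkts, 1);
	counter_u64_add_prot(ifp_bytes, (uint64_t)m->m_pkthdr.len);
	counter_exit();

On architectures where counter_u64_add() is a single interrupt-safe
instruction, counter_enter()/counter_exit() should compile down to
nothing, which is exactly the PROTECT_COUNTERS() behaviour you describe.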

L> BENCHMARK:
L> 
L> > I've got a simple benchmark. A syscall that solely updates a counter is
L> > implemented. A number of processes equal to the number of cores is
L> > spawned; each process binds itself to a dedicated CPU, calls the syscall
L> > 10^6 times, and exits. Parent wait(2)s for them all and I measure real time of
L> 
L> - I am under the impression that these benchmarks are dominated
L>   by the syscall time, and that the new counters would show much
L>   better relative performance (compared to racy or atomic)
L>   by doing 10..100 counter ops per syscall. Any chance to retry
L>   the test in this configuration?

Ok, as you wish.
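
For context, the harness from the quote above has roughly this shape;
a hedged sketch, where the test syscall number and all names are
assumptions:

	#include <sys/param.h>
	#include <sys/cpuset.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define	SYS_counter_test	210	/* hypothetical test syscall */

	static void
	child(int cpu, long rounds)
	{
		cpuset_t set;

		/* Bind this process to its dedicated CPU. */
		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
		    sizeof(set), &set);
		/* Each call enters the kernel and bumps the counter. */
		for (long i = 0; i < rounds; i++)
			syscall(SYS_counter_test);
		_exit(0);
	}

	int
	main(void)
	{
		long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

		for (long cpu = 0; cpu < ncpus; cpu++)
			if (fork() == 0)
				child((int)cpu, 1000000L);
		/* Parent wait(2)s for all children; this is what is timed. */
		while (wait(NULL) > 0)
			;
		return (0);
	}

For the retest below, the loop apparently runs entirely on the kernel
side, so syscall overhead no longer dominates.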

Apparently the compiler optimises away loops like:

	for (int i = 0; i < rounds; i++)
		the_counter += v;

To avoid such optimisations I declared the_counter as volatile. Is the
benchmark fair after that?
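A minimal sketch of the adjusted declaration and loop; only the
volatile qualifier is new, the rest is as above:

	/* volatile forces a real load and store on every iteration. */
	static volatile uint64_t the_counter;

	for (int i = 0; i < rounds; i++)
		the_counter += v;

Anyway, here are results for (rounds == 2000000000):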

x counter_u64_add(), result == 2000000000 * ncpus
+ racy increment, result == 2022232241 (!!!)
+------------------------------------------------------------------------------+
|  x                                                                   +       |
|  x                                                                   +       |
|  x                                                                   ++      |
|  x                                                                   ++      |
|  x        x                                                         +++     +|
||_MA__|                                                             |_MA_|    |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10             5          5.44          5.01         5.053    0.13605963
+  10          8.16          8.53           8.2         8.235    0.10721215
Difference at 95.0% confidence
        3.182 +/- 0.115089
        62.9725% +/- 2.27764%
        (Student's t, pooled s = 0.122488)

So the racy increment is 63% slower than the new counter, not to mention
that in such a tight loop 98% of the parallel updates to the racy
counter are lost :)

A tight loop with atomic_add() is 22 times (twenty-two times) slower than
the new counter. I didn't bother to run ministat :)
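
For reference, the atomic variant's loop presumably looks like this; a
sketch using the stock atomic_add_64() from <machine/atomic.h>, names
as above:

	static uint64_t the_counter;	/* one counter shared by all CPUs */

	for (int i = 0; i < rounds; i++)
		atomic_add_64(&the_counter, v);

Every iteration is a locked read-modify-write on a cacheline that all
cores fight over, so the line ping-pongs between CPUs; that contention
is what makes it 22 times slower than the per-CPU counter.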

-- 
Totus tuus, Glebius.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: counter_API_extend.diff
Type: text/x-diff
Size: 7762 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20130402/17e518b3/attachment.diff>

