[CFR][CFT] counter(9): new API for faster and raceless counters
Luigi Rizzo
rizzo at iet.unipi.it
Tue Apr 2 14:56:04 UTC 2013
On Tue, Apr 02, 2013 at 06:17:17PM +0400, Gleb Smirnoff wrote:
> Luigi,
>
> On Tue, Apr 02, 2013 at 03:37:58AM +0200, Luigi Rizzo wrote:
> L> API:
...
> L> - there are several places which use multiple counters
> L> (e.g. packet and byte counters, global and per flow/socket),
> L> so i wonder if it may help to provide a "protected" version of
> L> counter_u64_add() that requires the critical_enter/exit
> L> only once. Something like
...
> Here is patch for review. It adds 4 more primitives:
>
> counter_enter();
> counter_u64_add_prot(c, x);
> counter_u64_subtract_prot(c, x);
> counter_exit();
thanks for the patch. I have three more comments:
- is the "subtract" version really necessary? Couldn't one just
  make "x" an int64_t? Or does that produce too many warnings
  at compile time, maybe?
- (this can be fixed later) in the i386 version, counter_enter()
  and counter_exit() have an if statement which may become quite
  expensive if mispredicted. Also, the test is repeated 3 times in
  counter_u64_add() (enter/add/exit). Hopefully the compiler
  optimizes out the extra checks, but the previous version seemed
  more readable. Anyway, at some point we should figure out
  whether putting likely/unlikely annotations on the result of
  (cpu_feature & CPUID_CX8) improves performance where it matters.
- do you plan to provide an API to initialize a counter to 0 or to a
  specific value? I suppose this is done implicitly on allocation,
  but there are cases (e.g. ipfw) where the application explicitly
  zeroes counters.
> L> BENCHMARK:
> L>
> L> > I've got a simple benchmark. A syscall that solely updates a counter is
> L> > implemented. Number of processes is spawned, equal to number of cores,
> L> > each process binds itself to a dedicated CPU and calls the syscall 10^6
> L> > times and exits. Parent wait(2)s for them all and I measure real time of
> L>
> L> - I am under the impression that these benchmarks are dominated
> L> by the syscall time, and the new counters would show much
> L> better relative performance (compared to racy or atomic)
> L> if each syscall did 10..100 counter ops. Any chance to retry
> L> the test in this configuration?
>
> Ok, as you wish.
>
> Apparently the compiler optimises away loops like:
>
> for (int i = 0; i < rounds; i++)
> the_counter += v;
>
> To avoid that optimisation I declared the_counter as volatile. Is the
> benchmark still fair after that? Anyway, here are results for (rounds == 2000000000):
>
> x counter_u64_add(), result == 2000000000 * ncpus
> + racy increment, result == 2022232241 (!!!)
> +------------------------------------------------------------------------------+
> | x + |
> | x + |
> | x ++ |
> | x ++ |
> | x x +++ +|
> ||_MA__| |_MA_| |
> +------------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 10 5 5.44 5.01 5.053 0.13605963
> + 10 8.16 8.53 8.2 8.235 0.10721215
> Difference at 95.0% confidence
> 3.182 +/- 0.115089
> 62.9725% +/- 2.27764%
> (Student's t, pooled s = 0.122488)
>
> So a 63% speedup, not to mention the fact that in such a tight loop 98% of
> parallel updates are lost on the racy counter :)
>
> A tight loop with atomic_add() is 22 times (twenty-two times) slower than
> the new counter. I didn't bother to run ministat :)
yes, i think this really does justice to the improvements in the new code
(i am a bit unclear on what actual test you ran / how many counter_u64_add()
calls per syscall you have, but i do expect the racy counters to be much slower
and much less reliable, and the 20x slowdown with atomics is completely
expected.)
cheers
luigi
More information about the freebsd-arch mailing list