[CFR][CFT] counter(9): new API for faster and raceless counters
Luigi Rizzo
rizzo at iet.unipi.it
Tue Apr 2 14:56:04 UTC 2013
On Tue, Apr 02, 2013 at 06:17:17PM +0400, Gleb Smirnoff wrote:
> Luigi,
>
> On Tue, Apr 02, 2013 at 03:37:58AM +0200, Luigi Rizzo wrote:
> L> API:
...
> L> - there are several places which use multiple counters
> L> (e.g. packet and byte counters, global and per flow/socket),
> L> so i wonder if it may help to provide a "protected" version of
> L> counter_u64_add() that requires the critical_enter/exit
> L> only once. Something like
...
> Here is patch for review. It adds 4 more primitives:
>
> counter_enter();
> counter_u64_add_prot(c, x);
> counter_u64_subtract_prot(c, x);
> counter_exit();
thanks for the patch. I have three more comments:
- is the "subtract" version really necessary? Couldn't one just
  make "x" an int64_t? Or does that produce too many warnings
  at compile time, maybe?
- (this can be fixed later) in the i386 version, counter_enter()
  and counter_exit() have an if statement which may become quite
  expensive if mispredicted. Also, the test is repeated 3 times in
  counter_u64_add() (enter/add/exit). Hopefully the compiler
  optimizes out the extra checks, but the previous version seemed
  more readable. Anyway, at some point we should figure out
  whether putting likely/unlikely annotations on the result of
  (cpu_feature & CPUID_CX8) improves performance where it matters.
- do you plan to provide an API to initialize a counter to 0 or to a
  specific value? I suppose this is done implicitly on allocation,
  but there are cases (e.g. ipfw) where the application explicitly
  zeroes counters.
> L> BENCHMARK:
> L>
> L> > I've got a simple benchmark. A syscall that solely updates a counter is
> L> > implemented. Number of processes is spawned, equal to number of cores,
> L> > each process binds itself to a dedicated CPU and calls the syscall 10^6
> L> > times and exits. Parent wait(2)s for them all and I measure real time of
> L>
> L> - I am under the impression that these benchmarks are dominated
> L> by the syscall time, and the new counters would show much
> L> better relative performance (compared to racy or atomic)
> L> if each syscall did 10..100 counter ops. Any chance to retry
> L> the test in this configuration?
>
> Ok, as you wish.
>
> Apparently the compiler optimises away loops like:
>
> for (int i = 0; i < rounds; i++)
> the_counter += v;
>
> To avoid that optimisation I declared the_counter as volatile. Is the
> benchmark still fair after that? Anyway, here are results for (rounds == 2000000000):
>
> x counter_u64_add(), result == 2000000000 * ncpus
> + racy increment, result == 2022232241 (!!!)
> +------------------------------------------------------------------------------+
> | x + |
> | x + |
> | x ++ |
> | x ++ |
> | x x +++ +|
> ||_MA__| |_MA_| |
> +------------------------------------------------------------------------------+
> N Min Max Median Avg Stddev
> x 10 5 5.44 5.01 5.053 0.13605963
> + 10 8.16 8.53 8.2 8.235 0.10721215
> Difference at 95.0% confidence
> 3.182 +/- 0.115089
> 62.9725% +/- 2.27764%
> (Student's t, pooled s = 0.122488)
>
> So a 63% speedup, not to mention the fact that in such a tight loop 98% of
> parallel updates are lost on the racy counter :)
>
> A tight loop with atomic_add() is 22 times (twenty-two times) slower than
> the new counter. I didn't bother to run ministat :)
yes, i think this really does justice to the improvements in the new code
(i am a bit unclear on what actual test you ran / how many counter_u64_add()
calls per syscall you have, but i do expect the racy counters to be much slower
and much less reliable, and the 20x slowdown with atomics is completely
expected.)
cheers
luigi
More information about the freebsd-arch mailing list