[CFR][CFT] counter(9): new API for faster and raceless counters

Gleb Smirnoff glebius at FreeBSD.org
Mon Apr 1 11:51:30 UTC 2013


  Hi!

  Together with Konstantin Belousov (kib@) we have developed a new API that is
initially intended for (but not limited to) collecting statistical data in
the kernel.

  Right now, a statistical counter in the kernel that is not protected by a
lock is updated either with the basic '+=' operator or via an atomic(9)
operation. The first case is prone to data loss due to simultaneous updates by
more than one CPU. The second case doesn't lose data, but it locks the memory
bus to establish the atomic operation, effectively blocking other CPUs. Both
cases invalidate the cache line associated with the address being updated,
thus provoking a cache miss when another CPU next updates the counter.

The problem with cache misses shows up when multiple threads running on
multiple CPUs mostly access their own private data, and a counter is the only
write-shared data between them. An example of such a situation is packet
forwarding, where struct ipstat and the interface byte/packet counters are the
data that experience excessive cache misses. This was found by Alexander
Chernikov (melifaro@) and Andrey Elsukov (ae@) at Yandex. They report that
removing the stat updates raises packet forwarding benchmark results
significantly.

  The decision is to make statistical counters private to CPUs. To achieve
that, the following steps were taken:

o uma(9) got support for UMA_ZONE_PCPU zones. A slab in a UMA_ZONE_PCPU zone
  has mp_ncpus slabs following it in virtual memory, so that each CPU has its
  own slab. Allocation from such a zone gives the caller the base allocation
  address, and the private address for a CPU can be calculated as base address
  + curcpu * slab size. The slab size in UMA_ZONE_PCPU is sizeof(struct pcpu).
  This sizing is required for the optimisation described below.

  Access to the private copy is provided by the tiny inline zpcpu_get()
  (sketched below).

  Note that UMA_ZONE_PCPU zones aren't dedicated to counter(9); they can be
  utilized for other purposes. You may use them for any kind of dynamic per-CPU
  data you need.
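
  For illustration, the address calculation above boils down to roughly this
  (a sketch only; the actual zpcpu_get() in the branch may differ in detail):

     /* Return the current CPU's private copy of a per-CPU allocation. */
     static inline void *
     zpcpu_get(void *base)
     {
             return ((char *)base + sizeof(struct pcpu) * curcpu);
     }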

o Tiny API for counter(9):

     counter_u64_t
     counter_u64_alloc(int wait);

     void
     counter_u64_free(counter_u64_t cnt);

     void
     counter_u64_add(counter_u64_t cnt, uint64_t inc);

     uint64_t
     counter_u64_fetch(counter_u64_t cnt);
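
  A minimal usage sketch (the example names are hypothetical; it assumes the
  wait argument takes the usual M_WAITOK/M_NOWAIT malloc flags):

     /* Hypothetical example: counting received packets. */
     static counter_u64_t pkts_in;

     static void
     example_init(void)
     {
             pkts_in = counter_u64_alloc(M_WAITOK);
     }

     static void
     example_rx(void)
     {
             counter_u64_add(pkts_in, 1);       /* per-CPU, no bus lock */
     }

     static uint64_t
     example_total(void)
     {
             return (counter_u64_fetch(pkts_in));       /* sums all CPUs */
     }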

o MI implementation of counter_u64_add() is:

     /* No preemption: curcpu is stable and the update isn't interleaved. */
     critical_enter();
     *(uint64_t *)zpcpu_get(c) += inc;
     critical_exit();

o Since the distance between members has this special sizing, on 64-bit
  architectures that allow base+offset addressing, namely amd64, it is possible
  to perform the update in a single lockless instruction:

    __asm __volatile("addq\t%1,%%gs:(%0)"
        :
        : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc)
        : "memory", "cc");

  Here we exploit the fact that %gs always points to the current CPU's static
  per-CPU data, and that the alignment of the static per-CPU structures matches
  the alignment of slabs in a UMA_ZONE_PCPU zone. The author of this idea is
  Konstantin (kib@).
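
  To spell out the address arithmetic (assuming, as described above, that the
  %gs base is the current CPU's entry in the static __pcpu[] array):

      %gs base + ((char *)c - (char *)&__pcpu[0])
    = &__pcpu[curcpu] + ((char *)c - (char *)&__pcpu[0])
    = (char *)c + curcpu * sizeof(struct pcpu)

  which is exactly the current CPU's private slot, i.e. the same address that
  zpcpu_get(c) computes in the MI case.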


I've got a simple benchmark: a syscall that solely updates a counter. A number
of processes equal to the number of cores is spawned; each process binds itself
to a dedicated CPU, calls the syscall 10^6 times and exits. The parent wait(2)s
for them all, and I measure the real time of the parent (a sketch of the driver
follows the first set of results). Here are the results for my 4-core Xeon
E5620:

x new counter(9), resulting counter == mp_ncpus * 10^6
+ racy '+=', resulting counter ~25% less than expected
* atomic(9), resulting counter == mp_ncpus * 10^6
+------------------------------------------------------------------------------+
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|x   +                                                                        *|
|A   A                                                                        A|
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         17.65         17.68         17.66        17.665  0.0097182532
+  10         18.53         18.56         18.55        18.543   0.011595018
Difference at 95.0% confidence
        0.878 +/- 0.0100517
        4.97028% +/- 0.0569016%
        (Student's t, pooled s = 0.0106979)
*  10         34.15          34.2         34.16        34.171   0.020789955
Difference at 95.0% confidence
        16.506 +/- 0.0152473
        93.439% +/- 0.0863138%
        (Student's t, pooled s = 0.0162275)

So, we got a 5% speedup compared to the racy counter and a 93% speedup
compared to atomic(9).
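
For reference, the userland driver described above might look roughly like
this (a sketch, not the actual benchmark code; SYS_counter_test is a
hypothetical syscall number for a test syscall that just does a single
counter update):

#include <sys/types.h>
#include <sys/cpuset.h>
#include <sys/sysctl.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <unistd.h>

#define SYS_counter_test        600     /* hypothetical syscall number */

int
main(void)
{
        cpuset_t mask;
        size_t len;
        int i, j, ncpu;

        len = sizeof(ncpu);
        if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) != 0)
                exit(1);
        for (i = 0; i < ncpu; i++) {
                if (fork() == 0) {
                        /* Bind this child to its dedicated CPU. */
                        CPU_ZERO(&mask);
                        CPU_SET(i, &mask);
                        cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID,
                            -1, sizeof(mask), &mask);
                        /* Invoke the counter-updating syscall 10^6 times. */
                        for (j = 0; j < 1000000; j++)
                                syscall(SYS_counter_test);
                        _exit(0);
                }
        }
        /* Parent waits for all children; its real time is what is measured. */
        while (wait(NULL) > 0)
                ;
        return (0);
}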

If binding of processes to CPUs is omitted, results are:
x new counter(9), resulting counter == mp_ncpus * 10^6
+ racy '+=', resulting counter ~20% less than expected
* atomic(9), resulting counter == mp_ncpus * 10^6
+------------------------------------------------------------------------------+
|                 x                 +                                       *  |
|   x           x x   +      +  + + + +                               **    *  |
|xxxx   x +x xxxxxx  *+    x +  +++++ +     +                        *** *  * *|
|     |______A__M___|   |_______A_M____|                              |__A__|  |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  20         19.04         23.94         21.83       21.3555      1.372836
+  20         20.69         27.33         25.44        24.902     1.4900957
Difference at 95.0% confidence
        3.5465 +/- 0.916971
        16.607% +/- 4.29384%
        (Student's t, pooled s = 1.43267)
*  10         31.97         33.77         32.85        32.819    0.63079227
Difference at 95.0% confidence
        11.4635 +/- 0.940783
        53.6794% +/- 4.40534%
        (Student's t, pooled s = 1.18608)

So, we got a 16.6% speedup compared to the racy counter and a 53.7% speedup
compared to atomic(9). (Note that the dispersion of the results is much higher
when binding is omitted. Imperfect scheduling in ULE?)

Results may vary depending on the number of CPUs and the cache architecture,
not to mention the actual usage pattern of counter(9).

  Andrey (ae@) benchmarked the RX side of the networking stack with the IP and
TCP statistics converted to counter(9). He reports that although the maximum
pps didn't increase measurably, CPU utilisation dropped from 94% to 45%.
Probably something else in the stack prevents receiving more pps, despite the
additional CPU cycles now available.


The code is in a subversion branch:

  http://svn.freebsd.org/base/projects/counters/

If there are no objections this week, I'd like to merge the code to head.

-- 
Totus tuus, Glebius.

