network statistics in SMP

Wed Dec 16 18:19:50 UTC 2009

On Tue, 15 Dec 2009, John Baldwin wrote:

> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
>> On Tue, 15 Dec 2009, John Baldwin wrote:
>>
>> JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
>> JB>> Hi all,
>> JB>>
>> JB>> I'm working on our network statistics (in the context of SNMP) and wonder,
>> JB>> to what extent we want them to be correct. I've re-read part of the past
>> JB>> discussions about 64-bit counters on 32-bit archs and got the impression,
>> JB>> that there are users that would like to have almost correct statistics
>> JB>> (for accounting, for example). If this is the case I wonder whether the
>> JB>> way we do the statistics today is correct.
>> JB>>
>> JB>> Basically all statistics are incremented or added to simply by a += b or
>> JB>> a++. As I understand, this worked fine in the old days, where you had
>> JB>> spl*() calls at the right places. Nowadays when everything is SMP
>> JB>> shouldn't we use at least atomic operations for this? Also I read that on
>> JB>> architectures where cache coherency is not implemented in hardware even
>> JB>> this does not help (I found a mail from jhb why for the mutex
>> JB>> implementation this is not a problem, but I don't understand what to do
>> JB>> for the += and ++ operations). I failed to find a way, though, to
>> JB>> influence the caching policy (is there a function one can call to
>> JB>> change the policy?).
>> JB>
>> JB>Atomic ops will always work for reliable statistics.  However, I believe
>> JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to
>> JB>what we do now for many of the 'cnt' stats (context switches, etc.).  For
>> JB>'cnt' each CPU has its own count of stats that are updated using non-atomic
>> JB>ops (since they are CPU local).  sysctl handlers then sum up the various per-
>> JB>CPU counts to report global counts to userland.

I don't like the bloat from this, but don't see anything better.  Julian
said in another reply that there are even more complications for VIMAGE.
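
For reference, the per-CPU scheme John describes can be sketched roughly
as below.  This is a userland illustration only: NCPU, struct pcpu_stats,
stat_inc() and stat_sum() are names made up for the sketch, not the
kernel's PCPU_*() machinery, and a real kernel would have to stop the
thread migrating between picking the slot and incrementing it.

```c
#include <stdint.h>

#define NCPU 4	/* hypothetical fixed CPU count for this sketch */

struct pcpu_stats {
	uint32_t ipackets;	/* written only by the owning CPU */
};

static struct pcpu_stats pcpu[NCPU];

/* Plain, non-atomic increment: fine only because each CPU has its own slot. */
static void
stat_inc(int cpu)
{
	pcpu[cpu].ipackets++;
}

/* Roughly what a sysctl handler would do: add up the per-CPU slots. */
static uint64_t
stat_sum(void)
{
	uint64_t total = 0;

	for (int i = 0; i < NCPU; i++)
		total += pcpu[i].ipackets;
	return (total);
}
```

The sum is only a snapshot; other CPUs may keep incrementing their slots
while the loop runs.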

>> I see. I was also thinking along these lines, but was not sure whether it
>> is worth the trouble. I suppose this does not help to implement 64-bit
>> counters on 32-bit architectures, though, because you cannot read them
>> reliably without locking to sum them up, right?
>
> Either that or you just accept that you have a small race since it is only stats. :)

Actually, you can do better with a generation count.  The generation count
would at least tell you if you lost a race.  It should be maintained only
while other counts are being summed.  It must be global and incremented
by atomic ops (to avoid the races without even more costly locking, which
would make the generation count irrelevant), so maintaining it all the
time would more than defeat the point of having per-CPU counters (all CPUs
would compete for it at the same address).  Probably not worth it for
statistics, except that if userland had control over it, then userland
could decide the policy.

Actually2, this solves your original problem, provided the races are
so rarely lost that looping to recover from them works: once counters
are per-CPU, they can be 64 bits with no complications until they are
summed.  Detection of lost races is essential for summing them on
32-bit systems, unlike for 32-bit counters, since a lost race at the
point where the low 32 bits wrap around may give an error of 2**32
in the sum, while a lost race for a 32-bit counter only makes the sum
a bit too small (unless the 32-bit counter wrapped).
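
To make the 2**32 error concrete, here is a small sketch of a torn read.
It assumes the reader fetches the low half before the writer's update and
the high half after it; torn_read() is a hypothetical helper for
illustration, not anything in the tree.

```c
#include <stdint.h>

/*
 * Model a 64-bit counter read as two 32-bit loads on a 32-bit machine.
 * 'before' is the counter value when the low half was loaded, 'after'
 * is its value when the high half was loaded.
 */
static uint64_t
torn_read(uint64_t before, uint64_t after)
{
	uint32_t lo = (uint32_t)before;		/* low half, read first */
	uint32_t hi = (uint32_t)(after >> 32);	/* high half, read later */

	return (((uint64_t)hi << 32) | lo);
}
```

With before = 0xFFFFFFFF and after = 0x100000000 (a single increment in
between), the reader sees 0x1FFFFFFFF, which is too large by 2**32 - 1.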

Simple version:
- bloat PCPU_INC(var) to do something like the following:
 	if (PCPU_GET(counter_summing_mode))
 		atomic_add_int(&counter_gen, 1);
 	OLD_PCPU_INC(var);
- set PCPU_GET(counter_summing_mode) while summing.  Needs heavyweight
   synchronization (IPIs?) to set and clear the flag on other CPUs.  Must
   also make all other CPUs flush pending writes (so that a 64-bit counter
   cannot be half-written at the beginning of the summing), but this will
   happen automatically with any heavyweight synchronization.
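
A rough userland sketch of the simple version, using C11 atomics in place
of the kernel's atomic_add_int().  Assumptions: the summing-mode flag is a
single global here instead of a per-CPU flag set via IPIs, NCPU and the
function names are invented for the sketch, and the ordering between the
generation bump and the counter store is glossed over.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU 4				/* hypothetical CPU count */

static uint64_t counters[NCPU];		/* one 64-bit counter per CPU */
static atomic_bool summing_mode;	/* per-CPU in the proposal; global here */
static atomic_uint counter_gen;		/* bumped only while summing */

/* The "bloated" PCPU_INC(): pay for the atomic op only in summing mode. */
static void
counter_inc(int cpu)
{
	if (atomic_load(&summing_mode))
		atomic_fetch_add(&counter_gen, 1);
	counters[cpu]++;
}

/* Sum all counters, retrying whenever an increment raced with the sum. */
static uint64_t
counter_sum(void)
{
	uint64_t total;
	unsigned int gen;

	atomic_store(&summing_mode, true);
	do {
		gen = atomic_load(&counter_gen);
		total = 0;
		for (int i = 0; i < NCPU; i++)
			total += counters[i];
	} while (gen != atomic_load(&counter_gen));
	atomic_store(&summing_mode, false);
	return (total);
}
```

If increments arrive faster than one pass of the loop, the retry never
terminates, so a real version would need a bound on the retries.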

Unsimple versions: to avoid bloating PCPU_INC(), write-protect all
counters while summing, and count generations in the trap handler ...

However, I prefer summing 32-bit counters (with heuristics to detect
wraparound) to a 64-bit sum, like I think you already do for SNMP.

Wraparound heuristics may still be useful with the generation count:
suppose the generation count increases faster than you can sum; then
looping to get a coherent sum doesn't work, and wraparound must be
ruled out or fixed up in another way; the 32-bit wraparound heuristic
works perfectly since we can guarantee to sum faster than a 32-bit
counter can wrap twice.
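
One common form of such a heuristic is the modular difference between
successive samples, sketched below.  This is not necessarily what the
SNMP code does; struct wrap_state and wrap_update() are hypothetical
names, and the result is only right if the counter advances by less than
2**32 between samples.

```c
#include <stdint.h>

struct wrap_state {
	uint32_t last;	/* previous raw 32-bit sample */
	uint64_t total;	/* value extended to 64 bits */
};

/* Fold a new 32-bit sample into the 64-bit running total. */
static uint64_t
wrap_update(struct wrap_state *ws, uint32_t sample)
{
	/* Subtraction modulo 2**32 stays correct across a single wrap. */
	ws->total += (uint32_t)(sample - ws->last);
	ws->last = sample;
	return (ws->total);
}
```

For example, starting from last = total = 0xFFFFFFF0, a new sample of
0x10 gives a delta of 0x20 and a total of 0x100000010.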

Bruce