network statistics in SMP

Sun Dec 20 12:13:47 UTC 2009

On Sat, 19 Dec 2009, Harti Brandt wrote:

> To be honest, I'm lost now. Couldn't we just use the largest atomic type for 
> the given platform and atomic_inc/atomic_add/atomic_fetch and handle the 
> 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel thread?
>
> Are the 5-6 atomic operations really that costly given the many operations 
> done on an IP packet? Are they more costly than a heavyweight sync for each 
> ++ or +=?

Frequent writes to the same cache line across multiple cores are remarkably 
expensive, as they trigger the cache coherency protocol (mileage may vary). 
For example, a single non-atomically incremented counter cut the performance 
of gettimeofday() to 1/6th on an 8-core system when parallel system 
calls were made across all cores.  On many current systems, the cost of an 
"atomic" operation is now fairly reasonable as long as the cache line is held 
exclusively by the current CPU.  However, if we can avoid them that has value, 
as we update quite a few global stats on the way through the network stack.

> Or we could use the PCPU stuff, use just ++ and += for modifying the 
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a 
> kernel thread per CPU (do we have this?). Between that thread and the sysctl 
> we could use a heavy sync.

The current short-term plan is to do this but without a syncer thread: 
we'll just aggregate the results when they need to be reported, in the sysctl 
path.  How best to scale to 64-bit counters is an interesting question, but 
one we can address after per-CPU stats are in place, which addresses an 
immediate performance (rather than statistics accuracy) concern.
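
As a rough illustration of that plan, here is a minimal userspace sketch in C, 
not the actual kernel code.  The names (pcpu_counter, stat_inc, stat_add, 
stat_read), the CPU count, and the 64-byte cache line size are all assumptions 
made for the example, and the CPU index is passed in explicitly where a kernel 
would use its own per-CPU mechanism.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU        8   /* assumed CPU count for this sketch */
#define CACHE_LINE  64  /* assumed cache line size in bytes */

/*
 * One 32-bit counter per CPU, padded to a full cache line so that
 * updates from different CPUs never touch the same line.
 */
struct pcpu_counter {
    uint32_t value;
    char     pad[CACHE_LINE - sizeof(uint32_t)];
};

static struct pcpu_counter ips_total[NCPU];  /* hypothetical statistic */

/* Fast path: plain, non-atomic update of the calling CPU's own slot. */
static void stat_inc(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    ips_total[cpu].value++;
}

static void stat_add(int cpu, uint32_t n)
{
    assert(cpu >= 0 && cpu < NCPU);
    ips_total[cpu].value += n;
}

/*
 * Slow path (the reporting side, e.g. a sysctl handler): sum all
 * per-CPU slots into a 64-bit total.  No syncer thread is involved;
 * the result is a snapshot that may be slightly stale.
 */
static uint64_t stat_read(void)
{
    uint64_t sum = 0;

    for (int i = 0; i < NCPU; i++)
        sum += ips_total[i].value;
    return sum;
}
```

Each per-CPU slot here is still 32 bits wide and can wrap on its own, so this 
only addresses the cache line contention, not the 64-bit question.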

> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a 
> machine and do routing at link speed, though. This might overflow the IP 
> input/output byte counter (which we don't have yet) too fast.

For byte counters, assuming one 10gbps stream, a 32-bit counter wraps in about 
three seconds.  Systems processing 40gbps are now quite realistic, 
although typically workloads of that sort will be distributed over 16+ cores 
and use multiple 10gbps NICs.
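
The wrap time is just the counter range in bytes divided by the byte rate, 
which a small C sketch can check.  The helper name wrap_seconds is made up for 
this example, and it assumes the link is fully saturated and every byte is 
counted.

```c
#include <stdint.h>

/*
 * Seconds until a byte counter of the given width wraps, assuming the
 * link runs flat out at link_bps bits per second.
 */
static double wrap_seconds(unsigned counter_bits, double link_bps)
{
    double range = 1.0;                   /* becomes 2^counter_bits */

    for (unsigned i = 0; i < counter_bits; i++)
        range *= 2.0;
    return range / (link_bps / 8.0);      /* bytes / (bytes per second) */
}
```

With these assumptions, wrap_seconds(32, 10e9) is roughly 3.4 seconds, and at 
an aggregate 40gbps the same 32-bit counter would wrap in under a second.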

My thinking is that we get the switch to per-CPU stats done in 9.x in the next 
month sometime, and also get it merged to 8.x a month or so later (I merged 
the wrapper macros necessary to do that before 8.0 but didn't have time to 
fully evaluate the performance implications of the implementation switch).

There are two known problem areas here:

(1) The cross-product issue with virtual network stacks
(2) The cross-product issue with network interfaces for per-interface stats

I propose to ignore (1) for now by simply having only vnet0 use per-CPU stats, 
and other vnets use single-instance per-vnet stats.  We can solve the larger 
problem there at a future date.

I don't have a good proposal for (2) -- the answer may be using DPCPU memory, 
but that will require us to support more dynamic allocation ranges, which may 
add cost.  (Right now, the DPCPU allocator relies on relatively static 
allocations over time.)  This means that, for now, we may also ignore that 
issue and leave interface counters as-is.  This is probably a good idea 
because we also need to deal with multi-queue interfaces better, and perhaps 
the stats should be per-queue rather than per-ifnet, which may itself help 
address the cache line issue.

Robert N M Watson
Computer Laboratory
University of Cambridge