network statistics in SMP
rwatson at FreeBSD.org
Sun Dec 20 12:13:47 UTC 2009
On Sat, 19 Dec 2009, Harti Brandt wrote:
> To be honest, I'm lost now. Couldn't we just use the largest atomic type for
> the given platform and atomic_inc/atomic_add/atomic_fetch and handle the
> 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel thread?
> Are the 5-6 atomic operations really that costly given the many operations
> done on an IP packet? Are they more costly than a heavyweight sync for each
> ++ or +=?
Frequent writes to the same cache line across multiple cores are remarkably
expensive, as they trigger the cache coherency protocol (mileage may vary).
For example, a single non-atomically incremented counter cut the performance of
gettimeofday() to 1/6th on an 8-core system when parallel system calls were
made across all cores. On many current systems, the cost of an
"atomic" operation is now fairly reasonable as long as the cache line is held
exclusively by the current CPU. However, avoiding them still has value,
as we update quite a few global stats on the way through the network stack.
> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?). Between that thread and the sysctl
> we could use a heavy sync.
The current short-term plan is to do this, but without a syncer thread:
we'll just aggregate the results when they need to be reported, in the sysctl
path. How best to scale to 64-bit counters is an interesting question, but
one we can address after per-CPU stats are in place, which address an
immediate performance (rather than statistics accuracy) concern.
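
A minimal sketch of the aggregate-on-read idea, written as userland C rather
than kernel code. NCPU, CACHE_LINE, struct pcpu_stat, stat_inc() and
stat_fetch() are hypothetical names chosen for illustration and are not the
actual wrapper macros; a real kernel version would also have to make sure the
updater stays on its own CPU while it increments.

```c
#include <stdint.h>

#define NCPU        8   /* assumed number of CPUs, illustration only */
#define CACHE_LINE  64  /* assumed cache line size in bytes */

/* One slot per CPU, padded so that two CPUs never share a cache line. */
struct pcpu_stat {
	uint32_t packets;
	char     pad[CACHE_LINE - sizeof(uint32_t)];
};

static struct pcpu_stat ipstat[NCPU];

/* Fast path: plain ++ on the caller's own slot, no atomic operation. */
static void
stat_inc(int cpu)
{
	ipstat[cpu].packets++;
}

/*
 * Slow path (for example a sysctl handler): sum every slot into a wider
 * total. The result is a snapshot and may be slightly stale.
 */
static uint64_t
stat_fetch(void)
{
	uint64_t total = 0;

	for (int i = 0; i < NCPU; i++)
		total += ipstat[i].packets;
	return (total);
}
```

The cost moves from the per-packet path to the reporting path, which runs
far less often.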
> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> machine and do routing at link speed, though. This might overflow the IP
> input/output byte counter (which we don't have yet) too fast.
For byte counters, assuming one 10gbps stream, a 32-bit counter wraps in about
three seconds. Systems processing 40gbps are now quite realistic,
although workloads of that sort will typically be distributed over 16+ cores
and use multiple 10gbps NICs.
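
The wrap time above, and one possible way of widening a 32-bit counter to 64
bits by periodic sampling, can be sketched as follows. wrap_seconds(),
struct wide_stat and wide_update() are hypothetical names, and the sampling
approach is only an illustration of the general 32->64 technique, not a
description of any existing implementation. It is only correct if the sampler
runs more often than the counter can wrap.

```c
#include <stdint.h>

/* Seconds until a 32-bit byte counter wraps at the given line rate. */
static double
wrap_seconds(double bits_per_sec)
{
	/* 2^32 bytes * 8 bits per byte, divided by the rate in bits/s. */
	return (4294967296.0 * 8.0 / bits_per_sec);
}

/* 64-bit running total plus the last 32-bit sample that was seen. */
struct wide_stat {
	uint64_t total;
	uint32_t last;
};

/*
 * Fold a fresh 32-bit sample into the 64-bit total. The subtraction is
 * done modulo 2^32, so the delta is still right after a single wrap.
 */
static void
wide_update(struct wide_stat *w, uint32_t now)
{
	w->total += (uint32_t)(now - w->last);
	w->last = now;
}
```

At 10gbps wrap_seconds() gives roughly 3.4 seconds, and at 40gbps under one
second, so a sampler of this kind would need to run several times a second.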
My thinking is that we get the switch to per-CPU stats done in 9.x sometime in
the next month, and also get it merged to 8.x a month or so later (I merged
the wrapper macros necessary to do that before 8.0 but didn't have time to
fully evaluate the performance implications of the implementation switch).
There are two known problem areas here:
(1) The cross-product issue with virtual network stacks
(2) The cross-product issue with network interfaces for per-interface stats
I propose to ignore (1) for now by simply having only vnet0 use per-CPU stats,
and other vnets use single-instance per-vnet stats. We can solve the larger
problem there at a future date.
I don't have a good proposal for (2) -- the answer may be using DPCPU memory,
but that will require us to support more dynamic allocation ranges, which may
add cost. (Right now, the DPCPU allocator relies on relatively static
allocations over time.) This means that, for now, we may also ignore that
issue and leave interface counters as-is. This is probably a good idea
because we also need to deal with multi-queue interfaces better, and perhaps
the stats should be per-queue rather than per-ifnet, which may itself help
address the cache line issue.
Robert N M Watson
University of Cambridge