network statistics in SMP

Harti Brandt hartmut.brandt at dlr.de
Sun Dec 20 13:19:40 UTC 2009


On Sun, 20 Dec 2009, Robert Watson wrote:

RW>
RW>On Sat, 19 Dec 2009, Harti Brandt wrote:
RW>
RW>> To be honest, I'm lost now. Couldn't we just use the largest atomic type
RW>> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
RW>> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
RW>> thread?
RW>> 
RW>> Are the 5-6 atomic operations really that costly given the many operations
RW>> done on an IP packet? Are they more costly than a heavyweight sync for each
RW>> ++ or +=?
RW>
RW>Frequent writes to the same cache line across multiple cores are remarkably
RW>expensive, as they trigger the cache coherency protocol (mileage may vary).
RW>For example, a single non-atomically incremented counter cut gettimeofday()
RW>to 1/6th of its performance on an 8-core system when parallel system
RW>calls were made across all cores.  On many current systems, the cost of an
RW>"atomic" operation is now fairly reasonable as long as the cache line is held
RW>exclusively by the current CPU.  However, if we can avoid them that has
RW>value, as we update quite a few global stats on the way through the network
RW>stack.

Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP 
packet. I would expect a single increment to be a good percentage of the 
entire processing (in terms of number of operations) for gettimeofday(), 
while in IP forwarding it is somewhere in the noise floor. In the simplest 
case the packet is acted upon by the receiving driver, the IP input 
function, the IP output function and the sending driver, not to mention IP 
filters, firewalls, tunnels, dummynet and whatever else. The relative cost 
of the increment should be much lower. But I may be wrong, of course.
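
To make sure we are talking about the same thing, this is roughly the 
comparison as I understand it (only a sketch, all names invented, the 
usual kernel headers assumed):

  /* (a) one global counter: every CPU writes the same cache line and
   * pays for the atomic read-modify-write. */
  static u_long ips_forwarded_shared;
  #define IPSTAT_INC_SHARED()  atomic_add_long(&ips_forwarded_shared, 1)

  /* (b) one slot per CPU, padded to a cache line: no atomic needed,
   * since (ignoring migration) only the owning CPU writes its slot. */
  struct ipstat_pcpu {
          u_long          ips_forwarded;
  } __aligned(CACHE_LINE_SIZE);
  static struct ipstat_pcpu ipstat_pcpu[MAXCPU];
  #define IPSTAT_INC_PCPU()    (ipstat_pcpu[curcpu].ips_forwarded++)

Variant (b) avoids both the atomic and the cache-line ping-pong, at the 
price of having to sum the slots whenever somebody reads the counter.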

RW>
RW>> Or we could use the PCPU stuff, use just ++ and += for modifying the
RW>> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
RW>> kernel thread per CPU (do we have this?). Between that thread and the
RW>> sysctl we could use a heavy sync.
RW>
RW>The current short-term plan is to do this but without a syncer thread:
RW>we'll just aggregate the results when they need to be reported, in the sysctl
RW>path.  How best to scale to 64-bit counters is an interesting question, but
RW>one we can address after per-CPU stats are in place, which address an
RW>immediate performance (rather than statistics accuracy) concern.
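
If that is the plan, then the summing (and whatever 32->64 bit handling 
we end up with) would live in the sysctl handler, roughly like this 
sketch (hypothetical names, building on the per-CPU slots above):

  static int
  sysctl_ips_forwarded(SYSCTL_HANDLER_ARGS)
  {
          uint64_t total;
          int cpu;

          /* Fold the per-CPU slots together only when userland asks. */
          total = 0;
          for (cpu = 0; cpu <= mp_maxid; cpu++)
                  total += ipstat_pcpu[cpu].ips_forwarded;
          return (SYSCTL_OUT(req, &total, sizeof(total)));
  }

No syncer thread and no locking on the fast path; the reader just gets a 
snapshot that may be slightly inconsistent across CPUs, which for SNMP 
is perfectly fine.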

Well, the user side of our statistics is in very bad shape and I have 
problems handling it in the SNMP daemon. Just a few examples:

interface statistics:
  - they use u_long, so are either 32-bit or 64-bit depending on the 
    platform
  - a number of required statistics are missing
  - send drops live somewhere else and are an 'int'
  - statistics are embedded into struct ifnet (bad for ABI stability) and
    not versioned
  - accessed together with other unrelated information via sysctl()

IPv4 statistics:
  - also u_long (hence different size on the platforms)
  - a lot of fields required by SNMP are missing
  - not versioned
  - accessed via sysctl()
  - per interface statistics totally missing

IPv6 statistics:
  - u_quad_t! so they are subject to race conditions on 32-bit platforms 
    and possibly on 64-bit platforms, too
  - a lot of fields required by SNMP are missing
  - not versioned
  - accessed via sysctl(); per interface statistics via ioctl()

Ethernet statistics:
  - u_long
  - some fields missing
  - implemented in only 3! drivers; some drivers use the corresponding 
    field for something else
  - not versioned

I think TCP and UDP statistics are in equally bad shape.

I would really like to sort that out before any kind of ABI freeze 
happens. Ideally all statistics would be accessible via sysctl(), carry a 
version number and include all or most of the required counters, with a 
simple way to add new fields without breaking anything. Also, the field 
sizes (64 vs. 32 bit) should be right at the kernel/user interface.
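
To make it concrete what I mean by versioned and correctly sized, the 
exported structure could look roughly like this (just a sketch, invented 
names, loosely following the IP-MIB counters):

  struct ip_stats_v1 {
          uint32_t        ips_version;    /* bumped when the layout changes */
          uint32_t        ips_len;        /* sizeof() as filled in by the kernel */
          uint64_t        ips_inreceives;
          uint64_t        ips_inhdrerrors;
          uint64_t        ips_outrequests;
          /* new fields are only ever appended; userland checks ips_len */
  };

The kernel fills this in from whatever per-CPU or per-vnet representation 
it uses internally and hands it out via sysctl(); userland checks 
ips_version and ips_len and simply ignores fields it does not know about.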

My current feeling after reading this thread is that the low-level kernel 
side is probably beyond what I can do in the time I have and would 
sidetrack me too far from the work on bsnmp. What I would like to do is 
fix the kernel/user interface and let the people who know how to do it 
handle the low-level side.

I would really not like to have to deal with a changing user/kernel 
interface in CURRENT if the kernel work happens in several steps.

RW>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
RW>> machine and do routing at link speed, though. This might overflow the IP
RW>> input/output byte counter (which we don't have yet) too fast.
RW>
RW>For byte counters, assuming one 10gbps stream, a 32-bit counter wraps in
RW>about three seconds.  Systems processing 40gbps are now quite
RW>realistic, although typically workloads of that sort will be distributed over
RW>16+ cores and using multiple 10gbps NICs.
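
Just to put a number on that:

    2^32 bytes / (10 Gbit/s / 8) = 4294967296 / 1.25e9 bytes/s ~= 3.4 s

so three seconds is about right, and at 40 Gbit/s through one box a 
32-bit byte counter wraps in under a second.
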
RW>
RW>My thinking is that we get the switch to per-CPU stats done in 9.x in the
RW>next month sometime, and also get it merged to 8.x a month or so later (I
RW>merged the wrapper macros necessary to do that before 8.0 but didn't have
RW>time to fully evaluate the performance implications of the implementation
RW>switch).

I will try to come up with a patch for the kernel/user interface in the 
meantime. This will be for 9.x only, obviously.

RW>There are two known areas of problem here:
RW>
RW>(1) The cross-product issue with virtual network stacks
RW>(2) The cross-product issue with network interfaces for per-interface stats
RW>
RW>I propose to ignore (1) for now by simply having only vnet0 use per-CPU
RW>stats, and other vnets use single-instance per-vnet stats.  We can solve the
RW>larger problem there at a future date.

This sounds reasonable if we wrap all the statistics stuff into macros 
and/or functions.
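
Something along these lines, I suppose (again just a sketch with invented 
names; the real wrapper macros may well look different):

  /* Per-CPU update for the default stack, plain per-vnet counter for
   * the other vnets.  Builds on the hypothetical ipstat_pcpu[] above. */
  #define IPSTAT_INC(field) do {                          \
          if (IS_DEFAULT_VNET(curvnet))                   \
                  ipstat_pcpu[curcpu].field++;            \
          else                                            \
                  V_ipstat.field++;                       \
  } while (0)

Then ip_input()/ip_output() never need to know which representation is 
behind the macro, and we can change it later without touching the callers.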

RW>I don't have a good proposal for (2) -- the answer may be using DPCPU memory,
RW>but that will require us to support more dynamic allocation ranges, which
RW>may add cost.  (Right now, the DPCPU allocator relies on relatively static
RW>allocations over time).  This means that, for now, we may also ignore that
RW>issue and leave interface counters as-is.  This is probably a good idea
RW>because we also need to deal with multi-queue interfaces better, and perhaps
RW>the stats should be per-queue rather than per-ifnet, which may itself help
RW>address the cache line issue.

Doesn't this help for output only? For the input path there will still 
be per-ifnet statistics.
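
As I understand the per-queue idea, it would look roughly like this 
(hypothetical again):

  /* Counters kept in each RX/TX queue structure instead of in struct
   * ifnet; the sysctl/ioctl path sums over the queues. */
  struct ifqueue_stats {
          uint64_t        ifqs_packets;
          uint64_t        ifqs_bytes;
          uint64_t        ifqs_errors;
  } __aligned(CACHE_LINE_SIZE);

That helps wherever the driver really has a queue per CPU on both the 
send and the receive side; for counters that stay in struct ifnet the 
problem remains.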

An interesting question from the SNMP point of view is what happens to 
the statistics when interfaces are moved between vimages. In any case it 
would be good if we could hide all these complications at the 
kernel->userland boundary.

harti

