network statistics in SMP
hartmut.brandt at dlr.de
Sun Dec 20 14:18:15 UTC 2009
On Sun, 20 Dec 2009, Robert N. M. Watson wrote:
RNMW>On 20 Dec 2009, at 13:19, Harti Brandt wrote:
RNMW>> RW>Frequent writes to the same cache line across multiple cores are remarkably
RNMW>> RW>expensive, as they trigger the cache coherency protocol (mileage may vary).
RNMW>> RW>For example, a single non-atomically incremented counter cut performance of
RNMW>> RW>gettimeofday() to 1/6th on an 8-core system when parallel system
RNMW>> RW>calls were made across all cores. On many current systems, the cost of an
RNMW>> RW>"atomic" operation is now fairly reasonable as long as the cache line is held
RNMW>> RW>exclusively by the current CPU. However, if we can avoid them that has
RNMW>> RW>value, as we update quite a few global stats on the way through the network
RNMW>> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP
RNMW>> packet. I would expect that a single increment is a good percentage of
RNMW>> the entire processing (in terms of numbers of operations) for
RNMW>> gettimeofday(), while in IP forwarding this is somewhere in the noise
RNMW>> floor. In the simplest case the packet is acted upon by the receiving
RNMW>> driver, the IP input function, the IP output function and the sending
RNMW>> driver. Not to mention IP filters, firewalls, tunnels, dummynet and
RNMW>> whatever else. The relative cost of the increment should be much less. But, I
RNMW>> may be wrong of course.
RNMW>If processing is occurring on multiple CPUs -- for example, you are receiving UDP from two ithreads -- then 4-8 cache lines being contended due to stats is a lot. Our goal should be (for 9.0) to avoid having any contended cache lines in the common case when processing independent streams on different CPUs.
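To make the contention point concrete, here is a minimal userland sketch of per-CPU counters in which each CPU's slot occupies its own cache line. All names (CACHE_LINE_SKETCH, NCPU_SKETCH, pcpu_stat_*) are hypothetical, and the 64-byte line size is an assumption; this is not the kernel's actual mechanism. A real kernel version would also have to keep the updating thread from migrating between CPUs in the middle of an update.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SKETCH 64	/* assumption: 64-byte cache lines */
#define NCPU_SKETCH	  8	/* hypothetical CPU count for this sketch */

/* One slot per CPU, aligned so that no two slots share a cache line. */
struct pcpu_counter {
	_Alignas(CACHE_LINE_SKETCH) uint64_t value;
};

struct pcpu_stat {
	struct pcpu_counter slot[NCPU_SKETCH];
};

/* Update path: touches only the cache line owned by the calling CPU. */
static inline void
pcpu_stat_inc(struct pcpu_stat *st, unsigned cpu)
{
	assert(cpu < NCPU_SKETCH);
	st->slot[cpu].value++;
}

/* Query path: walks all slots; the result is a possibly stale snapshot. */
static uint64_t
pcpu_stat_sum(const struct pcpu_stat *st)
{
	uint64_t total = 0;

	for (unsigned i = 0; i < NCPU_SKETCH; i++)
		total += st->slot[i].value;
	return (total);
}
```

The increment needs no atomic operation because each slot has a single writer; the cost moves to the reader, which has to visit every slot.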
RNMW>> I would really like to sort that out before any kind of ABI freeze
RNMW>> happens. Ideally all the statistics would be accessible per sysctl(), have
RNMW>> a version number and have all or most of the required statistics with a
RNMW>> simple way to add new fields without breaking anything. Also the field
RNMW>> sizes (64 vs. 32 bit) should be correct on the kernel - user interface.
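A rough sketch of what such a versioned export could look like, assuming a small header with a version and a length in front of fixed-width 64-bit fields. The structure and function names (xstat_hdr, xstat_v1, xstat_import) are invented for illustration and do not correspond to any existing FreeBSD interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical exported layout, not an existing FreeBSD structure. */
struct xstat_hdr {
	uint32_t version;	/* bumped only on incompatible changes */
	uint32_t length;	/* bytes the producer actually filled in */
};
_Static_assert(sizeof(struct xstat_hdr) == 8, "header must be 8 bytes");

struct xstat_v1 {
	struct xstat_hdr hdr;
	uint64_t ipackets;
	uint64_t opackets;
	uint64_t ierrors;	/* newer fields are appended, never inserted */
};

#define XSTAT_VERSION 1

/*
 * Import a buffer as handed out by the producer into the layout this
 * consumer was compiled against. Fields the producer did not supply
 * read as zero; trailing fields the consumer does not know are ignored.
 * Returns 0 on success, -1 on a version mismatch or a malformed buffer.
 */
static int
xstat_import(struct xstat_v1 *dst, const void *buf, size_t buflen)
{
	struct xstat_hdr hdr;
	size_t n;

	assert(dst != NULL && buf != NULL);
	if (buflen < sizeof(hdr))
		return (-1);
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.version != XSTAT_VERSION ||
	    hdr.length < sizeof(hdr) || hdr.length > buflen)
		return (-1);
	memset(dst, 0, sizeof(*dst));
	n = hdr.length < sizeof(*dst) ? hdr.length : sizeof(*dst);
	memcpy(dst, buf, n);
	dst->hdr.length = (uint32_t)n;
	return (0);
}
```

With this shape, appending a field does not break older consumers, and a consumer built against a newer layout still works on an older producer. Only a change to existing fields requires a new version number.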
RNMW>> My current feeling after reading this thread is that the low-level kernel
RNMW>> side stuff is probably out of what I could do with the time I have and
RNMW>> would sidetrack me too far from the work on bsnmp. What I would like to do
RNMW>> is to fix the kernel/user interface and let the people that know how to do
RNMW>> it handle the low-level side.
RNMW>> I would really not like to have to deal with a changing user/kernel
RNMW>> interface in current if we go in several steps with the kernel stuff.
RNMW>I think we should treat the statistics gathering and statistics
RNMW>reporting interfaces as entirely separable problems. Statistics are
RNMW>updated far more frequently than they are queried, so making the
RNMW>query process a bit more expensive (reformatting from an efficient
RNMW>'update' format to an application-friendly 'report' format) should be
RNMW>One question to think about is whether or not simply cross-CPU
RNMW>summaries are sufficient, or whether we actually also want to be able
RNMW>to directly monitor per-CPU statistics at the IP layer. The former
RNMW>would maintain the status quo making per-CPU behavior purely part of
RNMW>the 'update' step; the latter would change the 'report' format as
RNMW>well. I've been focused primarily on 'update', but at least for my
RNMW>work it would be quite helpful to have per-CPU stats in the 'report'
RNMW>format as well.
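The split between the two formats could be sketched roughly as follows: the 'update' side keeps one private copy of the counters per CPU, and the 'report' side is built on demand with both the cross-CPU summary and the per-CPU rows. The names (ipstat_update, ipstat_report, ipstat_make_report) and the two counters are made up for the example, and cache-line padding of the per-CPU copies is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU_SKETCH 4		/* hypothetical CPU count for this sketch */

/* 'update' format: one private copy per CPU. */
struct ipstat_update {
	uint64_t total;		/* packets received */
	uint64_t forwarded;	/* packets forwarded */
};

/* 'report' format: cross-CPU summary plus the per-CPU breakdown. */
struct ipstat_report {
	uint32_t ncpu;		/* number of valid rows in percpu[] */
	uint32_t spare;		/* keeps the 64-bit fields aligned */
	struct ipstat_update sum;
	struct ipstat_update percpu[NCPU_SKETCH];
};

/* Reformat at query time; the update path never sees this structure. */
static void
ipstat_make_report(struct ipstat_report *rep,
    const struct ipstat_update *upd, uint32_t ncpu)
{
	assert(ncpu <= NCPU_SKETCH);
	memset(rep, 0, sizeof(*rep));
	rep->ncpu = ncpu;
	for (uint32_t i = 0; i < ncpu; i++) {
		rep->percpu[i] = upd[i];
		rep->sum.total += upd[i].total;
		rep->sum.forwarded += upd[i].forwarded;
	}
}
```

A consumer that only wants the status quo reads sum and ignores the rows, so adding per-CPU data to the report does not force it on anyone.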
No problem. I can even add that in a private SNMP MIB if it seems useful.
RNMW>> I will try to come up with a patch for the kernel/user interface in the
RNMW>> mean time. This will be for 9.x only, obviously.
RNMW>Sounds good -- and the kernel stats capture can "grow into" the full
RNMW>report format as it matures.
RNMW>> Doesn't this help for output only? For the input statistics there still
RNMW>> will be per-ifnet statistics.
RNMW>Most ifnet-layer stats should really be per-queue, both for input and
RNMW>output, which may help.
As far as I can see currently the driver just calls if_input which is the
interface-dependent input function. There seems to be no
driver-independent abstraction of input queues. (The hatm driver I wrote
several years ago has two input queues in hardware corresponding to 4 (or
8?) interrupt queues, but somewhere in the driver you put all of this
through the single if_input hook). Or is there something I'm missing?
RNMW>> An interesting question from the SNMP point of view is, what happens to
RNMW>> the statistics if you move around interfaces between vimages. In any case
RNMW>> it would be good if we could abstract from all the complications while
RNMW>> going kernel->userland.
RNMW>At least for now, the interface is effectively recreated when it
RNMW>moves vimage, and only the current vimage is able to monitor it. That
RNMW>could be considered a bug but it might also be a simplifying
RNMW>assumption or even a feature. Likewise, it's worth remembering that
RNMW>the ifnet index space is per-vimage.
I was already thinking about how to fit the vimage stuff into the SNMP
model. The simplest way is to run one SNMP daemon per vimage. Next comes
having one daemon that has one context per vimage. Bsnmpd does its own
mapping of system ifnet indexes to SNMP interface indexes, because the
allocation of system ifnet indexes does not fit the RFC requirements.
This means it will detect when an interface is moved away from a vimage
and comes back later. If the kernel statistics are stable over these
movements, there is no need to declare a counter discontinuity via SNMP.
On the other hand these operations are probably seldom enough ...
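For illustration, the kind of mapping meant here might look like the sketch below: SNMP indexes are handed out once per interface name and never reused, while the system ifnet index is just an attribute that may change when the interface comes back. This is a simplified sketch with invented names (ifmap, ifmap_attach, ifmap_detach) and a fixed-size table, not the code bsnmpd actually uses, and keying on the interface name is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAXIF_SKETCH 16		/* hypothetical table size */
#define IFNAME_SKETCH 16	/* hypothetical name buffer size */

struct ifmap_entry {
	char	 name[IFNAME_SKETCH];
	unsigned sysindex;	/* current system ifnet index, may change */
	unsigned snmpindex;	/* stable SNMP index, never reused */
	bool	 present;	/* false while the interface is away */
};

struct ifmap {
	struct ifmap_entry e[MAXIF_SKETCH];
	unsigned n;
	unsigned next_snmpindex;
};

/*
 * Record that an interface is visible. A name seen before gets its old
 * SNMP index back. Returns the SNMP index, or 0 if the table is full or
 * the name does not fit.
 */
static unsigned
ifmap_attach(struct ifmap *m, const char *name, unsigned sysindex)
{
	for (unsigned i = 0; i < m->n; i++) {
		if (strcmp(m->e[i].name, name) == 0) {
			m->e[i].sysindex = sysindex;
			m->e[i].present = true;
			return (m->e[i].snmpindex);
		}
	}
	if (m->n == MAXIF_SKETCH || strlen(name) >= IFNAME_SKETCH)
		return (0);
	struct ifmap_entry *ne = &m->e[m->n++];
	strcpy(ne->name, name);
	ne->sysindex = sysindex;
	ne->snmpindex = ++m->next_snmpindex;
	ne->present = true;
	return (ne->snmpindex);
}

/* Record that an interface has gone away; its SNMP index stays reserved. */
static void
ifmap_detach(struct ifmap *m, const char *name)
{
	for (unsigned i = 0; i < m->n; i++)
		if (strcmp(m->e[i].name, name) == 0)
			m->e[i].present = false;
}
```

Whether a returning interface also needs a counter discontinuity then depends only on whether the kernel counters survived the move, which the daemon can decide at attach time.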