network statistics in SMP
Robert N. M. Watson
rwatson at FreeBSD.org
Sun Dec 20 13:47:03 UTC 2009
On 20 Dec 2009, at 13:19, Harti Brandt wrote:
> RW>Frequent writes to the same cache line across multiple cores are remarkably
> RW>expensive, as they trigger the cache coherency protocol (mileage may vary).
> RW>For example, a single non-atomically incremented counter cut gettimeofday()
> RW>performance to 1/6th on an 8-core system when parallel system calls were
> RW>made across all cores. On many current systems, the cost of an "atomic"
> RW>operation is now fairly reasonable as long as the cache line is held
> RW>exclusively by the current CPU. However, if we can avoid them that has
> RW>value, as we update quite a few global stats on the way through the network
> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP
> packet. I would expect that a single increment is a good percentage of
> the entire processing (in terms of number of operations) for
> gettimeofday(), while in IP forwarding it is somewhere in the noise
> floor. In the simplest case the packet is acted upon by the receiving
> driver, the IP input function, the IP output function and the sending
> driver -- not to mention IP filters, firewalls, tunnels, dummynet and
> whatever else. The relative cost of the increment should be much less. But I
> may be wrong, of course.
If processing is occurring on multiple CPUs -- for example, you are receiving UDP from two ithreads -- then 4-8 cache lines being contended due to stats is a lot. Our goal should be (for 9.0) to avoid having any contended cache lines in the common case when processing independent streams on different CPUs.
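To make the contention argument concrete, here is a minimal userspace sketch of per-CPU counters: each counter sits in its own cache line and is only ever written by its owning CPU, so the update path needs no atomics and no cross-CPU cache-line bouncing. All names are hypothetical; real kernel code would use FreeBSD's PCPU machinery (curcpu, critical sections) rather than an explicit cpu argument.

```c
#include <assert.h>
#include <stdint.h>

#define MAXCPU		8
#define CACHE_LINE_SIZE	64

/* One counter slot per CPU, padded so each occupies its own cache line. */
struct pcpu_counter {
	uint64_t	value;
	char		pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
};

static struct pcpu_counter ipstat_forwarded[MAXCPU];

/*
 * Update path: increment only the local CPU's slot.  No other CPU writes
 * this line, so no coherency traffic is generated in the common case.
 */
static void
counter_inc(int cpu)
{

	ipstat_forwarded[cpu].value++;
}

/* Read path: fold the per-CPU slots into a single total at query time. */
static uint64_t
counter_fetch(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < MAXCPU; i++)
		sum += ipstat_forwarded[i].value;
	return (sum);
}
```

The read side is slightly more expensive and only approximately consistent across CPUs, which is fine for statistics -- exactly the "cheap update, costlier query" trade-off argued for below.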
> I would really like to sort that out before any kind of ABI freeze
> happens. Ideally all the statistics would be accessible via sysctl(), have
> a version number, and have all or most of the required statistics with a
> simple way to add new fields without breaking anything. Also, the field
> sizes (64 vs. 32 bit) should be correct at the kernel/user interface.
> My current feeling after reading this thread is that the low-level kernel
> side stuff is probably beyond what I could do with the time I have and
> would sidetrack me too far from the work on bsnmp. What I would like to do
> is to fix the kernel/user interface and let the people that know how to do
> it handle the low-level side.
> I would really not like to have to deal with a changing user/kernel
> interface in -CURRENT if we go in several steps with the kernel stuff.
I think we should treat the statistics gathering and statistics reporting interfaces as entirely separable problems. Statistics are updated far more frequently than they are queried, so making the query process a bit more expensive (reformatting from an efficient 'update' format to an application-friendly 'report' format) should be fine.
One question to think about is whether cross-CPU summaries alone are sufficient, or whether we actually also want to be able to directly monitor per-CPU statistics at the IP layer. The former would maintain the status quo, making per-CPU behavior purely part of the 'update' step; the latter would change the 'report' format as well. I've been focused primarily on 'update', but at least for my work it would be quite helpful to have per-CPU stats in the 'report' format as well.
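The update/report split might look something like the following sketch: a per-CPU 'update' layout folded at query time into a versioned, 64-bit 'report' structure that carries both the cross-CPU summary and the raw per-CPU rows. The struct and field names are purely illustrative, not an actual FreeBSD ABI proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAXCPU	4

/* Hypothetical 'update' format: one row of counters per CPU. */
struct ipstat_pcpu {
	uint64_t	ips_total;
	uint64_t	ips_forwarded;
};

static struct ipstat_pcpu ipstat[MAXCPU];

/*
 * Hypothetical 'report' format: versioned, all fields 64-bit, carrying
 * both the cross-CPU summary and the per-CPU rows so monitoring tools
 * can pick either view.
 */
struct ipstat_report {
	uint32_t		isr_version;	/* bump on layout change */
	uint32_t		isr_ncpus;
	struct ipstat_pcpu	isr_sum;
	struct ipstat_pcpu	isr_pcpu[MAXCPU];
};

/* Reformat from 'update' to 'report' at query (e.g. sysctl) time. */
static void
ipstat_export(struct ipstat_report *rep)
{

	memset(rep, 0, sizeof(*rep));
	rep->isr_version = 1;
	rep->isr_ncpus = MAXCPU;
	for (int i = 0; i < MAXCPU; i++) {
		rep->isr_pcpu[i] = ipstat[i];
		rep->isr_sum.ips_total += ipstat[i].ips_total;
		rep->isr_sum.ips_forwarded += ipstat[i].ips_forwarded;
	}
}
```

Because the report is assembled on demand, the kernel can change the internal 'update' layout freely; only the versioned report struct is part of the user-visible interface.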
> I will try to come up with a patch for the kernel/user interface in the
> meantime. This will be for 9.x only, obviously.
Sounds good -- and the kernel stats capture can "grow into" the full report format as it matures.
> Doesn't this help for output only? For the input statistics there still
> will be per-ifnet statistics.
Most ifnet-layer stats should really be per-queue, both for input and output, which may help.
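The per-queue idea can be sketched the same way: since each rx/tx queue is typically serviced by a single ithread, its counters can be updated without contention, and interface-wide totals become a report-time fold. Again, the structures and names here are hypothetical, not the actual ifnet layout.

```c
#include <assert.h>
#include <stdint.h>

#define NRXQ	2
#define NTXQ	2

/*
 * Hypothetical per-queue counters: each queue is serviced by one
 * ithread/CPU, so its counters are updated without contention.
 */
struct ifqueue_stats {
	uint64_t	packets;
	uint64_t	bytes;
};

struct ifnet_stats {
	struct ifqueue_stats	rxq[NRXQ];
	struct ifqueue_stats	txq[NTXQ];
};

/* Interface-wide input totals are derived at report time. */
static uint64_t
if_ipackets(const struct ifnet_stats *s)
{
	uint64_t sum = 0;

	for (int i = 0; i < NRXQ; i++)
		sum += s->rxq[i].packets;
	return (sum);
}
```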
> An interesting question from the SNMP point of view is, what happens to
> the statistics if you move around interfaces between vimages. In any case
> it would be good if we could abstract from all the complications while
> going kernel->userland.
At least for now, the interface is effectively recreated when it moves vimage, and only the current vimage is able to monitor it. That could be considered a bug but it might also be a simplifying assumption or even a feature. Likewise, it's worth remembering that the ifnet index space is per-vimage.