network statistics in SMP

Harti Brandt hartmut.brandt at dlr.de
Sun Dec 20 14:18:15 UTC 2009


On Sun, 20 Dec 2009, Robert N. M. Watson wrote:

RNMW>
RNMW>On 20 Dec 2009, at 13:19, Harti Brandt wrote:
RNMW>
RNMW>> RW>Frequent writes to the same cache line across multiple cores are remarkably
RNMW>> RW>expensive, as they trigger the cache coherency protocol (mileage may vary).
RNMW>> RW>For example, a single non-atomically incremented counter cut the
RNMW>> RW>performance of gettimeofday() to 1/6th on an 8-core system when parallel
RNMW>> RW>system calls were made across all cores.  On many current systems, the
RNMW>> RW>cost of an "atomic" operation is now fairly reasonable as long as the
RNMW>> RW>cache line is held exclusively by the current CPU.  However, if we can
RNMW>> RW>avoid them, that has value, as we update quite a few global stats on the
RNMW>> RW>way through the network stack.
RNMW>> 
RNMW>> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP 
RNMW>> packet. I would expect that a single increment is a good percentage of 
RNMW>> the entire processing (in terms of number of operations) for 
RNMW>> gettimeofday(), while in IP forwarding it is somewhere in the noise 
RNMW>> floor. In the simplest case the packet is acted upon by the receiving 
RNMW>> driver, the IP input function, the IP output function and the sending 
RNMW>> driver -- not to mention IP filters, firewalls, tunnels, dummynet and 
RNMW>> whatever else. The relative cost of the increment should be much less. 
RNMW>> But I may be wrong, of course.
RNMW>
RNMW>If processing is occurring on multiple CPUs -- for example, you are receiving UDP from two ithreads -- then 4-8 cache lines being contended due to stats is a lot. Our goal should be (for 9.0) to avoid having any contended cache lines in the common case when processing independent streams on different CPUs.
RNMW>
RNMW>> I would really like to sort that out before any kind of ABI freeze 
RNMW>> happens. Ideally all the statistics would be accessible via sysctl(), 
RNMW>> carry a version number, cover all or most of the required statistics, 
RNMW>> and provide a simple way to add new fields without breaking anything. 
RNMW>> Also the field sizes (64 vs. 32 bit) should be correct at the 
RNMW>> kernel/user interface.
RNMW>> 
RNMW>> My current feeling after reading this thread is that the low-level kernel 
RNMW>> side is probably beyond what I can do with the time I have, and would 
RNMW>> sidetrack me too far from the work on bsnmp. What I would like to do is 
RNMW>> to fix the kernel/user interface and let the people that know how to do 
RNMW>> it handle the low-level side.
RNMW>> 
RNMW>> I would really not like to have to deal with a changing user/kernel 
RNMW>> interface in -current if we go through several steps with the kernel work.
RNMW>
RNMW>I think we should treat the statistics gathering and statistics 
RNMW>reporting interfaces as entirely separable problems. Statistics are 
RNMW>updated far more frequently than they are queried, so making the 
RNMW>query process a bit more expensive (reformatting from an efficient 
RNMW>'update' format to an application-friendly 'report' format) should be 
RNMW>fine.
RNMW>
RNMW>One question to think about is whether simple cross-CPU summaries 
RNMW>are sufficient, or whether we actually also want to be able 
RNMW>to directly monitor per-CPU statistics at the IP layer. The former 
RNMW>would maintain the status quo, making per-CPU behavior purely part of 
RNMW>the 'update' step; the latter would change the 'report' format as 
RNMW>well. I've been focused primarily on 'update', but at least for my 
RNMW>work it would be quite helpful to have per-CPU stats in the 'report' 
RNMW>format as well.

No problem. I can even add that in a private SNMP MIB if it seems useful.
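
To make sure I understand the cache-line issue, here is roughly how I 
picture the 'update'/'report' split (a minimal userland sketch, with 
made-up names and sizes -- this is not any existing kernel API): each CPU 
increments its own cache-line-padded slot without atomics or locks, and 
only the report path folds the slots into a single 64-bit value.

#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define MAXCPU          64

/* One slot per CPU, padded so that concurrent updaters never share a
 * cache line and thus never trigger the coherency protocol. */
struct pcpu_counter {
        uint64_t val;
        char     pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct pcpu_counter ip_forwarded[MAXCPU];

/* Update path: a plain increment of the current CPU's slot, assuming the
 * caller is pinned (or preemption is disabled) while it runs. */
static inline void
ip_forwarded_inc(int curcpu)
{

        ip_forwarded[curcpu].val++;
}

/* Report path: fold the per-CPU slots into the application-friendly
 * format.  Reads of remote slots may race with updates, but since the
 * counters are monotonic a slightly stale sum is acceptable. */
static uint64_t
ip_forwarded_read(void)
{
        uint64_t sum = 0;
        int i;

        for (i = 0; i < MAXCPU; i++)
                sum += ip_forwarded[i].val;
        return (sum);
}

Since updates vastly outnumber queries, summing MAXCPU slots in the report 
path is exactly the cheap side of the trade-off described above.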
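
And the versioned, fixed-width report format I would like to see at the 
kernel/user boundary could look roughly like this (struct and field names 
are invented for illustration): userland checks the version and length 
before interpreting anything, all counters are 64 bit regardless of 
architecture, and new fields may only ever be appended.

#include <stdint.h>

#define IPSTAT_VERSION  1

struct ipstat_v1 {
        uint32_t        ips_version;    /* IPSTAT_VERSION */
        uint32_t        ips_len;        /* size of the struct as filled in */
        uint64_t        ips_total;      /* total packets received */
        uint64_t        ips_forwarded;  /* packets forwarded */
        uint64_t        ips_delivered;  /* packets delivered upward */
        /* later versions append new fields here */
};

An old binary that sees a larger ips_len simply ignores the tail; a new 
binary that sees a smaller one knows the extra fields are unavailable.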

RNMW>
RNMW>> I will try to come up with a patch for the kernel/user interface in the 
RNMW>> meantime. This will be for 9.x only, obviously.
RNMW>
RNMW>Sounds good -- and the kernel stats capture can "grow into" the full 
RNMW>report format as it matures.
RNMW>
RNMW>> Doesn't this help for output only? For the input statistics there still 
RNMW>> will be per-ifnet statistics.
RNMW>
RNMW>Most ifnet-layer stats should really be per-queue, both for input and 
RNMW>output, which may help.

As far as I can see, the driver currently just calls if_input, which is the 
interface-dependent input function. There seems to be no 
driver-independent abstraction of input queues. (The hatm driver I wrote 
several years ago has two input queues in hardware, corresponding to 4 (or 
8?) interrupt queues, but somewhere in the driver all of this goes 
through the single if_input hook.) Or is there something I'm missing?
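
For illustration, the pattern I mean looks like this (the mydrv_* names 
are hypothetical; only if_input and the mbuf fields are the real 
interface): however many receive rings the hardware has, every packet 
reaches the stack through the one per-ifnet if_input hook, so per-queue 
input counters would have to live in the driver itself.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

struct mydrv_softc {
        struct ifnet    *sc_ifp;
        uint64_t        sc_q_ipackets[8];       /* per-ring input counter */
};

/* Hypothetical helper: pull the next completed mbuf off RX ring 'qidx'. */
struct mbuf     *mydrv_rx_dequeue(struct mydrv_softc *, int qidx);

static void
mydrv_rxeof(struct mydrv_softc *sc, int qidx)
{
        struct ifnet *ifp = sc->sc_ifp;
        struct mbuf *m;

        while ((m = mydrv_rx_dequeue(sc, qidx)) != NULL) {
                m->m_pkthdr.rcvif = ifp;
                sc->sc_q_ipackets[qidx]++;      /* driver-private stat */
                (*ifp->if_input)(ifp, m);       /* all rings funnel here */
        }
}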

RNMW>> An interesting question from the SNMP point of view is what happens to 
RNMW>> the statistics if you move interfaces between vimages. In any case 
RNMW>> it would be good if we could abstract away all these complications at 
RNMW>> the kernel->userland boundary.
RNMW>
RNMW>At least for now, the interface is effectively recreated when it 
RNMW>moves between vimages, and only the current vimage is able to monitor 
RNMW>it. That could be considered a bug, but it might also be a simplifying 
RNMW>assumption or even a feature. Likewise, it's worth remembering that 
RNMW>the ifnet index space is per-vimage.

I was already thinking about how to fit the vimage stuff into the SNMP 
model. The simplest way is to run one SNMP daemon per vimage. Next comes 
having one daemon with one context per vimage. Bsnmpd does its own 
mapping of system ifnet indexes to SNMP interface indexes, because the 
allocation of system ifnet indexes does not meet the RFC requirements. 
This means it will detect when an interface is moved out of a vimage 
and comes back later. If the kernel statistics are stable across these 
moves, there is no need to declare a counter discontinuity via SNMP. 
On the other hand, these operations are probably rare enough ...
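
For what it's worth, the mapping bsnmpd does can be sketched roughly like 
this (this is not the actual bsnmpd code; the table size and names are 
made up): SNMP indexes are handed out once per interface name and never 
reused, so an interface that comes back from another vimage gets its old 
ifIndex back.

#include <stdint.h>
#include <string.h>

#define MAXIF   128

struct ifmap {
        char            name[16];       /* interface name, e.g. "em0" */
        uint32_t        sysindex;       /* kernel ifnet index, 0 = detached */
        uint32_t        snmpindex;      /* stable SNMP ifIndex, 0 = slot free */
};

static struct ifmap     ifmap[MAXIF];
static uint32_t         next_snmpindex = 1;

/* Return the stable SNMP index for a kernel interface, allocating one the
 * first time the name is seen and reusing it when the name reappears. */
static uint32_t
map_ifindex(const char *name, uint32_t sysindex)
{
        struct ifmap *m;

        for (m = ifmap; m < ifmap + MAXIF; m++) {
                if (m->snmpindex != 0 && strcmp(m->name, name) == 0) {
                        m->sysindex = sysindex; /* re-attached */
                        return (m->snmpindex);
                }
        }
        for (m = ifmap; m < ifmap + MAXIF; m++) {
                if (m->snmpindex == 0) {
                        strlcpy(m->name, name, sizeof(m->name));
                        m->sysindex = sysindex;
                        m->snmpindex = next_snmpindex++;
                        return (m->snmpindex);
                }
        }
        return (0);     /* table full */
}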

harti 

