network statistics in SMP

Sat Dec 19 20:01:35 UTC 2009

On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Harti Brandt wrote:
BE>
BE>> On Sun, 20 Dec 2009, Bruce Evans wrote:
BE>> 
BE>> [... complications]
BE>> 
BE>> To be honest, I'm lost now. Couldn't we just use the largest atomic type
BE>> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
BE>> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
BE>> thread?
BE>
BE>That's probably best (except without the atomic operations) (like I said
BE>originally.  I tried to spell out the complications to make it clear that
BE>they would be too much except for incomplete ones).
BE>
BE>> Are the 5-6 atomic operations really that costly given the many operations
BE>> done on an IP packet? Are they more costly than a heavyweight sync for
BE>> each ++ or +=?
BE>
BE>rwatson found that even non-atomic operations are quite costly, since
BE>at least on amd64 and i386, ones that write (or any access?) the same
BE>address (or cache line?) apparently involve much the same hardware
BE>activity (cache snoop?) as atomic ones implemented by locking the bus.
BE>I think this is mostly historical -- it should be necessary to lock the
BE>bus to get the slow version.  Per-CPU counters give separate addresses
BE>and also don't require the bus lock.  I don't like the complexity for
BE>per-CPU counters but don't use big SMP systems enough to know what the
BE>locks cost in real applications.
BE>
BE>> Or we could use the PCPU stuff, use just ++ and += for modifying the
BE>> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
BE>> kernel thread per CPU (do we have this?). Between that thread and the
BE>> sysctl we could use a heavy sync.
BE>
BE>I don't like the squillions of threads in FreeBSD-post-4, but this seems
BE>to need its own one and there isn't one yet AFAIK.  I think a thread is
BE>only needed for the 32-bit stuff (since aggregation has to use the
BE>current values and it shouldn't have to ask a thread to sum them).  The
BE>thread should maintain only the high 32 or 33 bits of the 64-bit counters.
BE>Maybe there should be a thread per CPU (ugh) with per-CPU extra bits so
BE>that these bits can be accessed without locking.  The synchronization is
BE>still interesting.
BE>
BE>> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
BE>> largest atomic type for the platform, handle the aggregation and (on IA32)
BE>> the 32->64 bit stuff in a kernel thread.
BE>
BE>I don't see why using atomic or locks for just the 64 bit counters is good.
BE>We will probably end up with too many 64-bit counters, especially if they
BE>don't cost much when not read.

On a 32-bit arch when reading a 32-bit value on one CPU while the other CPU is
modifying it, the read will probably be always correct given the variable is
correctly aligned. On a 64-bit arch when reading a 64-bit value on one CPU
while the other one is adding to, do I always get the correct value? I'm
not sure about this, why I put atomic_*() there assuming that they will make
this correct.

The idea is (for 32-bit platforms):

struct pcpu_stats {
	uint32_t in_bytes;
	uint32_t in_packets;
};

struct pcpu_hc_stats {
	uint64_t hc_in_bytes;
	uint64_t hc_in_packets;
};

/* driver; IP stack; ... */
...
pcpu_stats->in_bytes += bytes;
pcpu_stats->in_packets++;
...

/* per CPU kernel thread for 32-bit arch */
lock(pcpu_hc_stats);
...
val = pcpu_stats->in_bytes;
if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
	pcpu_hc_stats->in_bytes += 0x100000000;
pcpu_hc_stats->in_bytes = (pcpu_hc_stats->in_bytes &
    0xffffffff00000000ULL) | val;
...
unlock(pcpu_hc_stats);

/* sysctl */

memset(&stats, 0, sizeof(stats));
foreach(cpu) {
	lock(pcpu_hc_stats(cpu));
	...
	stats.in_bytes += pcpu_hc_stats(cpu)->hc_in_bytes;
	...
	unlock(pcpu_hc_stats(cpu));
}
copyout(stats);

On 64-bit archs we can go without the locks and the thread given that we
can reliably read the 64-bit per CPU numbers (can we?).

BE>I just thought of another implementation to reduce reads: trap on
BE>overflow and handle all the complications in the trap handler, or
BE>just set a flag to tell the fixup thread to run and normally don't
BE>run the fixup thread.  This seems to not quite work -- arranging
BE>for the trap would be costly (needs "into" instruction on i386?).
BE>Similarly for explicit tests for wraparound (PCPU_INC() could be a
BE>function call that does the test and handles wraparound in a fully
BE>locked fashion.  We don't care that this code executes slowly since
BE>it rarely executes, but we care that the test pessimizes the usual
BE>case).
BE>
BE>There is also "lock cmpxchg8b" on i386.  I think this can be used in a
BE>loop to implement atomic 64-bit ops (?).  Simpler, but slower in
BE>PCPU_INC().  I prefer a function call version of PCPU_INC() to this.
BE>That should be faster in the usual case and only much larger if we
BE>have too many 64-bit counters.
BE>
BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
BE>> machine and do routing at link speed, though. This might overflow the IP
BE>> input/output byte counter (which we don't have yet) too fast.
BE>
BE>Not with a mere 10GB/S.  That's ~1GB/S so it takes 4 seconds to overflow
BE>a 32-bit byte counter.  A bit counter would take a while to overflow too.
BE>Are there any faster incrementors?  TSCs also take O(1) seconds to overflow,
BE>and timecounter logic depends on no timecounter overflowing much faster
BE>than that.

If you have 4 10GBit/s adapters each operating full-duplex at link speed you
wrap in under 0.5 seconds, maybe even faster if you have some kind of tunnels
where each packet counts several times. But I suppose this will be not so easy
with IA32 to implement :-)

harti