network statistics in SMP

Sat Dec 19 17:15:51 UTC 2009

On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:
>
> [... complications]
>
> To be honest, I'm lost now. Couldn't we just use the largest atomic type
> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
> thread?

That's probably best, except without the atomic operations, as I said
originally.  I tried to spell out the complications to make it clear that
handling them would be too much, except in an incomplete way.

> Are the 5-6 atomic operations really that costly given the many operations
> done on an IP packet? Are they more costly than a heavyweight sync for
> each ++ or +=?

rwatson found that even non-atomic operations are quite costly, since
at least on amd64 and i386, ones that write (or any access?) the same
address (or cache line?) apparently involve much the same hardware
activity (cache snoop?) as atomic ones implemented by locking the bus.
I think this is mostly historical -- it should take an explicit bus lock
to get the slow version.  Per-CPU counters give separate addresses
and also don't require the bus lock.  I don't like the complexity for
per-CPU counters but don't use big SMP systems enough to know what the
locks cost in real applications.
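
A minimal user-space sketch of the per-CPU layout described above, not
kernel code.  All names (sketch_*), the CPU count and the 64-byte cache
line size are assumptions made for illustration.  A real kernel version
would also have to keep the thread on its CPU while it increments.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NCPU       4   /* hypothetical CPU count */
#define SKETCH_CACHE_LINE 64  /* assumed cache line size in bytes */

/* One slot per CPU; the alignment pads each slot to its own cache line. */
struct sketch_pcpu_stat {
	_Alignas(SKETCH_CACHE_LINE) uint32_t count;
};

static struct sketch_pcpu_stat sketch_ips_total[SKETCH_NCPU];

/* Fast path: plain, unlocked increment of the given CPU's slot. */
static void
sketch_stat_inc(unsigned cpu)
{
	assert(cpu < SKETCH_NCPU);
	sketch_ips_total[cpu].count++;
}

/* Slow path (e.g. a sysctl read): sum all slots into a snapshot. */
static uint64_t
sketch_stat_sum(void)
{
	uint64_t sum = 0;

	for (unsigned i = 0; i < SKETCH_NCPU; i++)
		sum += sketch_ips_total[i].count;
	return (sum);
}
```

Each writer touches only its own cache line, so no line is shared
between writers.  The reader pays for visiting every slot instead.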

> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?). Between that thread and the
> sysctl we could use a heavy sync.

I don't like the squillions of threads in FreeBSD-post-4, but this seems
to need its own one and there isn't one yet AFAIK.  I think a thread is
only needed for the 32-bit stuff (since aggregation has to use the
current values and it shouldn't have to ask a thread to sum them).  The
thread should maintain only the high 32 or 33 bits of the 64-bit counters.
Maybe there should be a thread per CPU (ugh) with per-CPU extra bits so
that these bits can be accessed without locking.  The synchronization is
still interesting.
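
A single-threaded sketch of a fixup pass that maintains only the high
bits, as described above.  The struct and function names are
hypothetical, and the synchronization between the fast path, the fixup
thread and readers is deliberately left out.  It assumes the pass runs at
least once per wrap of the low word.

```c
#include <stdint.h>

struct sketch_wide_stat {
	uint32_t low;       /* bumped by the fast path with plain ++ or += */
	uint32_t last_low;  /* value of low seen by the previous fixup pass */
	uint32_t high;      /* maintained only by the fixup pass */
};

/* Periodic pass: a smaller low word than last time means it wrapped. */
static void
sketch_fixup_pass(struct sketch_wide_stat *s)
{
	uint32_t now = s->low;

	if (now < s->last_low)
		s->high++;
	s->last_low = now;
}

/* Reader: account for a wrap that the fixup pass has not yet recorded. */
static uint64_t
sketch_wide_read(const struct sketch_wide_stat *s)
{
	uint32_t now = s->low;
	uint64_t hi = (uint64_t)s->high + (now < s->last_low ? 1 : 0);

	return ((hi << 32) | now);
}
```

If the low word can wrap more than once between passes, a wrap is lost.
That is why the overflow time of the fastest counter bounds the interval
between passes.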

> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
> largest atomic type for the platform, handle the aggregation and (on IA32)
> the 32->64 bit stuff in a kernel thread.

I don't see why using atomic or locks for just the 64 bit counters is good.
We will probably end up with too many 64-bit counters, especially if they
don't cost much when not read.

I just thought of another implementation to reduce reads: trap on
overflow and handle all the complications in the trap handler, or
just set a flag to tell the fixup thread to run and normally don't
run the fixup thread.  This seems to not quite work -- arranging
for the trap would be costly (needs "into" instruction on i386?).
Similarly for explicit tests for wraparound (PCPU_INC() could be a
function call that does the test and handles wraparound in a fully
locked fashion.  We don't care that this code executes slowly since
it rarely executes, but we care that the test pessimizes the usual
case).
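
A sketch of the explicit wraparound test in a function-call increment.
The names are hypothetical and this is not the real PCPU_INC().  The slow
path here just bumps the high word; a kernel version would do that under
a lock, which is omitted.

```c
#include <stdint.h>

struct sketch_stat64 {
	uint32_t low;   /* fast-path word */
	uint32_t high;  /* touched only when low wraps */
};

/* Rare path: would be fully locked in a kernel; locking is omitted. */
static void
sketch_stat_wrap(struct sketch_stat64 *s)
{
	s->high++;
}

/* Usual path: one add plus one compare; the branch is rarely taken. */
static void
sketch_stat_add(struct sketch_stat64 *s, uint32_t n)
{
	uint32_t old = s->low;

	s->low = old + n;
	if (s->low < old)
		sketch_stat_wrap(s);
}
```

The compare and branch after every add is the cost that pessimizes the
usual case.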

There is also "lock cmpxchg8b" on i386.  I think this can be used in a
loop to implement atomic 64-bit ops (?).  Simpler, but slower in
PCPU_INC().  I prefer a function call version of PCPU_INC() to this.
That should be faster in the usual case and only much larger if we
have too many 64-bit counters.
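
A sketch of the compare-and-exchange loop using C11 <stdatomic.h> rather
than inline assembly.  The function name is hypothetical.  Whether the
compiler lowers this to "lock cmpxchg8b" on i386 depends on the toolchain
and target flags, so treat that as an assumption.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Add n to *p atomically by retrying until the 64-bit swap succeeds. */
static void
sketch_atomic_add_64(_Atomic uint64_t *p, uint64_t n)
{
	uint64_t old = atomic_load_explicit(p, memory_order_relaxed);

	/* On failure, old is reloaded with the current value of *p. */
	while (!atomic_compare_exchange_weak_explicit(p, &old, old + n,
	    memory_order_relaxed, memory_order_relaxed))
		;
}
```

Every increment pays for a locked instruction here, even when the loop
succeeds on the first try.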

> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> machine and do routing at link speed, though. This might overflow the IP
> input/output byte counter (which we don't have yet) too fast.

Not with a mere 10Gb/s.  That's ~1GB/s so it takes ~4 seconds to overflow
a 32-bit byte counter.  A bit counter would take a while to overflow too.
Are there any faster incrementors?  TSCs also take O(1) seconds to overflow,
and timecounter logic depends on no timecounter overflowing much faster
than that.
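
The overflow time above can be checked with a few lines of C.  The
function name is hypothetical.  With the exact 1.25GB/s for a 10Gb/s link
the result is about 3.4 seconds, the same order as the rounded estimate.

```c
/* Seconds for a 32-bit byte counter to wrap at a given link speed. */
static double
sketch_wrap_seconds(double bits_per_second)
{
	double bytes_per_second = bits_per_second / 8.0;

	return (4294967296.0 / bytes_per_second);	/* 2^32 bytes */
}
```

For several adapters, pass their summed rate; the wrap time shrinks in
proportion.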

Bruce