network statistics in SMP

Bruce Evans brde at
Sun Dec 20 20:18:01 UTC 2009

On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:

> BE>...
> BE>I don't see why using atomic or locks for just the 64 bit counters is good.
> BE>We will probably end up with too many 64-bit counters, especially if they
> BE>don't cost much when not read.
> On a 32-bit arch, when reading a 32-bit value on one CPU while another CPU is
> modifying it, the read will probably always be correct, given that the
> variable is correctly aligned.

We assume this.  Otherwise the per-CPU counter optimization wouldn't work
so well.

> On a 64-bit arch, when reading a 64-bit value on one CPU
> while the other one is adding to it, do I always get the correct value? I'm
> not sure about this, which is why I put atomic_*() there, assuming that they
> will make this correct.

You have to use the PCPU_INC()/PCPU_GET() interface and not worry about
this.  It must supply any necessary atomicness on an arch-specific basis.

> The idea is (for 32-bit platforms):
> struct pcpu_stats {
> 	uint32_t in_bytes;
> 	uint32_t in_packets;
> };
> struct pcpu_hc_stats {
> 	uint64_t hc_in_bytes;
> 	uint64_t hc_in_packets;
> };
> /* driver; IP stack; ... */
> ...
> pcpu_stats->in_bytes += bytes;
> pcpu_stats->in_packets++;
> ...
> /* per CPU kernel thread for 32-bit arch */
> lock(pcpu_hc_stats);
> ...
> val = pcpu_stats->in_bytes;
> if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
> 	pcpu_hc_stats->hc_in_bytes += 0x100000000;
> pcpu_hc_stats->hc_in_bytes = (pcpu_hc_stats->hc_in_bytes &
>    0xffffffff00000000ULL) | val;

Why not just `pcpu_hc_stats->hc_in_bytes |= val' for the second statement?

> ...
> unlock(pcpu_hc_stats);

> BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> BE>> machine and do routing at link speed, though. This might overflow the IP
> BE>> input/output byte counter (which we don't have yet) too fast.
> BE>
> BE>Not with a mere 10Gbit/s.  That's ~1GB/s, so it takes about 4 seconds to overflow
> BE>a 32-bit byte counter.  A bit counter would take a while to overflow too.
> BE>Are there any faster incrementors?  TSCs also take O(1) seconds to overflow,
> BE>and timecounter logic depends on no timecounter overflowing much faster
> BE>than that.
> If you have 4 10GBit/s adapters each operating full-duplex at link speed you
> wrap in under 0.5 seconds, maybe even faster if you have some kind of tunnels
> where each packet counts several times. But I suppose this will not be so easy
> to implement on IA32 :-)

I was only thinking of per-interface counters, but we can handle the
64-bit fixup for thousands of these if we can handle thousands of fixup
threads each running thousands of times per second.  Actually, thousands
of adaptors would need thousands of CPUs to handle them, and each per-CPU
counter would be limited by what 1 CPU could handle, so we're back nearer
to the original 4 seconds than to 4/4000.


More information about the freebsd-arch mailing list