network statistics in SMP

Harti Brandt hartmut.brandt at
Sat Dec 19 16:02:29 UTC 2009

On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Hans Petter Selasky wrote:
BE>> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
BE>> > On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
BE>> > > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
BE>> > > > I see. I was also thinking along these lines, but was not sure
BE>> > > > whether
BE>> > > > it is worth the trouble. I suppose this does not help to implement
BE>> > > > 64-bit counters on 32-bit architectures, though, because you cannot
BE>> > > > read them reliably without locking to sum them up, right?
BE>> > > 
BE>> > > Either that or you just accept that you have a small race since it is
BE>> > > only stats. :)
BE>> > 
BE>> > This might be stupid, but can we not easily *read* 64bit counters
BE>> > on 32bit machines like this:
BE>> > 
BE>> > do {
BE>> >     h1 = read_upper_32bits;
BE>> >     l1 = read_lower_32bits;
BE>> >     h2 = read_upper_32bits;
BE>> >     l2 = read_lower_32bits; /* not needed */
BE>> > } while (h1 != h2);
BE>No.  See my previous^N reply, but don't see it about this since it was
BE>wrong about this for all N :-).
BE>> Just a comment. You know you don't need a while loop to get a stable value?
BE>> Should be implemented like this, in my opinion:
BE>> h1 = read_upper_32bits;
BE>> l1 = read_lower_32bits;
BE>> h2 = read_upper_32bits;
BE>> if (h1 != h2)
BE>> 	l1 = 0xffffffffUL;
BE>> sum64 = (h1<<32) | l1;
BE>Also wrong :-).  Apart from write ordering problems (1), the write of
BE>the second half (presumably the top half) might not have completed when the
BE>above looks at it.  Then both of the above will see:
BE>    h1 = old value
BE>    l1 = new value
BE>    h2 = old value (since the new one has not been written yet).
BE>The race window for this can be arbitrarily long, since the second write
BE>can be delayed for arbitrarily long (e.g., by interrupt handling).  Even
BE>if we ensure write ordering and no interrupts, the above has many problems.
BE>- we can't reasonably guarantee that the reads of l1 and h2 will execute
BE>  sufficiently faster than the writes of l1 and h1/h2 so that the above
BE>  will see h2 after l1.  I think the writes will usually go slightly
BE>  faster since they will go through a write buffer, provided the 2 halves
BE>  are in a single cache line, but this is unclear.  SMP with different
BE>  CPU frequencies is not really supported, but works now modulo timecounter
BE>  problems, and we probably want to support completely independent CPU
BE>  frequencies, with some CPUs throttled.
BE>- I don't understand why the above compares the high values.  Comparing the
BE>  low values seems to work better.
BE>- I can't see how to fix up the second method to be useful.  It is faster,
BE>  but some delays seem to be necessary and they might as well be partly
BE>  in a slower method.
BE>
BE>Another try:
BE>- enforce write ordering in writers and read ordering in the reader
BE>- make sure that the reader runs fast enough.  This might require using
BE>  critical_enter() in the reader.
BE>- then many cases work with no complications:
BE>  (a) if l1 is not small and not large, then h1 must be associated with l1,
BE>      since then the low value can't roll over while we are looking
BE>      at the pair, so the high value can't change while we are looking.
BE>      So we can just use h1 and l1, without reading h2 or l2.
BE>  (b) similarly, if l1 is small, then h2 is associated with l1.  So we
BE>      can just use h2 with l1, without reading l2.
BE>- otherwise, l1 is large:
BE>  (c) if l1 = l2, then h1 must be associated with l1, since some writes
BE>      of the high value associated with writing not-so-large low values
BE>      have had plenty of time to complete (in fact, the one for (l1-2)
BE>      or (l2-1) must have completed, and "large" only needs to be 1
BE>      or 2 to ensure that these values don't wrap.  E.g., if l1 is
BE>      0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
BE>      wrapped recently so h1 hasn't incremented recently.  So we can
BE>      use h1 with l1, after reading l2 (still need to read h2, in case
BE>      we don't get here).
BE>  (d) if l1 != l2, then ordering implies that the write of the high value
BE>      associated with l1 has completed.  We might have missed reading
BE>      this value, since we might have read h1 too early and might have
BE>      read h2 too late, but in the usual case h1 == h2 and then both
BE>      h's are associated with l1, while if h1 != h2 then we can loop
BE>      again and surely find h1 == h2 (and l1 small, so case (c)), or
BE>      we can use the second method.  We had to read all 4 values to
BE>      determine what to do here, and can usually use 2 of them directly.
BE>
BE>(1) Write ordering is guaranteed on amd64 but I think not on all arches.
BE>    You could pessimize PCPU_INC() with memory barriers on some arches to
BE>    get write ordering.
BE>(2) You could pessimize PCPU_INC() with critical_enter() to mostly prevent
BE>    this.  You cannot prevent the second write from being delayed for
BE>    arbitrarily long by a trap and its handling, except by breaking
BE>    at least the debugger traps needed to debug this.
BE>
BE>In my previous^2 reply, I said heavyweight synchronization combined
BE>with extra atomic generation counters would work.  The heavyweight
BE>synchronization would have to be heavier than I thought -- it would
BE>have to wait for all other CPUs to complete the pairs of writes for
BE>64-bit counters, if any, and for this it would have to do more than
BE>IPI's -- it should change priorities and reschedule to ensure that the
BE>half-writes (if any) have a chance of completing soon...  Far too
BE>complicated for this.  Disabling interrupts for the non-atomic PCPU_INC()s
BE>is probably best.  Duh, this or worse (locking) is required on the
BE>writer side anyway, else increments won't be atomic.  Locking would
BE>actually automatically give the rescheduling stuff for the heavyweight
BE>synchronization -- you would acquire the lock in the reader and of
BE>course in the writers, and get priority propagation to complete the
BE>one writer allowed to hold the lock iff any is holding it.  Locking
BE>might not be too bad for a few 64-bit counters.  So I've changed my
BE>mind yet again and prefer locking to critical_enter().  It's cleaner
BE>and works for traps.  I just remembered that rwatson went the opposite
BE>way and changed some locking to critical_enter() in UMA.  I prefer the
BE>old way, and at least in old versions of FreeBSD I got panics trying
BE>to debug near this (single-stepping malloc()?).
BE>
BE>In my previous^1 reply, I said lighter weight synchronization combined
BE>with no extra or atomic counters (use counters as their own generation
BE>counter) would work.  But the synchronization still needs to be heavy,
BE>or interrupts disabled, as above.
BE>
BE>Everything must be read before and after to test for getting a coherent
BE>set of values, so the loop in the first method has the minimal number
BE>of reads (for a single 64-bit counter).  With sync, the order for each
BE>pair in it doesn't matter on either the reader or writer (there must be
BE>a sync or 2 instead).

To be honest, I'm lost now. Couldn't we just use the largest atomic type 
for the given platform and atomic_inc/atomic_add/atomic_fetch, and handle 
the 32->64 bit stuff (for IA32) as I do it in bsnmp, but in the kernel?

Are the 5-6 atomic operations really that costly given the many operations 
done on an IP packet? Are they more costly than a heavyweight sync for 
each ++ or +=?

Or we could use the PCPU stuff, use just ++ and += for modifying the 
statistics (32bit) and do the 32->64 bit stuff for all platforms with a 
kernel thread per CPU (do we have this?). Between that thread and the 
sysctl we could use a heavy sync.

Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the 
largest atomic type for the platform, handle the aggregation and (on IA32) 
the 32->64 bit stuff in a kernel thread.

Using 32 bit stats may fail if you put several 10GBit/s adapters into a 
machine and do routing at link speed, though. This might overflow the IP 
input/output byte counter (which we don't have yet) too fast: at 10 GBit/s 
a 32-bit byte counter wraps roughly every 3.4 seconds (2^32 bytes at 
1.25 GByte/s).


More information about the freebsd-arch mailing list