network statistics in SMP

Sat Dec 19 15:30:13 UTC 2009

On Sat, 19 Dec 2009, Hans Petter Selasky wrote:

> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
>> On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
>>> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
>>>> I see. I was also thinking along these lines, but was not sure whether
>>>> it is worth the trouble. I suppose this does not help to implement
>>>> 64-bit counters on 32-bit architectures, though, because you cannot
>>>> read them reliably without locking to sum them up, right?
>>>
>>> Either that or you just accept that you have a small race since it is
>>> only stats. :)
>>
>> This might be stupid, but can we not easily *read* 64bit counters
>> on 32bit machines like this:
>>
>> do {
>>     h1 = read_upper_32bits;
>>     l1 = read_lower_32bits;
>>     h2 = read_upper_32bits;
>>     l2 = read_lower_32bits; /* not needed */
>> } while (h1 != h2);

No.  See my previous^N reply, but don't see it about this since it was
wrong about this for all N :-).

> Just a comment. You know you don't need a while loop to get a stable value?
> Should be implemented like this, in my opinion:
> 
> h1 = read_upper_32bits;
> l1 = read_lower_32bits;
> h2 = read_upper_32bits;
>
> if (h1 != h2)
> 	l1 = 0xffffffffUL;
>
> sum64 = (h1<<32) | l1;

Also wrong :-).  Apart from write ordering problems (1), the write of
the second half (presumably the top half) might not have completed 
when the above looks at it.  Then both of the above will see:
     h1 = old value
     l1 = new value
     h2 = old value (since the new one has not been written yet).
The race window for this can be arbitrarily long, since the second write
can be delayed for arbitrarily long (e.g., by interrupt handling).  Even
if we ensure write ordering and no interrupts, the above has many problems.
- we can't reasonably guarantee that the reads of l1 and h2 will execute
   sufficiently faster than the writes of l1 and h1/h2 so that the above
   will see h2 after l1.  I think the writes will usually go slightly
   faster since they will go through a write buffer, provided the 2 halves
   are in a single cache line, but this is unclear.  SMP with different
   CPU frequencies is not really supported, but works now modulo timecounter
   problems, and we probably want to support completely independent CPU
   frequencies, with some CPUs throttled.
- I don't understand why the above compares the high values.  Comparing the
   low values seems to work better.
- I can't see how to fix up the second method to be useful.  It is faster,
   but some delays seem to be necessary and they might as well be partly
   in a slower method.

Another try:
- enforce write ordering in writers and read ordering in the reader
- make sure that the reader runs fast enough.  This might require using
   critical_enter() in the reader.
- then many cases work with no complications:
   (a) if l1 is not small and not large, then h1 must be associated with l1,
       since then the low value can't roll over while we are looking
       at the pair, so the high value can't change while we are looking.
       So we can just use h1 and l1, without reading h2 or l2.
   (b) similarly, if l1 is small, then h2 is associated with l1.  So we
       can just use h2 with l1, without reading l2.
- otherwise, l1 is large:
   (c) if l1 = l2, then h1 must be associated with l1, since some writes
       of the high value associated with writing not-so-large low values
       have had plenty of time to complete (in fact, the one for (l1-2)
       or (l2-1) must have completed, and "large" only needs to be 1
       or 2 to ensure that these values don't wrap.  E.g., if l1 is
       0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
       wrapped recently so h1 hasn't incremented recently.  So we can
       use h1 with l1, after reading l2 (still need to read h2, in case
       we don't get here).
   (d) if l1 != l2, then ordering implies that the write of the high value
       associated with l1 has completed.  We might have missed reading
       this value, since we might have read h1 too early and might have
       read h2 too late, but in the usual case h1 == h2 and then both
       h's are associated with l1, while if h1 != h2 then we can loop
       again and surely find h1 == h2 (and l1 small, so case (c)), or
       we can use the second method.  We had to read all 4 values to
       determine what to do here, and can usually use 2 of them directly.

(1) Write ordering is guaranteed on amd64 but I think not on all arches.
     You could pessimize PCPU_INC() with memory barriers on some arches to
     get write ordering.

(2) You could pessimize PCPU_INC() with critical_enter() to mostly prevent
     this.  You cannot prevent the second write from being delayed for
     arbitrarily long by a trap and its handling, except by breaking
     at least the debugger traps needed to debug this.

In my previous^2 reply, I said heavyweight synchronization combined
with extra atomic generation counters would work.  The heavyweight
synchronization would have to be heavier than I thought -- it would
have to wait for all other CPUs to complete the pairs of writes for
64-bit counters, if any, and for this it would have to do more than
IPI's -- it should change priorities and reschedule to ensure that the
half-writes (if any) have a chance of completing soon...  Far too
complicated for this.  Disabling interrupts for the non-atomic PCPU_INC()s
is probably best.  Duh, this or worse (locking) is required on the
writer side anyway, else increments in won't be atomic.  Locking would
actually automatically give the rescheduling stuff for the heavyweight
synchronizaton -- you would acquire the lock in the reader and of
course in the writers, and get priority propagation to complete the
one writer allowed to hold the lock iff any is holding it.  Locking
might not be too bad for a few 64-bit counters.  So I've changed my
mind yet again and prefer locking to critical_enter().  It's cleaner
and works for traps.  I just remembered that rwatson went the opposite
way and changed some locking to critical_enter() in UMA.  I prefer the
old way, and at least in old versions of FreeBSD I got panics trying
to debug near this (single-stepping malloc()?).

In my previous^1 reply, I said lighter weight synchronizion combined
with no extra or atomic counters (use counters as their own generation
counter) would work.  But the synchronization still needs to be heavy,
or interrupts disabled, as above.

Everything must be read before and after to test for getting a coherent
set of values, so the loop in the first method has the minimal number
of reads (for a single 64-bit counter).  With sync, the order for each
pair in it doesn't matter on either the reader or writer (there must be
a sync or 2 instead).

Bruce