network statistics in SMP
Harti Brandt
hartmut.brandt at dlr.de
Sat Dec 19 16:02:29 UTC 2009
On Sun, 20 Dec 2009, Bruce Evans wrote:
BE>On Sat, 19 Dec 2009, Hans Petter Selasky wrote:
BE>
BE>> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
BE>> > On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
BE>> > > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
BE>> > > > I see. I was also thinking along these lines, but was not sure
BE>> > > > whether
BE>> > > > it is worth the trouble. I suppose this does not help to implement
BE>> > > > 64-bit counters on 32-bit architectures, though, because you cannot
BE>> > > > read them reliably without locking to sum them up, right?
BE>> > >
BE>> > > Either that or you just accept that you have a small race since it is
BE>> > > only stats. :)
BE>> >
BE>> > This might be stupid, but can we not easily *read* 64bit counters
BE>> > on 32bit machines like this:
BE>> >
BE>> > do {
BE>> > h1 = read_upper_32bits;
BE>> > l1 = read_lower_32bits;
BE>> > h2 = read_upper_32bits;
BE>> > l2 = read_lower_32bits; /* not needed */
BE>> > } while (h1 != h2);
BE>
BE>No. See my previous^N reply, but don't see it about this since it was
BE>wrong about this for all N :-).
BE>
BE>> Just a comment. You know you don't need a while loop to get a stable value?
BE>> Should be implemented like this, in my opinion:
BE>>
BE>> h1 = read_upper_32bits;
BE>> l1 = read_lower_32bits;
BE>> h2 = read_upper_32bits;
BE>>
BE>> if (h1 != h2)
BE>> l1 = 0xffffffffUL;
BE>>
BE>> sum64 = (h1<<32) | l1;
BE>
BE>Also wrong :-). Apart from write ordering problems (1), the write of
BE>the second half (presumably the top half) might not have completed when the
BE>above looks at it. Then both of the above will see:
BE> h1 = old value
BE> l1 = new value
BE> h2 = old value (since the new one has not been written yet).
BE>The race window for this can be arbitrarily long, since the second write
BE>can be delayed for arbitrarily long (e.g., by interrupt handling). Even
BE>if we ensure write ordering and no interrupts, the above has many problems.
BE>- we can't reasonably guarantee that the reads of l1 and h2 will execute
BE> sufficiently faster than the writes of l1 and h1/h2 so that the above
BE> will see h2 after l1. I think the writes will usually go slightly
BE> faster since they will go through a write buffer, provided the 2 halves
BE> are in a single cache line, but this is unclear. SMP with different
BE> CPU frequencies is not really supported, but works now modulo timecounter
BE> problems, and we probably want to support completely independent CPU
BE> frequencies, with some CPUs throttled.
BE>- I don't understand why the above compares the high values. Comparing the
BE> low values seems to work better.
BE>- I can't see how to fix up the second method to be useful. It is faster,
BE> but some delays seem to be necessary and they might as well be partly
BE> in a slower method.
BE>
BE>Another try:
BE>- enforce write ordering in writers and read ordering in the reader
BE>- make sure that the reader runs fast enough. This might require using
BE> critical_enter() in the reader.
BE>- then many cases work with no complications:
BE> (a) if l1 is not small and not large, then h1 must be associated with l1,
BE> since then the low value can't roll over while we are looking
BE> at the pair, so the high value can't change while we are looking.
BE> So we can just use h1 and l1, without reading h2 or l2.
BE> (b) similarly, if l1 is small, then h2 is associated with l1. So we
BE> can just use h2 with l1, without reading l2.
BE>- otherwise, l1 is large:
BE> (c) if l1 = l2, then h1 must be associated with l1, since some writes
BE> of the high value associated with writing not-so-large low values
BE> have had plenty of time to complete (in fact, the one for (l1-2)
BE> or (l2-1) must have completed, and "large" only needs to be 1
BE> or 2 to ensure that these values don't wrap. E.g., if l1 is
BE> 0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
BE> wrapped recently so h1 hasn't incremented recently. So we can
BE> use h1 with l1, after reading l2 (still need to read h2, in case
BE> we don't get here).
BE> (d) if l1 != l2, then ordering implies that the write of the high value
BE> associated with l1 has completed. We might have missed reading
BE> this value, since we might have read h1 too early and might have
BE> read h2 too late, but in the usual case h1 == h2 and then both
BE> h's are associated with l1, while if h1 != h2 then we can loop
BE> again and surely find h1 == h2 (and l1 small, so case (c)), or
BE> we can use the second method. We had to read all 4 values to
BE> determine what to do here, and can usually use 2 of them directly.
BE>
BE>(1) Write ordering is guaranteed on amd64 but I think not on all arches.
BE> You could pessimize PCPU_INC() with memory barriers on some arches to
BE> get write ordering.
BE>
BE>(2) You could pessimize PCPU_INC() with critical_enter() to mostly prevent
BE> this. You cannot prevent the second write from being delayed for
BE> arbitrarily long by a trap and its handling, except by breaking
BE> at least the debugger traps needed to debug this.
BE>
BE>In my previous^2 reply, I said heavyweight synchronization combined
BE>with extra atomic generation counters would work. The heavyweight
BE>synchronization would have to be heavier than I thought -- it would
BE>have to wait for all other CPUs to complete the pairs of writes for
BE>64-bit counters, if any, and for this it would have to do more than
BE>IPI's -- it should change priorities and reschedule to ensure that the
BE>half-writes (if any) have a chance of completing soon... Far too
BE>complicated for this. Disabling interrupts for the non-atomic PCPU_INC()s
BE>is probably best. Duh, this or worse (locking) is required on the
BE>writer side anyway, else increments won't be atomic. Locking would
BE>actually automatically give the rescheduling stuff for the heavyweight
BE>synchronization -- you would acquire the lock in the reader and of
BE>course in the writers, and get priority propagation to complete the
BE>one writer allowed to hold the lock iff any is holding it. Locking
BE>might not be too bad for a few 64-bit counters. So I've changed my
BE>mind yet again and prefer locking to critical_enter(). It's cleaner
BE>and works for traps. I just remembered that rwatson went the opposite
BE>way and changed some locking to critical_enter() in UMA. I prefer the
BE>old way, and at least in old versions of FreeBSD I got panics trying
BE>to debug near this (single-stepping malloc()?).
BE>
BE>In my previous^1 reply, I said lighter-weight synchronization combined
BE>with no extra or atomic counters (use counters as their own generation
BE>counter) would work. But the synchronization still needs to be heavy,
BE>or interrupts disabled, as above.
BE>
BE>Everything must be read before and after to test for getting a coherent
BE>set of values, so the loop in the first method has the minimal number
BE>of reads (for a single 64-bit counter). With sync, the order for each
BE>pair in it doesn't matter on either the reader or writer (there must be
BE>a sync or 2 instead).
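[A minimal userland sketch of the reader cases (a)-(d) quoted above. The struct, the function name, and the exact small/large thresholds are illustrative; it assumes write ordering in the writer and that the reader is not preempted, e.g. via critical_enter() in the kernel.]

```c
#include <stdint.h>

/* Hypothetical split 64-bit counter: the writer bumps lo first,
 * then hi on carry (write ordering assumed). */
struct split64 {
	volatile uint32_t lo;
	volatile uint32_t hi;
};

#define SMALL	2U			/* low word wrapped very recently */
#define LARGE	(0xffffffffU - 2U)	/* low word about to wrap */

/* Reader implementing cases (a)-(d); assumes the reads are not
 * reordered and the reader runs fast enough. */
static uint64_t
read_split64(struct split64 *c)
{
	uint32_t h1, l1, h2, l2;

	for (;;) {
		h1 = c->hi;
		l1 = c->lo;
		h2 = c->hi;
		if (l1 > SMALL && l1 < LARGE)	/* (a): h1 matches l1 */
			return ((uint64_t)h1 << 32 | l1);
		if (l1 <= SMALL)		/* (b): h2 matches l1 */
			return ((uint64_t)h2 << 32 | l1);
		l2 = c->lo;
		if (l1 == l2 || h1 == h2)	/* (c)/(d): h1 matches l1 */
			return ((uint64_t)h1 << 32 | l1);
		/* h1 != h2 and l1 != l2: retry */
	}
}
```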
To be honest, I'm lost now. Couldn't we just use the largest atomic type
for the given platform with atomic_inc/atomic_add/atomic_fetch, and handle
the 32->64-bit stuff (for IA32) as I do in bsnmp, but in a kernel
thread?
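[A rough stand-in for that idea, using C11 atomics in place of the kernel's atomic(9) operations. The names stat32/stat64/stat_add/stat_fold are made up here; in the kernel the fold would run in the proposed thread.]

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t stat32;	/* fast-path 32-bit counter */
static uint64_t stat64;		/* wide value kept by the fold thread */

/* Fast path: one atomic add per statistics update. */
static void
stat_add(uint32_t n)
{
	atomic_fetch_add_explicit(&stat32, n, memory_order_relaxed);
}

/* Periodic 32->64-bit fold, as bsnmp does in userland: drain the
 * 32-bit counter and accumulate into the wide value. */
static void
stat_fold(void)
{
	stat64 += atomic_exchange_explicit(&stat32, 0,
	    memory_order_relaxed);
}
```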
Are the 5-6 atomic operations really that costly given the many operations
done on an IP packet? Are they more costly than a heavyweight sync for
each ++ or +=?
Or we could use the PCPU stuff, use just ++ and += for modifying the
statistics (32-bit), and do the 32->64-bit stuff for all platforms with a
kernel thread per CPU (do we have this?). Between that thread and the
sysctl we could use a heavy sync.
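[A sketch of that per-CPU variant. NCPU, the array names, and driving the fold from a thread are assumptions; the modular 32-bit subtraction makes a single wrap between folds harmless, provided the fold runs often enough.]

```c
#include <stdint.h>

#define NCPU	4			/* illustrative CPU count */

static uint32_t pcpu_cnt[NCPU];		/* bumped with plain ++ / += */
static uint32_t last_seen[NCPU];	/* snapshot at the previous fold */
static uint64_t total64;		/* 64-bit aggregate */

/* Fold each per-CPU 32-bit counter into the 64-bit total; this
 * would run in the per-CPU kernel thread, with a heavy sync
 * against the sysctl reader. */
static void
fold_counters(void)
{
	uint32_t now;
	int i;

	for (i = 0; i < NCPU; i++) {
		now = pcpu_cnt[i];
		/* Modular delta copes with one wrap since last fold. */
		total64 += (uint32_t)(now - last_seen[i]);
		last_seen[i] = now;
	}
}
```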
Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
largest atomic type for the platform, handle the aggregation and (on IA32)
the 32->64 bit stuff in a kernel thread.
Using 32-bit stats may fail if you put several 10 Gbit/s adapters into a
machine and do routing at link speed, though. This might overflow the IP
input/output byte counter (which we don't have yet) too fast.
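[For scale, a back-of-envelope helper (the 10 Gbit/s figure is the example rate from above): a 32-bit byte counter at full 10 Gbit/s wraps in roughly 2^32 / (10^10 / 8) ~ 3.4 seconds, and several such adapters wrap it proportionally faster.]

```c
/* Seconds until a 32-bit byte counter wraps at a given line rate
 * in bits per second. */
static double
wrap_seconds(double bits_per_sec)
{
	return (4294967296.0 / (bits_per_sec / 8.0));
}
/* wrap_seconds(10e9) is about 3.4 seconds */
```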
harti
More information about the freebsd-arch
mailing list