svn commit: r300154 - head/sys/net

Bruce Evans brde at optusnet.com.au
Wed May 18 20:04:08 UTC 2016


On Wed, 18 May 2016, Ian Lepore wrote:

> On Wed, 2016-05-18 at 17:35 +0000, Bjoern A. Zeeb wrote:
>>> On 18 May 2016, at 17:32 , Ian Lepore <ian at freebsd.org> wrote:
>>>
>>> On Wed, 2016-05-18 at 10:14 -0700, Nathan Whitehorn wrote:
>>> ...
>>> It may be more complicated than that, though.  armv6 can do 64-bit
>>> atomics even tho it's 32-bit.  armv4, also 32-bit, can do 64-bit
>>> atomics in the kernel but not in userland.
>>>
>>> Maybe machine/atomic.h needs a #define that says whether 64-bit ops are
>>> available in the current compilation unit.  (And likewise for other bit
>>> sizes if we have arches that have other limitations.)
>>
>> Question because I didn't follow the details, but how was this solved
>> for the COUNTERS framework?

Using special code that pessimizes old machines on 32-bit arches especially.

For example, incrementing a 32-bit network counter used to take 1 inlined
counter++ statement
     (as little as 1 instruction on i386, but it is a read-modify-write
     instruction and thus no faster than separate instructions, and if
     the counter is shared this is almost as slow as a locked instruction
     in some cases).
This now takes an if_inc_counter() function call which takes 33 instructions
altogether on i386 with certain nonstandard, not very optimal CFLAGS, and 9
instructions on amd64.
     (COUNTER64() code is inlined, and the function call is a separate
     pessimization.  It costs about half of the 9 instructions on amd64
     and its instructions are relatively heavyweight.)
This is when i386 has cmpxchg8b.  The single cmpxchg8b instruction is
heavier weight than a 32-bit memory increment and using it takes lots
of control logic.
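
A minimal userland sketch of the difference in shape between the two
increments, assuming C11 atomics as a stand-in for the kernel's inline
assembly.  The names inc32_sketch() and add64_cas_sketch() are made up
for illustration and are not the kernel's code; the point is only that
a compare-and-swap update needs a load, a computed new value and a retry
loop, where the old counter needed one statement.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Old style: a single read-modify-write on a 32-bit counter. */
static inline void
inc32_sketch(uint32_t *c)
{
	(*c)++;
}

/*
 * Hypothetical stand-in for a cmpxchg8b-style 64-bit add: read the old
 * value, compute the new one, and retry the compare-and-swap until no
 * one else changed the counter in between.
 */
static inline void
add64_cas_sketch(_Atomic uint64_t *c, uint64_t inc)
{
	uint64_t old = atomic_load_explicit(c, memory_order_relaxed);

	/* On failure, 'old' is refreshed with the current value. */
	while (!atomic_compare_exchange_weak_explicit(c, &old, old + inc,
	    memory_order_relaxed, memory_order_relaxed))
		;
}
```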

> iirc, each platform implements counters its own way, probably the wrong
> way on all of them except x86.

I think other arches just use compatibility code which uses critical
sections.  This is not so bad.  It might be faster than using cmpxchg8b
depending on how fast critical_enter() and critical_exit() are.
Unfortunately, they are not very fast.  They are functions too, and
on i386 critical_enter() takes 20+ instructions.  critical_exit() takes
more, and debugging is broken and caused a panic when I tried to
trace through critical_exit().  That is so slow that hard-disabling
interrupts is probably faster.
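
The critical-section approach can be sketched roughly as below.  This is
a userland illustration only: critical_enter_stub() and
critical_exit_stub() are hypothetical placeholders for the kernel's
critical_enter() and critical_exit(), reduced to a nesting count, and
counter64_add_crit_sketch() is an invented name.  The 64-bit add itself
is trivial; the cost discussed above is in the two calls around it.

```c
#include <stdint.h>

/* Hypothetical stand-ins: the real functions do much more than this. */
static int crit_nest_sketch;

static void
critical_enter_stub(void)
{
	crit_nest_sketch++;
}

static void
critical_exit_stub(void)
{
	crit_nest_sketch--;
}

/*
 * 64-bit add protected by a critical section instead of an atomic
 * instruction.  Assumes 'c' points at this CPU's private copy.
 */
static void
counter64_add_crit_sketch(uint64_t *c, uint64_t inc)
{
	critical_enter_stub();
	*c += inc;		/* may be two 32-bit ops on a 32-bit arch */
	critical_exit_stub();
}
```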

Network drivers were mostly written under the assumption that they are
running on a UP system and incrementing a counter is inline and fast,
so they increment counters without worrying about the overhead for
this.  33 instructions for 2 if_inc_counter()s per packet is about a
1% pessimization for bge on my slow hardware.

> For some crazy reason the docs for COUNTERS say that it does not use
> atomics.  I have no idea why the docs for an API are dictating
> implementation, but I suspect it's because atomics are more expensive
> on x86 than other alternatives.  So the arm code slavishly avoids using
> atomics in COUNTERS even though doing so would be more efficient than
> the current copied-from-x86 code.

Other places just hard-code use of PCPU although that is also more
complicated and uglier than counter++.  Counters in PCPU only exist
because full atomics are probably slower on all arches and much slower
on most arches.  Most 64-bit PCPU accesses on 32-bit arches are broken
since they are not atomic even for 1 CPU.  COUNTER64() is more careful
to a fault.  arm PCPU_INC() seems to be broken even for 32-bit accesses.
In the non-SMP case, it just does pcpu->pc_ ## member++ and in the SMP
case it does the same with a register pointer instead of a global.
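
To illustrate why a plain member++ can lose updates on a load/store
architecture even with one CPU, here is a small simulation.  It assumes
the increment compiles to separate load, add and store steps, and uses a
hypothetical hook (the intr argument) to stand in for an interrupt
handler that touches the same counter between the load and the store.
All names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated interrupt handler that bumps the same counter. */
static void
intr_bump_sketch(uint32_t *c)
{
	(*c)++;
}

/*
 * counter++ spelled out as load / add / store, with an optional
 * "interrupt" delivered after the load.
 */
static void
rmw_inc_sketch(uint32_t *c, void (*intr)(uint32_t *))
{
	uint32_t tmp = *c;	/* load */

	if (intr != NULL)
		intr(c);	/* handler increments *c here */
	tmp++;			/* add, on the stale value */
	*c = tmp;		/* store overwrites the handler's update */
}
```

Starting from 5, a call with intr_bump_sketch as the hook leaves the
counter at 6, not 7: the handler's increment is lost.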

I wrote some alternative x86 implementations that are at least 20%
faster than the cmpxchg8b method, but the best method is clearly to
use only 32-bit low-level counters and add up the counters in a daemon.
The daemon shouldn't run very often.  There aren't many counters except
i/o byte counters that want to wrap in 32 bits more than once per hour.
Even 100 Gbps ethernet can only do 150 Mpps so it takes almost 30
seconds to wrap.
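
A minimal sketch of the 32-bit-counters-plus-daemon idea, under stated
assumptions: a fixed hypothetical CPU count (NCPU_SKETCH), one 32-bit
slot per CPU on the hot path, and a harvest routine that the daemon
would call periodically.  None of these names exist in the tree.  The
harvest is correct only if each slot wraps less than once between calls;
at 150 Mpps that means an interval comfortably under 2^32 / 150e6, or
about 28.6 seconds.

```c
#include <assert.h>
#include <stdint.h>

#define	NCPU_SKETCH	4	/* hypothetical fixed CPU count */

struct counter_sketch {
	uint32_t fast[NCPU_SKETCH];	/* hot path: per-CPU, 32 bits */
	uint32_t last[NCPU_SKETCH];	/* value seen at last harvest */
	uint64_t total;			/* wide total, daemon only */
};

/* Hot path: back to a single 32-bit increment. */
static inline void
counter_sketch_inc(struct counter_sketch *c, int cpu)
{
	assert(cpu >= 0 && cpu < NCPU_SKETCH);
	c->fast[cpu]++;
}

/* Daemon side: fold the 32-bit deltas into the 64-bit total. */
static void
counter_sketch_harvest(struct counter_sketch *c)
{
	for (int i = 0; i < NCPU_SKETCH; i++) {
		uint32_t now = c->fast[i];

		/* Unsigned subtraction is modulo 2^32, so one wrap is OK. */
		c->total += (uint32_t)(now - c->last[i]);
		c->last[i] = now;
	}
}
```

Readers that want a 64-bit value would look only at total, accepting
that it lags the fast slots by up to one harvest interval.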

Bruce

