svn commit: r252032 - head/sys/amd64/include
Gleb Smirnoff
glebius at FreeBSD.org
Fri Jun 21 13:54:37 UTC 2013
Bruce,
On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote:
B> Not if it is a 32-bit increment on 32-bit systems, as it should be.
B>
B> I said to use a daemon to convert small (16 or 32 bit) counters into
B> larger (32 or 64 bit) ones. It is almost as efficient to call the
B> accumulation code directly:
B>
B> 	addl	%1,%fs:(%0)
B> 	jns	1f
B> 	call	counter_accum
B> 1:
B>
B> This calls the accumulation code when the counter goes above 0x80000000.
B> The accumulation code then uses cmpxchg8b or a critical section to
B> atomically transfer everything in the small counter to the large
B> counter. Note that the small counter remains valid at all times unless
B> it is incremented too fast. Code that reports the counters must also
B> use cmpxchg8b or a critical section to read them atomically before
B> summing them. The transfer window between when the small counter
B> reaches 0x80000000 and when it overflows is very large provided large
B> increments are never made (2 increments of 0x80000000 would break it
B> immediately, but an average increment of 1 takes about a minute to
B> reach 0x80000000 even if called from a loop doing nothing except the
B> increment). Only 32-bit increments that are not too large are
B> supported. Negative increments are handled by a heuristic in the
B> accumulator. E.g., an increment of -1 just works unless it causes the
B> small counter's value to go from 0 to 0xffffffff. 0xffffffff is
B> detected by the sign test. It must be interpreted as -1 to be added
B> to the large counter, not 0xffffffff. The full heuristic would be
B> something like:
B>
B> 	if (small_counter >= 0xf0000000) {
B> 		/* Subtract from large counter. */
B> 		/* Actually do this atomically, of course. */
B> 		large_counter += (int32_t)small_counter;
B> 		small_counter = 0;
B> 		/*
B> 		 * Add here sanity checks that the large counter never goes
B> 		 * negative. This can only be afforded because this code
B> 		 * is rarely called (and is not inline).
B> 		 */
B> 	} else if (small_counter < 0x90000000) {
B> 		/* Add to large counter. */
B> 		large_counter += small_counter;
B> 		small_counter = 0;
B> 	} else {
B> 		panic("transfer window too small");
B> 	}
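The quoted heuristic can be sketched as plain userland C (illustrative only: `counter_add` and `counter_accum` are stand-in names, `assert()` substitutes for panic(), and nothing here is atomic, whereas the real transfer would run under cmpxchg8b or a critical section):

```c
#include <assert.h>
#include <stdint.h>

/* Non-atomic userland sketch of the small->large transfer scheme. */
static uint32_t small_counter;
static int64_t large_counter;

static void
counter_accum(void)
{
	if (small_counter >= 0xf0000000U) {
		/* Value is really an accumulated negative increment. */
		large_counter += (int32_t)small_counter;
		small_counter = 0;
	} else if (small_counter < 0x90000000U) {
		/* Normal case: transfer the positive value. */
		large_counter += small_counter;
		small_counter = 0;
	} else {
		assert(!"transfer window too small");
	}
}

/*
 * Increment path: add to the small counter and call the accumulator
 * once the sign bit is set -- the "jns 1f" test in the asm above.
 */
static void
counter_add(uint32_t inc)
{
	small_counter += inc;
	if ((int32_t)small_counter < 0)
		counter_accum();
}
```

A -1 increment arrives as 0xffffffff, trips the sign test, and lands in the first branch, where the cast recovers the intended subtraction.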
B>
B> > Entering critical section means modifying curthread, which is again
B> > a %gs based load & store. Exiting critical section has the same cost.
B> > Thus, we assume that doing a direct %gs based update on the counter
B> > is cheaper than critical_enter(); counter += foo; critical_exit();
B>
B> Critical sections actually only modify td->td_critnest, so they do
B> a single %fs-based load and 2 %ds-based modifications (it's %fs for
B> i386 and %gs for amd64). So it's far from clear that the direct
B> update is faster than the generic i386 code. Avoiding the generic
B> i386 code also takes an extra load, test and branch. The generic
B> code reduces to something like the following instructions after inlining
B> the critical section:
B>
B> 	movl	%gs:CURTHREAD,%ebx
B> 	incl	TD_CRITNEST(%ebx)
B> 	movl	counter_arg,%edi
B> 	movl	inc_lo,%eax
B> 	movl	inc_hi,%edx
B> 	addl	%eax,%fs:(%edi)
B> 	adcl	%edx,%fs:4(%edi)
B> 1:
B> 	decl	TD_CRITNEST(%ebx)
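In C terms, that generic path amounts to the following (userland sketch: `td_critnest` is a plain global standing in for curthread->td_critnest, where the kernel's critical_enter()/critical_exit() do the real per-thread increment and decrement):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for curthread->td_critnest; illustrative only. */
static int td_critnest;

static void
counter_u64_add_generic(uint64_t *counter, uint64_t inc)
{
	td_critnest++;		/* critical_enter(): block preemption */
	*counter += inc;	/* the addl/adcl pair on the per-CPU slot */
	td_critnest--;		/* critical_exit() */
}
```

The critical section only has to prevent preemption between the two 32-bit halves of the add; no bus locking is involved.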
B>
B> while the non-generic code for the best case when cmpxchg8b is supported is
B> something like:
B>
B> 	testl	$CPUID_CX8,cpu_feature
B> 	je	2f
B> 	movl	counter_arg,%esi
B> 	movl	ptr_to_inc,%edi	# might need more insns to init *ptr
B> 	// 8 instructions here from the asm string.
B> 	jmp	3f	# actually rearrange to jump in other case
B> 2:
B> 	// Use critical section.
B> 3:
B>
B>
B> So using cmpxchg8b takes at least 4 more instructions than using a
B> critical section. The incl/decl instructions on td_critnest are rather
B> slow, however. According to my test, they take 6 cycles each. (I
B> think they can be reduced to load; modify; store; ... store back
B> original value which should take more like 9 cycles.) This probably
B> applies to the addl and adcl that do the work too. 12 cycles for them,
B> and just 2 more cycles (pipelined) for cmpxchg8b and all the other
B> setup instructions.
B>
B> Now, how does the performance aspect work? Is it by cmpxchg8b causing
B> less contention than direct addl to memory? If so, then hopefully
B> plain cmpxchg does the same.
B>
B> The non-cmpxchg version of the i386 code can probably be improved by
B> adding more instructions to it, to avoid doing the adcl to memory
B> in most cases. Even if you support 64-bit increments, they will rarely
B> be used, so the branches for this will be predicted well:
B>
B> 	// Next 4 instructions as above:
B> 	movl	counter_arg,%edi
B> 	movl	inc_lo,%eax
B> 	movl	inc_hi,%edx
B> 	addl	%eax,%fs:(%edi)
B> 	jnc	1f	# usually no carry
B> 	adcl	%edx,%fs:4(%edi)
B> 	jmp	3f
B> 1:
B> 	testl	%edx,%edx
B> 	je	3f	# usually no carry and no high bits
B> 	addl	%edx,%fs:4(%edi)
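The branch-over-adcl idea rendered in C might look like this (illustrative sketch; the function name is made up, and the carry is detected with an unsigned comparison rather than the flags register):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Add a 64-bit increment (inc_lo, inc_hi) to a counter stored as two
 * 32-bit words, touching the high word only when a carry occurred or
 * the high half of the increment is nonzero -- the "jnc 1f" trick.
 */
static void
counter_add_split(uint32_t c[2], uint32_t inc_lo, uint32_t inc_hi)
{
	uint32_t old_lo = c[0];

	c[0] += inc_lo;
	if (c[0] < old_lo)		/* carry out of the low word */
		c[1] += inc_hi + 1;	/* the adcl path */
	else if (inc_hi != 0)		/* rare: genuine 64-bit increment */
		c[1] += inc_hi;
	/* Common case falls through: low word only. */
}
```

With small increments the high-word store happens roughly once per 2^32 counted units, so both branches predict well.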
B>
B> I wonder if cmpxchg8b is smart enough to do this automatically, and that is
B> the reason it is faster. It could avoid writing to the high word if the
B> high word doesn't change. Modern x86 has write buffers that help it
B> do 64-bit write accesses in 32-bit mode no slower than in 64-bit mode, provided
B> the CPU can issue store instructions fast enough. Caches do the same
B> thing for load instructions on not-so-modern x86. But often, the CPU
B> can't issue load/store instructions fast enough. I suspect that the
B> addl mem; adcl 4+mem sequence is a bad way to do a 64-bit increment, and
B> load 64 bits; modify; store 64 bits is better. cmpxchg8b can also issue
B> a 64-bit store.
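In C, the two code shapes being compared are roughly these (userland sketch; whether a given compiler emits addl/adcl to memory for the first form, or the register-pair sequence for both, depends on the compiler and flags, per the gcc/clang observations below):

```c
#include <assert.h>
#include <stdint.h>

/*
 * This form invites addl mem; adcl 4+mem on i386 -- the sequence
 * measured as the slower of the two on Athlon64.
 */
static void
inc_in_memory(volatile uint64_t *c, uint64_t v)
{
	*c += v;
}

/*
 * Routing the value through a local encourages load 64 bits; add in
 * registers; single 64-bit store -- the shape measured ~25% faster.
 */
static void
inc_via_registers(volatile uint64_t *c, uint64_t v)
{
	uint64_t tmp;

	tmp = *c;
	tmp += v;
	*c = tmp;
}
```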
B>
B> Testing gave the following results:
B> - jumping over adcl as above saves almost the full cost of an adcl
B> - load 64-bits, unconditional 64-bit add from memory to registers; store-64
B> bits is about 25% faster than addl; adcl to memory. Testing now on
B> Athlon64. addl and adcl to memory each take about 7 cycles; 14 for
B> both; only 11 for doing the 64-bit add via registers. Gcc is not
B> smart about this. It generates addl; adcl to memory. Your asm has
B> to do the increment in registers for technical reasons, so it does
B> better than gcc almost (?) accidentally. But clang is smart about
B> this. It generates the larger code that goes through registers even
B> with -mi386 where I doubt that it is better (original i386 had no
B> cache and slow I-fetch, so I think it strongly preferred less
B> instructions). Your code for the critical section case doesn't use
B> asm, so it is slower than necessary with gcc. When you found that
B> simple increments were slower in userland tests, was that with gcc
B> or clang?
B> - jumping over adcl in the second method didn't seem to work.
Maybe you are right and an implementation with a "daemon" would be
more efficient for i386. But a daemon isn't needed at all on 64-bit
platforms, so I doubt it is worth coding.
Frankly speaking, we no longer consider i386 a platform for building
high-performance solutions. The need for counter(9) in the TCP/IP
stack was demonstrated on modern amd64 hardware. Thus, our motivation
was to provide:
- a lossless and faster implementation for amd64;
- a lossless implementation for !amd64;
- where an arch offers an option to avoid the critical section,
  avoid it.
--
Totus tuus, Glebius.