cvs commit: src/sys/sparc64/include in_cksum.h

Bruce Evans brde at optusnet.com.au
Sun Jun 29 03:25:42 UTC 2008


On Sat, 28 Jun 2008, Christoph Mallon wrote:

> Bruce Evans wrote:
>> On Sat, 28 Jun 2008, Christoph Mallon wrote:
>> 
>>> I still think using __volatile only works by accident.  volatile for an
>>> assembler block mostly means "this asm statement has an effect, even
>>> though the register specification looks otherwise, so do not optimise this
>>> away" (i.e. no CSE, do not remove it if the result is unused, etc.).
>> 
>> Right.  Though I've never seen unnecessary __volatiles significantly
>> affecting i386 code.  This is because the code in the asms can't be
>> removed completely, and can't be moved much either.  With out of order
>> execution, the type of moves that are permitted (not across dependencies)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> are precisely the type of moves that the CPU's scheduler can do or undo
>> no matter how the compiler orders the code.
>
> I disagree. For example, look at the use of in_addword() in dev/sk/if_sk.c in
> line 2819:
>  csum1 = htons(csum & 0xffff);
>  csum2 = htons((csum >> 16) & 0xffff);
>  ipcsum = in_addword(csum1, ~csum2 & 0xffff);
>  /* checksum fixup for IP options */
>  len = hlen - sizeof(struct ip);
>  if (len > 0) {
>    return;
>  }
>
> The calculation will be executed even if the following if (len > 0) leaves 
> the function and the value of ipcsum is unused.
> If in_addword() is not marked volatile, it can be moved after the if and not
> be executed in all cases. csum1 and csum2 can be moved after the if, too.

No, volatile has no effect on whether the above calculation will be
executed, since the early return has no dependencies on the calculation.
Old versions of gcc used to handle volatile like that, but this changed
in gcc-3 or earlier.  gcc.info now says:

% The `volatile' keyword indicates that the instruction has important
% side-effects.  GCC will not delete a volatile `asm' if it is reachable.
                                                       ^^^^^^^^^^^^^^^^^^^
% (The instruction can still be deleted if GCC can prove that
% control-flow will never reach the location of the instruction.)  Note
% that even a volatile `asm' instruction can be moved relative to other
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
% code, including across jump instructions.  For example, on many targets
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Even if gcc didn't move the calculation, CPUs with out-of-order
execution might schedule it so that it is effectively never executed
(most likely by executing it in otherwise-unused pipelines while the
main pipeline returns).  This is valid for the same reasons that gcc
can move the volatile asms -- the return doesn't depend on the result
of the calculation.
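
A minimal sketch of the shape of the problem (untested; the names and the
asm constraints are made up, not the ones in if_sk.c or in_cksum.h):

#include <sys/types.h>

/* Byte swap via inline asm, in the style of the i386 asms. */
static __inline u_short
swap16(u_short x)
{
	__asm __volatile("xchgb %h0, %b0" : "+Q" (x));
	return (x);
}

u_short result;			/* stands in for a write into the mbuf */

void
fixup(u_int csum, int len)
{
	u_short csum1, csum2, ipcsum;

	csum1 = swap16(csum & 0xffff);
	csum2 = swap16((csum >> 16) & 0xffff);
	/* Stands in for in_addword(); the end-around carry is omitted. */
	ipcsum = csum1 + (~csum2 & 0xffff);
	if (len <= 0)
		return;		/* ipcsum is dead on this path */
	result = ipcsum;	/* the only use is after the early return */
}

With or without __volatile, nothing here forces the swaps and the add to
complete before the early return: the return doesn't depend on ipcsum, so
gcc may move the asms across the jump, and an out-of-order CPU may overlap
them with the return in any case.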

The above C code is fairly bad, but generates not so bad code on i386:

% 	movl	%esi, %eax
% #APP
% 	xchgb %ah, %al		# byte operations can be slow; this one is not
 				# too bad, but I wonder if rorw $8 is better
 				# (rorl $16 is already used for the corresponding
 				# 32-bit operations, where there is no xchg
 				# alternative)
% #NO_APP
% 	shrl	$16, %esi
% 	movl	%esi, %edx
% #APP
% 	xchgb %dh, %dl		# as above
% #NO_APP
% 	notl	%edx		# poor asm code -- the top 16 bits are unused
 				# except here to stall for merging them with
 				# the previous byte operation
% 	movzwl	%ax, %eax
% #APP
% 	addw %dx, %ax
% adcw $0, %ax
% #NO_APP
% 	movl	%eax, %edx

>>> On a related note: Is inline assembler really necessary here? For example 
>>> couldn't in_addword() be written as
>>> static __inline u_short
>>> in_addword(u_short const sum, u_short const b)
>>> {
>>>    u_int const t = sum + b;
>>>    return t + (t >> 16);
>>> } ?
>>> This should at least produce equally good code and because the compiler 
>>> has more knowledge about it than an assembler block, it potentially leads 
>>> to better code. I have no SPARC compiler at hand, though.
>> 
>> Last time I tried on i386, I couldn't get gcc to generate operations
>> involving carries for things like this, or the bswap instruction from
>> C code to reorder a word.  gcc-4.2 -O3 on i386 now generates for the above:
>> 
>>     movzwl    b, %eax        # starting from b and sum in memory
>>     movzwl    sum, %edx
>>     addl    %eax, %edx    # 32-bit add
>>     movl    %edx, %eax
>>     shrl    $16, %eax    # it does the shift laboriously
>>     addl    %edx, %eax
>>     movzwl    %ax, %eax    # don't really need 32-bit result
>>                 # but need something to discard the high bits
>
> If the upper 16 bits are not "looked at" then the final movzwl can be 
> optimised away. Many instructions, like add, shl and mul, can live with 
> "garbage" in the upper 16 bits.

This depends on whether the bits are "looked at".  In general on i386 and
amd64, operating on garbage in the top bits is a pessimization.  It
causes "partial register stalls" on some CPUs, starting with about the PPro.
This slowness is smaller on newer CPUs, starting with about the Athlon from
AMD and the Pentium M (?) from Intel.  But Athlons have similar stalls for
mismatched sizes in loads after stores.  OTOH, movzwl is fast starting
with about the same generation of CPUs that has partial register stalls.
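
An illustration of the stall pattern (untested, made-up example):

static __inline u_int
partial_write(u_int x, u_short y)
{
	/*
	 * The asm writes only the low 16 bits of the register holding x;
	 * the following full 32-bit use then has to merge the stale top
	 * half with the new bottom half, which is the partial register
	 * stall on PPro-class CPUs.  movzwl avoids the merge by writing
	 * the whole register.
	 */
	__asm("movw %1, %w0" : "+r" (x) : "r" (y));
	return (x + 1);
}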

> Only if a "bad" instruction, like shr or div,
> is encountered do the upper 16 bits have to be cleared.
> The current x86 implementation of in_addword() using inline assembler causes
> the compiler to add a movzwl before the return, too.

It is the compiler doing this, presumably because something needs a full
32-bit word.  Everything in the asm and C code in the inline function deals
with 16-bit words.
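
For reference, the i386 inline-asm version under discussion is roughly the
following (reconstructed from the generated code quoted earlier; the exact
constraints, and whether __volatile appears, may differ from the real
in_cksum.h):

static __inline u_short
in_addword(u_short sum, u_short b)
{
	/* 16-bit add with end-around carry; everything stays 16 bits. */
	__asm("addw %1, %0\n\tadcw $0, %0" : "+r" (sum) : "r" (b));
	return (sum);
}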

>> In non-inline asm, I would write this as:
>> 
>>     movw    sum,%ax
>>     addw    b,%ax
>>     adcw    $0,%ax
>>     movzwl    %ax,%eax
>
> You do not want to use 16-bit instructions on modern x86 processors. These
> instructions are slow. Intel states that decoding a 16-bit operation takes 6
> cycles instead of the usual 1. (Intel® 64 and IA-32 Architectures
> Optimization Reference Manual, section 2.1.2.2 Instruction PreDecode)

Yes, I do, especially on modern x86's.  Back in 1988, prefix bytes always
took longer to fetch and decode.  Back in 1997, partial register stalls
made code that was optimized for space using subregister operations run
slowly, but I think that was more for code completely unaware of the
problem (operating on bytes and then using the whole word would give the
partial register stall, and I think operating on the 2 low bytes gives a
similar stall on some CPUs).  The above is not so bad.  I would only
expect it to run slowly on very old machines where the prefixes cost
cycles and the movzwl is slow.  But on the old machines, to avoid the
prefixes you would have to use something like movzwl to expand everything
to 32 bits before operating, and the code would be slow for other reasons.
It takes API/ABI changes (use 32 bits for all interfaces) to avoid the
prefixes.
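
As a purely hypothetical sketch of what such an interface change would look
like (not a proposal; all callers would have to change too):

static __inline u_int
in_addword32(u_int sum, u_int b)
{
	/*
	 * With u_int arguments and result there are no 16-bit operations
	 * and so no operand size prefixes.  Both arguments are assumed to
	 * be already reduced to 16 bits; the carry is folded back in with
	 * plain 32-bit arithmetic.
	 */
	u_int t = sum + b;

	return ((t + (t >> 16)) & 0xffff);
}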

On newer x86's with better pipelines, prefix bytes are almost free.  E.g.,
the Athlon64 optimization manual says to use operand size prefixes for
fill bytes in many cases, because this takes fewer resources than nops.
gcc and/or gas implements this.

I don't believe 6 cycles extra just for decoding on any CPU.

>> Pipelining can make bloated code run better than it looks, but probably
>> not for the generated code above, since shifts are slow on some i386's
>> and there is an extra dependency for the extra shift operation.
>
> Shifts were slow on early generations of the Pentium 4. Intel corrected this 
> "glitch" in later generations.

Also on the original i386 and on some Athlons (the speed of shifts is bad
and/or went backwards relative to add).  This was fixed in the i486 and in
Phenom(?).  Athlons made sh[lr]d especially slow, and IIRC there is a case
where add is better than shl $1.
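
A trivial (compiler-dependent) illustration of that last point:

static __inline u_int
twice(u_int x)
{
	return (x + x);		/* typically addl or leal, not shll $1 */
}

Most compilers make the same substitution for x << 1 anyway, so this mainly
matters when writing the asm by hand.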

Bruce

