cvs commit: src/sys/sparc64/include in_cksum.h

Bruce Evans brde at
Sun Jun 29 11:28:45 UTC 2008

On Sun, 29 Jun 2008, Christoph Mallon wrote:

> Bruce Evans wrote:
>>>> Right.  Though I've never seen unnecessary's __volatiles significantly
>>>> affecting i386 code.  This is because the code in the asms can't be
>>>> removed completely, and can't be moved much either.  With out of order
>>>> execution, the type of moves that are permitted (not across dependencies)
>>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> are precisely the type of moves that the CPU's scheduler can do or undo
>>>> no matter how the compiler orders the code.
>>> I disagree. For example look at the use of in_addword() in dev/sk/ 
>>> in line 2819:
>>>  csum1 = htons(csum & 0xffff);
>>>  csum2 = htons((csum >> 16) & 0xffff);
>>>  ipcsum = in_addword(csum1, ~csum2 & 0xffff);
>>>  /* checksum fixup for IP options */
>>>  len = hlen - sizeof(struct ip);
>>>  if (len > 0) {
>>>    return;
>>>  }
>>> The calculation will be executed even if the following if (len > 0) leaves 
>>> the function and the value of ipcsum is unused.
>>> If in_addword() is not marked volatile it can be moved after the if and 
>>> not be executed in all cases. csum1 and csum2 can be moved after the if, 
>>> too.
>> No, volatile has no effect on whether the above calculation will be
>> executed, since the early return has no dependencies on the caclulation.
> The volatile induces a dependency.

Dependency on what?  Perhaps the only guarantee is that a volatile asm
with no other dependencies is executed at some point before the next
function call or return. doesn't guarantee anything except not
removing a volatile asm, but moving it far enough away to not execute it
before return would be equivalent to removing it.

>> Old versions of gcc used to handle volatile like that, but this changed
>> in gcc-3 or earlier. now says:
>> % The `volatile' keyword indicates that the instruction has important
>> % side-effects.  GCC will not delete a volatile `asm' if it is reachable.
>>                                                       ^^^^^^^^^^^^^^^^^^^
> This is not about whether the code is reachable or not (it is reachable), it 
> is about whether the result is used (i.e. whether the code is dead).
>> % (The instruction can still be deleted if GCC can prove that
>> % control-flow will never reach the location of the instruction.)  Note
>> % that even a volatile `asm' instruction can be moved relative to other
>>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> % code, including across jump instructions.  For example, on many targets
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> jump != conditional jump.
> If it moves the *volatile* asm statement across the if, it would only appear 
> on *some* execution paths, which is wrong. It is perfectly fine to move it, 
> when the statement is not volatile, though.

I don't see why a plain conditional jump would be restricted.  Only returns
and possibly function calls must be restricted, since the compiler can't
do any dependency analysis across returns so it must execute undead code
before return.

The above paragraph from continues as follows:

% there is a system register which can be set to control the rounding
% mode of floating point operations.  You might try setting it with a
% volatile `asm', like this PowerPC example:
%             asm volatile("mtfsf 255,%0" : : "f" (fpenv));
%             sum = x + y;
% This will not work reliably, as the compiler may move the addition back
% before the volatile `asm'.  To make it work you need to add an
% artificial dependency to the `asm' referencing a variable in the code
% you don't want moved, for example:

Suppose that this example has an if statement:

 	asm volatile(...);
 	if (independent_var)
 		sum = x + y;
 		sum = y + z;

Here there is no reason for this simple if statement to prevent moving
the asm any more than the simple assignment, since nothing in any
clause of the if statement has any explicit dependencies on the asm,
just like the simple assignment has no explicit dependencies.

>> Even if gcc didn't move the caclulation, then CPUs with out of order
>> execution might schedule it so that it is effectively never executed
>> (most likely by executing it in otherwise-unused pipelines while the
>> main pipeline returns).  This is valid for the same reasons that gcc
>> can move the volatile asms -- the return doesn't depend on the result
>> of the caclulation.
> This is for the CPU to decide. If the assembler block really contains 
> "important" stuff like memory barriers, writes to machine control registers 
> etc., the CPU will not "effectively never execute" the code. The compiler 
> does not know this, all it sees is the word "volatile".

The above example shows that volatile is not enough for a memory
barrier.  A memory clobber is probably needed for barriers, but
interferes with optimization.  On ia64, barrier instructions seem to
just use volatile, while on i386 the only fairly explicit barrier
function is bus_space_barrier(), which uses a dummy locked instruction
plus a memory clobber for read and a null instruction plus a mmory
clobber for write.  The hardware is barely involved, but the memory
clobbers are needed to tell the compiler not to move loads and stores
across the asm, or at least to force them to memory.

>> The above C code is fairly bad, but generates not so bad code on i386:
>> % %     movl    %esi, %eax
>> % #APP
>> %     xchgb %ah, %al        # byte operations can be slow; this one not
>>                 # too bad, but I wonder if rorw $8 is better
>>                 # (rorl $16 is already used for corresponding
>>                 # 32-bit operations) where there is no xchg
>>                 # alternative
>> % #NO_APP
> And this again is an example why not to use inline assembler, but let the 
> compiler decide this:
> unsigned short swap16(unsigned short x)
> {
>  return x >> 8 | x << 8;
> }
> is compiled to
> swap16:
>        movl    4(%esp), %eax
>        rolw    $8, %ax
>        ret

Ah, that's an example of the compiler turning naive C code into shifts
that I couldn't find last time I looked.

gcc-4.2.1 doesn't generate quite such good code here:

% swap16:
% 	movzwl	4(%esp), %eax
% 	rolw	$8, %ax
% 	movzwl	%ax, %eax
% 	ret

This takes -fomit-frame-pointer and -O (arch=native).  For some arches
it generates movl instead of the first movzwl.  For plain i386 it
generates movl but should generate movw.  It seems to generate the
final movzwl.  This is bogus, but is probably required by the i386 ABI
(to be backwards compatible with unprototyped functions).  On amd64,
it does a direct rolw on a register arg (I think the movzwl still gets
done, but in the caller), and it still does the movzwl on the return

gcc-3.3.3 generates horrible code (movl; movl; shrw; movzwl; sall; orl;
movzwl with -O and movzwl; movl; shrw; sall; orl; movzwl with -O2).

gcc-4.2.1 produces worse code than the above for the __byte_swap_word_const()
macro in i386/include/endian.h, and good code like yours for
__word_swap_int_const() there, and disgustingly bad code for __bswap64()
there. doesn't mention bswap, so the asm version is out of reach,
but surely the 64-bit ints can be handled better.

>> %     shrl    $16, %esi
>> %     movl    %esi, %edx
>> % #APP
>> %     xchgb %dh, %dl        # as above
>> % #NO_APP
>> %     notl    %edx        # poor asm code -- the top 16 bits are unused
>>                 # except here to stall for merging them with
>>                 # the previous byte operation
> The compiler simply does not know, that the inline assembler only operates on 
> parts of the register. Another reason not to use inline assembler:

Actually, this is poor compiler code and the compiler does know here -- the
asm returned a u_short, and the compiler doesn't use the top 16 bits except
to let garbage accumulate in them.  This might not actually stall, so doing
this would be good, for the target CPU.  I forget what the target CPU was.

> u_int g(u_short x)
> {
>  return ~swap16(x);
> }
> g:
>        movl    4(%esp), %eax
>        rolw    $8, %ax
>        movzwl  %ax, %eax         # avoid stall
>        notl    %eax
>        ret

Inlining is good for avoiding a movzwl's in both the callee and the caller.

Here you have changed the interface a little -- g() returns 32 bits
-- so the movzwl and the notl are good for implementing C's promotion
rules and it isn't clear that they are to avoid the stall.  However,
when I changed g() to return u_short, they are still generated, even
on plain i386 where there is a slow movzwl and no stalls.


More information about the cvs-src mailing list