buggy optimization levels...

Sun Aug 3 12:34:31 PDT 2003

Erik Trulsson wrote:
> On Sat, Aug 02, 2003 at 03:52:25PM -0400, Chuck Swiger wrote:
[ ... ]
> That wasn't my real point anyway. I was trying to refute your statement
> that "Even if the code contains a bug, "cc -O" and "cc -O -fgcse"
> should produce the same results."
> 
> I claim that if the code has a bug that results in undefined behaviour
> then the compiler is allowed to produce different results when invoked
> with different optimization flags.

If the code being compiled has a bug that results in undefined behavior, the 
compiler is allowed to produce different results when invoked with different 
optimization flags.

While true, that doesn't refute my statement: what the compiler is allowed to 
do, and what the compiler should do, is required not to diverge for code which 
does not involve undefined behavior.

[ ... ]
>>Page 586 of _Compilers: Principles, Techniques, and Tools_ states:
>>
>>First, a transformation must preserve the meaning of programs.  That is, an 
>>"optimization" must not change the output produced by a program for a given 
>>input, or cause an error, such as a division by zero, that was not present 
>>in the original program.  The influence of this criterion prevades this 
>>chapter; at all times we take the "safe" approach of missing an opportunity 
>>to apply a transformation rather than risk changing what the program does.
>>
>>	--
>>
>> Like your divide-by-zero example above, or this paragraph about "semantics" 
>> vs "meaning", I'm not going to disagree if you want to state that the 
>> running time of a program is part of the "behavior".  However, using the 
>> terms in such a fashion precludes you from understanding the intended 
>> meaning in this particular context.
> 
> I understand the intended meaning. I just don't agree completely with
> your reasoning.   I would say that the compiler is not allowed to
> *introduce* errors, or to change the meaning of *correct* programs.

So far, so good.

> (Correct here essentially meaning programs that do not invoke undefined
> behaviour.)  For programs that do not have a defined 'meaning' the
> compiler is free to do anyting.

This is wrong.  The fact that integer divide-by-zero by not defined by ANSI C 
(C89, et al), means the following program does not have well-defined behavior:

int main() {
   return 120 / 0;
}

...however, what happens if you remove the semicolon after the zero?

That results in a syntax error, and the compiler is _not_ "free to do anything" 
it likes in such a case: it is required to identify the syntax error in code 
which fails to parse by issuing an error message:

d.c: In function `main':
d.c:3: syntax error before `}'

There is more to the point just above than just identifying syntax errors.  If 
you have 5 input source files (call them "translation units"), of which four are 
valid code, and the fifth contains an error resulting in undefined behavior, the 
compiler is required to compile four of the five source files in a well-defined 
fashion.

One can repeat that analysis for code within the fifth input source file: if you 
moved only the function containing the bug/undefined code to a sixth file, one 
would discover that the compiler will handle the rest of the code in that file 
in a well-defined fashion.

One could repeat the analysis yet again, using units known as "basic blocks", 
which may either be a single intermediate code instruction (that's intermediate 
code within the AST, not the input source code), or a sequence of instructions 
terminated by a flow-of-control instruction like a branch, return from 
subroutine, memory protection/barrier commands, etc.

> If a compiler was not allowed to change the result of running a program
> having undefined behaviour when compiled with different optimization
> flags, then this would preclude doing just about any optimization.

No, it would not.  Code optimization consists of transformations to the AST 
based on algebraic invariants, liveness analysis used by CSE and dead-code 
removal, loop invariants, and other techniques which are universal 
(platform-independant), as well as register allocation, peephole analysis, and 
other transformations to the target code which are platform-specific.

What happens to source code with undefined semantics isn't particularly useful 
to the person writing compiler: valid optimization techniques are required to 
not change the meaning of any possible well-defined input source code, so being 
able to behave differently in the case of undefined semantics doesn't help.

> I am fairly certain that for *all* the specific optimizations available
> in gcc it is possible to find *some* program that will give different
> results depending on if that optimization ws used or not.

Are you claiming that every specific optimization available in gcc contains 
bugs?  :-)

> If one were to accept your claims that the compiler should not perform
> any optimizations that could change the behaviour any program, then it
> would not be able to do any optimization at all, which is clearly not a
> desirable situation.

It is entirely possible to write an optimizing compiler which does not make any 
changes to the behavior of any valid input source program.  Aho, Sethi, and 
Ullman spend about 150 pages discussing how to do so in the reference I 
provided.  A simple example would be:

	if (0 && (expression)) statement;

DeMorgan's laws and short-circuit evaluation in C let you simply that to:

	if (0) statement;

...which will then be removed entirely by dead-code analysis.

> The reason why the C standard does not define the behaviour of certain
> constructions or code sequences, and why compilers give different
> results when compiling such code, is that doing so allows compilers to
> perform more aggressive optimizations on correct code.

People sometimes compile without using the optimizer at all, yes?

The issues of why "the C standard does not define the behavior..." and "code 
optimization" are orthogonal.  For the most part, the reason why C standard 
leaves things undefined is because of platform-specific differences in how the 
underlying hardware behaves-- things like register, long, and pointer size, byte 
order.

It's generally easier to write optimizations where things are well-defined than 
where they are not: in particular, the pointer aliasing issues in C make 
liveness analysis a nightmare compared to Java or Python.

-- 
-Chuck