Comparison of quality of generated code by the compilers

Wed Mar 16 06:00:47 UTC 2011

One important attribute of a compiler is the quality of the code it generates.

To assess the difference in generated-code quality between compilers,
an experimental design may be used.

Assume the following design is used.

Select n distinct programs ( each as large as possible ) in such a way that
no source file of one program appears in another program
( compiler libraries excepted ). This prevents correlation between
programs; the programs should be independent of each other.
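The independence requirement above can be checked mechanically. The following is a minimal sketch, assuming each program is described by a list of its source file paths; the function name `shared_sources`, the `exempt` parameter (for compiler libraries), and the sample program names are all hypothetical.

```python
from itertools import combinations

def shared_sources(programs, exempt=frozenset()):
    """Return (name_a, name_b, shared_files) for every pair of programs
    that share a source file not listed in `exempt`."""
    clashes = []
    for (a, files_a), (b, files_b) in combinations(programs.items(), 2):
        common = (set(files_a) & set(files_b)) - set(exempt)
        if common:
            clashes.append((a, b, sorted(common)))
    return clashes

# Hypothetical sample: program name -> source files it is built from.
programs = {
    "prog_a": ["a/main.c", "a/util.c"],
    "prog_b": ["b/main.c", "shared/hash.c"],
    "prog_c": ["c/main.c", "shared/hash.c"],
}

clashes = shared_sources(programs)
print(clashes)  # prog_b and prog_c share a file, so one of them should be dropped
```

An empty result means no two programs in the sample share a source file.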

If the sample size is not computed from a power-of-the-test formula,
select a sample size greater than 15.

A sample size greater than 60 is extremely valuable.
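As an illustration of computing the sample size from a power formula, here is a rough sketch for a paired comparison. It uses the common normal approximation n = ((z_(1-alpha/2) + z_power) / d)^2, where d is the expected mean paired difference divided by the standard deviation of the differences. The function name, the default alpha and power, and the choice of this particular approximation are assumptions, not part of the design above; the approximation tends to slightly underestimate n for small samples.

```python
import math
from statistics import NormalDist

def paired_sample_size(effect_size, alpha=0.05, power=0.80):
    """Approximate number of programs for a two-sided paired test.

    effect_size: expected mean difference / std. dev. of the differences.
    Uses the normal approximation, not the exact t-based calculation.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)

n_medium = paired_sample_size(0.5)
print(n_medium)
```

Smaller expected differences between the compilers require more programs, which is one way to see why larger samples are valuable.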

Only two compilers are compared.

All of the programs must be compilable by both compilers.
Execute the programs and record the success or failure of each in the
following structure:

Program    Clang     GCC
-------    ------    ------

1          0 or 1    0 or 1

2          0 or 1    0 or 1
.
.
.
n          0 or 1    0 or 1

where
      0 is success ( only correct results, without a crash )
      1 is failure ( crash or incorrect results ).

When there are failures,
generate a cross tabulation of the above table:

                      GCC                   GCC
                      Success ( 0 )         Failure ( 1 )
                    |---------------------|---------------------
Clang   Success (0) | count of ( 0 , 0 )  | count of ( 0 , 1 )
                    | pairs               | pairs
                    |---------------------|---------------------
Clang   Failure (1) | count of ( 1 , 0 )  | count of ( 1 , 1 )
                    | pairs               | pairs
                    |---------------------|---------------------
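The cross tabulation can be produced directly from the recorded pairs. A minimal sketch, assuming the results are stored as ( Clang , GCC ) tuples with 0 for success and 1 for failure; the function name and the sample data are hypothetical.

```python
from collections import Counter

def cross_tabulate(results):
    """Build the 2x2 table from (clang, gcc) outcome pairs.

    Rows are the Clang outcome (0, 1); columns are the GCC outcome (0, 1).
    """
    counts = Counter(results)
    return [
        [counts[(0, 0)], counts[(0, 1)]],
        [counts[(1, 0)], counts[(1, 1)]],
    ]

# Hypothetical outcomes for eight programs.
results = [(0, 0), (0, 0), (0, 1), (1, 0), (0, 0), (1, 1), (0, 1), (0, 0)]
table = cross_tabulate(results)
print(table)  # [[4, 2], [1, 1]]
```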

Depending on the structure of the table ( especially the number of
programs ), one of the following tests may be applied:

http://en.wikipedia.org/wiki/Barnard%27s_exact_test
( Barnard's test )

http://en.wikipedia.org/wiki/Fisher%27s_exact_test
( Fisher's exact test )

http://en.wikipedia.org/wiki/Chi-square_test
( Chi-square test )

http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test
( Pearson's chi-square test )

If the difference ( the contingency coefficient ) is significant,
   one compiler is better ( smaller number of failures ),
   the other is worse ( larger number of failures ).
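For small tables, Fisher's exact test can be computed without any statistics package. The sketch below is one possible implementation, assuming the 2x2 table layout shown earlier and the usual two-sided definition ( sum the probabilities of all tables with the same margins that are no more likely than the observed one ). The function name and the sample counts are hypothetical, and the code only computes the p-value; choosing the significance level is left to the analyst.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties are not lost to rounding.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)

# Hypothetical counts: rows are Clang success/failure, columns GCC success/failure.
p = fisher_exact_two_sided([[3, 1], [1, 3]])
print(round(p, 4))
```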

----------------------------------------------------------

Assume there are no failures, and execution times are available.

Program    Clang     GCC
-------    ------    ------

1          t         t

2          t         t
.
.
.
n          t         t

where t is the execution time of the program when built with the
given compiler.

Apply a paired t test.

If the paired differences are significant,
   one compiler is better ( shorter execution time , smaller mean ),
   the other is worse ( longer execution time , larger mean ).
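The paired t statistic is the mean of the per-program differences divided by its standard error, with n - 1 degrees of freedom. A minimal sketch using only the standard library follows; the function name and the timing values are hypothetical, and the p-value lookup is left out because the standard library has no Student's t distribution.

```python
import math
from statistics import mean, stdev

def paired_t(times_a, times_b):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    if len(times_a) != len(times_b):
        raise ValueError("both compilers must be measured on the same programs")
    diffs = [a - b for a, b in zip(times_a, times_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two programs are required")
    std_error = stdev(diffs) / math.sqrt(n)  # sample std. dev. of the differences
    return mean(diffs) / std_error, n - 1

# Hypothetical execution times in seconds for five programs.
clang_times = [12.0, 15.0, 9.0, 20.0, 11.0]
gcc_times = [10.0, 14.0, 10.0, 17.0, 9.0]
t_stat, dof = paired_t(clang_times, gcc_times)
print(round(t_stat, 3), dof)
```

The absolute value of the statistic is then compared with the critical value of Student's t distribution for the chosen significance level and n - 1 degrees of freedom. The same function applies unchanged to any other paired measurement.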

---------------------------------------------------------

The above paired t test may also be applied to the sizes of the
generated programs.

If the paired differences are significant,
   one compiler is better ( smaller program size , smaller mean ),
   the other is worse ( larger program size , larger mean ).

Thank you very much.

Mehmet Erol Sanliturk