Comparison of quality of generated code by the compilers
Mehmet Erol Sanliturk
m.e.sanliturk at gmail.com
Wed Mar 16 06:00:47 UTC 2011
One important attribute of a compiler is the quality of the code it generates.
To assess whether two compilers differ in the quality of their generated code,
an experimental design may be used.
Assume the following design.
Select n distinct programs, each as large as possible, in such a way that
no source file of one program appears in another program
(except compiler libraries). This prevents correlation between
programs, which should be independent of each other.
If the sample size is not computed from power-of-the-test formulas,
select a sample size greater than 15.
A sample size greater than 60 is extremely valuable.
Only two compilers are compared.
All of the programs must be compilable by both compilers.
Execute the programs and record their success or failure in the following
structure:

Program       CLang       GCC
------------  ----------  ---------
1             0 or 1      0 or 1
2             0 or 1      0 or 1
.
.
.
n             0 or 1      0 or 1

where
0 is success (only correct results, without a crash)
1 is failure (crash or incorrect results).
When there are failures,
generate a cross tabulation of the above table:

                |  GCC Success (0)     |  GCC Failure (1)
 ---------------|----------------------|---------------------
 CLang Success  |  count of (0, 0)     |  count of (0, 1)
 (0)            |  pairs               |  pairs
 ---------------|----------------------|---------------------
 CLang Failure  |  count of (1, 0)     |  count of (1, 1)
 (1)            |  pairs               |  pairs
 ---------------|----------------------|---------------------
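As a rough illustration, the cross tabulation can be built from the two
result columns with a few lines of Python. This is only a sketch: the
function name cross_tabulate and the eight sample outcomes are hypothetical
and are not measurements of any real compiler.

```python
from collections import Counter

def cross_tabulate(clang, gcc):
    """Return a 2x2 table: rows are the CLang outcome, columns the GCC outcome.

    Each input is a list with one entry per program,
    0 for success and 1 for failure.
    """
    if len(clang) != len(gcc):
        raise ValueError("both compilers need one outcome per program")
    counts = Counter(zip(clang, gcc))  # counts of (CLang, GCC) pairs
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]

# Hypothetical outcomes for 8 programs (made-up data).
clang_results = [0, 0, 1, 0, 1, 0, 0, 1]
gcc_results = [0, 1, 1, 0, 0, 0, 1, 1]
print(cross_tabulate(clang_results, gcc_results))
```

The four cells always add up to n, the number of programs, which is a
quick sanity check on the recorded data.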
Depending on the table structure (especially the number of programs),
one of the following tests may be applied.
http://en.wikipedia.org/wiki/Barnard%27s_exact_test
( Barnard's test )
http://en.wikipedia.org/wiki/Fisher%27s_exact_test
( Fisher's exact test )
http://en.wikipedia.org/wiki/Chi-square_test
( Chi-square test )
http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test
( Pearson's chi-square test )
If the difference (the contingency coefficient) is significant,
one compiler is better (smaller number of failures) and
the other is worse (larger number of failures).
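For small tables, Fisher's exact test from the list above can be computed
directly from the hypergeometric distribution. The sketch below uses only
the Python standard library; the function name fisher_exact_two_sided and
the example counts are hypothetical, and the two-sided p-value is taken as
the sum of the probabilities of all tables that are no more likely than the
observed one, which is one common convention. Note that the test checks the
2x2 table for association between its rows and columns.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are as likely as, or less likely than,
    # the observed one (small tolerance for floating-point ties).
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical cross tabulation for 16 programs (made-up counts).
example_table = [[8, 2],
                 [1, 5]]
print(fisher_exact_two_sided(example_table))
```

The resulting p-value is compared with the chosen significance level,
for example 0.05.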
----------------------------------------------------------
Assume there are no failures and execution times are available.

Program       CLang       GCC
------------  ----------  ---------
1             t           t
2             t           t
.
.
.
n             t           t

where t is the execution time of the program.
Apply a paired t test.
If the paired differences are significant,
one compiler is better (shorter execution time, smaller mean) and
the other is worse (longer execution time, larger mean).
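The paired t statistic itself is simple to compute: take the per-program
differences, then divide their mean by their standard error. The sketch
below uses only the Python standard library; the function name paired_t and
the four timings are hypothetical. It returns the statistic and the degrees
of freedom (n - 1), which still have to be compared against a t table or
handed to a statistics package to obtain a p-value.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(times_a, times_b):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    if len(times_a) != len(times_b):
        raise ValueError("each program needs one time per compiler")
    diffs = [a - b for a, b in zip(times_a, times_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two programs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1

# Hypothetical execution times in seconds for 4 programs (made-up data).
clang_times = [10.0, 12.0, 9.0, 11.0]
gcc_times = [9.0, 10.0, 8.0, 9.0]
t_stat, dof = paired_t(clang_times, gcc_times)
print(t_stat, dof)
```

A positive t here means the first list (CLang in this made-up example) has
the larger mean time. The same function can be applied unchanged to
generated program sizes instead of execution times.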
---------------------------------------------------------
The above paired t test may also be used for the generated program sizes.
If the paired differences are significant,
one compiler is better (smaller program size, smaller mean) and
the other is worse (larger program size, larger mean).
Thank you very much .
Mehmet Erol Sanliturk