long(!) Re: need help on CFLAGS in /etc/make.conf please

Paul Seniura pdseniura at techie.com
Thu Feb 19 17:40:45 PST 2004


Hi Chuck, me again

I'll study the pointers you mentioned, and for now merely reply to what I can.
Thank you very much for spending time on this.

>> The delay in my response here was due to pest control in our building
>> and the three-day weekend (I have no li'l-endians at home ;) .
> 
> No problem...and a good job of solving the endian-debate.  :-)

Heh... if you only knew...
I am actually a system programmer on IBM mainframes <- big-endians.
I better stop right there before I get really cranked-up.  :)

>>[...]
>> I want a default setting -O "iff"=="if and only if" the original does not
>> provide it.  That's what "default setting" means.  ;)
>
> If the port uses "CFLAGS ?=" or uses that value via implicit rules, you will 
> get the behavior you've asked for by not setting CFLAGS at all: ie, the port 
> will use whatever CFLAGS setting it has as the default unless another value is 
> specified elsewhere.
> 
> If the port uses "CFLAGS =", the port Makefile or possibly a patch in the 
> files subdirectory ought to override this to pay heed to the system-wide 
> settings.  In this case, you will have to modify that mechanism for each 
> relevant port yourself.

Yeah I found out the hard way.  ;)  I have CPUTYPE=p2 in my
/etc/make.conf because the example form, CPUTYPE?=p2, didn't 'take' anywhere,
not even when building kernel & world.
BTW this is one parm I must override:  this is an early p2 chip, and it does
not have much of what is assumed in later i686 chips.  I had been seeing some
unexplainable glitches & traps etc. until I forced things to recompile with
this 'dumber' setting (told the higher-ups that, too).  Now at least I'm
seeing the same glitches that others are seeing (usually ;) and they're
repeatable/recreatable (usually ;) .  I've had to explain to the higher-ups
why I can't use prebuilt binaries (packages or rpms), because 'they' provided
a box that has some really old chips (slated for state auction, but we kept
'em because of the shortfall budget fiasco y'know). <sigh>
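
For anyone else following along, the difference between the two assignment
forms in make(1) syntax boils down to this (a quick sketch of mine):

      CPUTYPE?=p2   # a default: takes effect only if CPUTYPE isn't set yet
      CPUTYPE=p2    # unconditional: always sets it -- what I needed here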

[...I'll snip some here and study it...]

>> And as far as i386 is concerned, it is looking like -O2 is the "maximum"
>> that should be attempted.  Other platforms, e.g. -march=7450 I use at home,
>> can go -O5 without problems.  At any rate, I certainly want to cut-down
>> anything like what Kris mentions e.g. -O999!
> 
> It's entirely likely that -O3, -O5, and -O999 will all behave exactly the 
> same.  Have you benchmarked any differences in performance?

It's going to be tough to do any accurate benchmarking:  every time this p2
boots up, it shows slightly different timecounter calibrations in dmesg.
(Yeah, that's all I need, a PC that may be on the skids. ;)  It isn't supported
anymore because IBM won't even consider maintenance contracts for this model.)

I forgot to mention another reason for wanting to set -O levels.  Now,
I'm not overly sure about GCC's criteria; they keep changing as GCC is
developed, and they differ across platforms & chips.  But AFAIK for
-march=7450 (G4), and the way Apple has it working under Xcode, GCC will
not honor some of the compiler's tweaking flags if -O is not high enough.
For example, MPlayer sets this high on purpose, so GCC will actually
'turn on' what is specified in MPlayer's Makefile for loop-unrolling and
other such tweaks.  IIRC GCC certainly needs the -faltivec parm before
it'll even consider compiling any Motorola vector code in the src -- if
you don't have a 'non-vector replacement' function there, the vector
code will simply be ignored and nothing inserted in its place.  "Oops"
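
To show what I mean by a 'non-vector replacement', here's a li'l sketch off
the top of my head (not from any real project; assumes 16-byte-aligned
buffers, and that -maltivec -- or Apple's -faltivec -- defines __ALTIVEC__):

      #ifdef __ALTIVEC__
      #include <altivec.h>   /* FSF GCC; Apple's -faltivec builds it in */
      #endif

      void add_floats(float *dst, const float *a, const float *b, int n)
      {
          int i = 0;
      #ifdef __ALTIVEC__
          /* vector path: four floats per trip through the 128-bit unit */
          for (; i + 4 <= n; i += 4) {
              vector float va = vec_ld(0, &a[i]);
              vector float vb = vec_ld(0, &b[i]);
              vec_st(vec_add(va, vb), 0, &dst[i]);
          }
      #endif
          /* scalar replacement -- without it, no AltiVec means no code! */
          for (; i < n; i++)
              dst[i] = a[i] + b[i];
      }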

OTOH a higher -O will automatically turn on other tweaking flags that you
might not want, or that you'd have to steer around for your code to function
correctly.  Each next -O level turns on the previous level's tweaks plus some more.
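
For instance (just my own illustration, though these are real GCC flags),
you can mix a level and individual tweaks by hand:

      gcc -O  -funroll-loops -c foo.c        # low -O plus one extra tweak
      gcc -O2 -fno-strength-reduce -c foo.c  # all of -O2 minus one tweak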

Benchmarks aren't the main reason to give it a high -O.  I'm finding it
difficult to explain exactly what I mean here, because of course faster
benchmarks are the end result.  Let me try:  GCC won't use the full power
of the chips (cache, coprocessors, etc.) and/or won't consider other
flags & options if told (in effect) not to optimize.  GENERALLY.
I hope I said that correctly.  ;)

BTW I've been testing -O2 for the custom kernel during the past few days. 
FWIW the 'feel' is _quite_ different.  ;)  I _think_ we shaved ~5 minutes
off a buildkernel (usually around 30 minutes with plain -O).
If I can find a way to make decent standalone backups, I'd love to build
world at -O2, too.
And as of lunchtime I have hand-patched and compiled the libthr "SIG-less"
changes that can be seen on the -threads list (posted earlier today).

> You might find it interesting to review a thread from July of last year titled 
> "buggy optimization levels", in which I wrote:  /usr/src/contrib/gcc/toplev.c 
> is clear enough which specific optimizations are involved at the different 
> number levels:
> 
>    if (optimize >= 1)
>      {
>        flag_defer_pop = 1;
>        flag_thread_jumps = 1;
> #ifdef DELAY_SLOTS
>        flag_delayed_branch = 1;
> #endif
> #ifdef CAN_DEBUG_WITHOUT_FP
>        flag_omit_frame_pointer = 1;
> #endif
>      }
> 
>    if (optimize >= 2)
>      {
>        flag_cse_follow_jumps = 1;
>        flag_cse_skip_blocks = 1;
>        flag_gcse = 1;
>        flag_expensive_optimizations = 1;
>        flag_strength_reduce = 1;
>        flag_rerun_cse_after_loop = 1;
>        flag_rerun_loop_opt = 1;
>        flag_caller_saves = 1;
>        flag_force_mem = 1;
> #ifdef INSN_SCHEDULING
>        flag_schedule_insns = 1;
>        flag_schedule_insns_after_reload = 1;
> #endif
>        flag_regmove = 1;
>      }
> 
>    if (optimize >= 3)
>      {
>        flag_inline_functions = 1;
>      }
> 
> This was for gcc-2.95; in gcc-3.4 this code was moved to a file called opts.c, 
> but -O4, -O5, and -On for any n >=3, all do the same thing.  Really!

I can kinda see why. ;)  I mentioned above how each next level of -O will 
include what the previous level turned on, and then add more optimizing. 
I think GCC-for-i386 stops at -O3, meaning there ain't no mo' tweaking it can
do.  GNU's web site documents it.  And remember I said I wouldn't trust it
past -O2 for the time being (on i386).  We'll see how MPlayer's -O3
works tomorrow.  ;)

The AIM alliance documents GCC going up to -O5 for the PPC chips, esp. for the
models that have Altivec.  Each level for PPC may or may not correspond to
the same level on i386.  That's why, for example, MPlayer's own Makefile will
do -O5 on my G4 at home and only -O3 for this i386 box here.  It needs
something at level -O5 on PPC -- and that does not necessarily correspond to
any -O level on i386, see.
We can't "compare" -O levels across vastly different chips like this.  It's
a bit like Apple's "MegaHertz Myth" -- AMD was fighting the same kind of
thing in their advertising. ;)

As an aside --
Since the days of the K&R C compiler (that came with Microware OS-9 for the
Tandy / Radio Shack Color Computer 3 and Cumana's version for Atari-ST),
we would've coded a fairly long single-line statement such as

      this = that = other = stuff = 1;

and the K&R compiler would've known exactly the kind of optimization we wanted.
It would even use a register to hold the '1' constant, as memory moves were
(still are) expensive -- moved, say, from wherever C keeps the 'constants'.

Today's compilers ought to be able to figure out what is common among
separate statements like

      this = 1;
      that = 1;
      other = 1;
      stuff = 1;

If today's much-smarter compilers couldn't figure out the common-ness of your
code like this, I would find another compiler!  ;)

BTW the G4's AltiVec unit can set all those fields in one fell swoop! ;)  The
trick is designing a compiler that can 'realize' such actions *automatically*,
because right now we must use the agreed-upon mnemonics that GCC knows are
part of the 745x CPU, turned on by both the -faltivec flag and an
appropriate -march= CPU model together.  The mnemonics were accepted by the
Apple-IBM-Motorola [AIM] alliance, and (finally) the GNU folks got their stuff
rolled into GCC so it can be an official cross-compiler.  But as I just said,
we must use the AltiVec vernacular to get these optimizations -- the trick
is to train GCC to recognize how to do this automatically!  I don't see that
happening at all, btw.  Talk about A.I. ;)
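
Just to make the 'one fell swoop' concrete, a toy example of mine (made-up
field names; needs -faltivec/-maltivec and a 16-byte-aligned struct):

      #include <altivec.h>

      struct quad {
          int this_, that, other, stuff;        /* 4 x 32 bits = 128 bits */
      } __attribute__((aligned(16)));

      void set_all_to_one(struct quad *q)
      {
          vector signed int ones = vec_splat_s32(1);  /* {1,1,1,1}, one insn */
          vec_st(ones, 0, (vector signed int *)q);    /* one 128-bit store */
      }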


>[ ... ]
>> A msg from Richard Coleman, taken together with the GCC 3.x Known Bugs
>> site, is leading me to believe any bugs solely due to higher -O levels need
>> to be fixed by the author(s) of the software.
> 
> Heh.  With regard to optimization, page 586 of _Compilers: Principles, 
> Techniques, and Tools_ states:
> 
> ]First, a transformation must preserve the meaning of programs.  That is, an 
> ]"optimization" must not change the output produced by a program for a given 
> ]input, or cause an error, such as a division by zero, that was not present in 
> ]the original program.  The influence of this criterion pervades this chapter; 
> ]at all times we take the "safe" approach of missing an opportunity to apply a 
> ]transformation rather than risk changing what the program does.

I could've sworn I read that back in the 1970s & '80s with K&R!  ;)

But I was trying to knock down some age-old notions that GCC had optimization
bugs in & of itself.  I *really* believe 3.x can be trusted a lot more than
some people seem willing to trust it -- as long as you know what its Known
Bugs cases are and how to deal with 'em.  ;)

> [ ... ]
>> You're changing what the author sets-up before any hack-job I invent will
>> even see it.  Why?  If I interpret what Kris said correctly, he wants you
>> to think GCC 3.x is the source of the bugs at -O2+.
> 
> You're not interpreting Kris' position correctly.
> 
> I believe that Kris disavows setting higher optimization levels because it is 
> extremely difficult to track down the bugs which result (most particularly in 
> the kernel, which must do all sorts of pointer-aliasing games) and thus the 
> cost/benefit ratio of higher optimizations isn't worth his time.

I can see it your way, too, but there was something somewhere that made me
scrub my chin and rethink what Kris meant -- it's lost now, though...

An idea did pop into my head just now, tho -- actually it's a 'remembrance'
of what we did in the old days (and still do on mainframes).  We can always
study the assembler src output from C before it gets processed further, to
see where things went wrong during optimization.  Oh you bet we'll file a
report with IBM/whoever when we can prove it.

BTW for maximum tuning on _final_ code, we would edit that asm output
manually -- you'd be amazed how much can still be cut out and redesigned,
and that's how a 1.7MHz 6809 (CoCo3) could beat a 4MHz 80286 (IBM's AT).  ;)
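
FWIW getting at that asm output from GCC is the easy part (standard flags):

      gcc -O2 -S foo.c              # stop after compiling; writes foo.s
      gcc -O0 -S -o foo-O0.s foo.c  # same src with the optimizer off
      diff foo-O0.s foo.s           # see what the optimizer changed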


>[ ... ]
>> I reiterate the notion of other platforms working fine with optimizations
>> and FBSD is slowing down because IMHO of some age-old assumptions about GCC
>> itself.  As a specific example:  If GCC 3.3.3 generates really fast code
>> on a Linux/i386 app *and* it's proven to work well, then FBSD/i386's code
>> should fly just as fast at the same level with no problem.  Oh but y'all
>> are hacking the guts out of the optimization settings coming from the
>> author, so FBSD/i386 will never see the same end-results here.
> 
> Paul, you really ought to benchmark what the compiler actually does between 
> -O2 and -Onnn: often, there is zero difference in performance.
> 
> It would be unusual for there to be more than a factor-of-two difference in 
> performance between unoptimized code and -O (aka -O1); -O2 might buy you 
> another 10-20%, and -O3, -O4, or higher 5% or less.  YMMV.

I was alluding to optimization bugs in GCC itself.  If Linux/i386's GCC can
generate good functional code and be fast, GCC on FBSD should be able to
do the same -- both should emit proper optimized i386 instructions, etc.
The point goes back to my earlier wish to trust the author's original
Makefile and not hack it further with FBSD's Makefile -- if the author's
platform was Linux, it's highly likely it was i386 also, and we on FBSD/i386
should be able to trust his settings there and reap the same rewards. ;)

Now ya got me contemplatin' on my hack --

For apps that don't have any tweaks, I think a successful boost would need
a combination of -O and other parms.  I'd still leave it up to the author,
unless it's something used in so many places, like the libs.

I removed my -O in /etc/make.conf to rebuild MPlayer with today's CTM deltas
(it finally got un-broke).  MPlayer uses -O3 if I set another of its
documented knobs, which I did also.  We'll see how that behaves tomorrow.

At the same time, a revision-bump for Epiphany came thru today, too.  It
did not use any -O at all (my -O was still removed).  Here's a good example
of an often-used app that could use some of that kind of tweaking.  Okay,
it sets ${CFLAGS} itself, so my CFLAGS?= setting would not be noticed -- so
?= is not the way to do this.  Doing CFLAGS+= _might_ be noticed, but other
ports would not pick it up the way I meant (depending on exactly where those
ports include ${CFLAGS}, if at all; some will override it completely with a
plain single '=').
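
To recap the three make(1) operators in play here, a toy sketch (FOO just
stands in for CFLAGS):

      FOO?= -O2 -pipe   # only if FOO has no value yet -- a true default
      FOO+= -O2         # appends to whatever FOO already holds
      FOO=  -O          # replaces it outright -- the clobber case above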

The problem with my hack is that we don't have a separate setting/knob for
-O by itself ... it will be found 'somewhere' in ${CFLAGS}.
I'm thinking my hack would entail scanning the resulting ${CFLAGS} after
gmake has finalized it but before it invokes GCC/whatever... somehow.
On top of that, your sed example (previous msg) might accidentally change
a generic '-O' string somewhere else in the line that gets passed to
GCC/whatever.  I'll be pondering that, too.
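
On that sed worry:  anchoring the pattern on whitespace would at least keep
it from clobbering an unrelated '-O' inside some other token -- roughly like
this (untested, BSD sed's -E syntax):

      # swap a standalone -O<n> for -O2; leave e.g. '-DNO-OPT' alone
      CFLAGS=`echo "$CFLAGS" | sed -E 's/(^| )-O[0-9]*( |$)/\1-O2\2/g'`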

Taking Epiphany as a further example:  if users are complaining about
its slowness, I would think its authors should be responsible for adjusting
its compiler settings and issuing a 'beta' version of the app for testing in
that mode.  That team would be better placed to spot optimization glitches.
OTOH 'we' could tweak Epiphany's Makefile on our own and provide feedback
to the team.
(For something as big as KDE, with the number of users complaining already,
I'm hoping the KDE teams are listening. ;)

It's just that right now, providing a default -O hack that works as a
true default is looking ugly -- but I'd like to try anyway. ;)


As an aside --
I'm seeking to figure out how Apple got Panther _noticeably_ faster than
Jaguar.  That's the kind of 'oomph' we need on FBSD/i386.  I mean, my
old upgraded PowerMac 7600 350MHz G3 + 66MHz bus/RAM + ATI Radeon PCI is
faster than this Puny Pentium2 450MHz + PC100 RAM + ATI Radeon AGP.  (My G3
box would be a fairer comparison than my G4 Sawtooth box. ;)  At any rate,
there's sumthin wrong here!  ;)  Most if not all of Panther's speed has got
to be because it switched us to GCC-3.3, but I'm not seeing such a difference
between GCC 2.95 (Jaguar / FBSD 4.x) and GCC 3.3 (Panther / FBSD 5.x) on i386
here.  We're missing something on i386, and I'm trying to find 'it'.
I wouldn't think Apple is spending time editing the C asm src output as we
did back in the CoCo3 days I mentioned above (and still do on mainframes). ;)
If GCC-for-i386 isn't optimizing that much with higher -O, then we need to
figure out why and make it better ('we' as a GNU project).  I'm hoping the
feedback from Apple is helping _all_ FBSD folk here, not just the PPC folk. ;) 
But for the time being, the hacking that removes or changes the author's -O
and other flags is the main thing constraining us.

The 'fink' project for MacOSX is completely ignoring my environment settings,
too -- I mean completely ignoring anything I set.  So I don't/can't use their
nifty automatic tools & methods to install open-source apps at home.  That's
a whole 'nuther story, but it shows the kinds of problems other projects have.


P.S. IBM's mainframe (OS/390, z/OS) does not provide any sort of C-language
macros or defs or anything at all for system-level code.  Either we write it
in 370 Assembler or we use IBM's proprietary language PL/S (based on PL/1). 
Now, applications _can_ be written in C, tho (linklibs will turn e.g. 'get()'
into the appropriate system call), but system-level code & libs etc. must be
Asm or PL/S.  I.e. if we have a bug in the deep-down call for 'get' itself,
ya gotta know 370 Assembler.  If you write system stuff in C, you're
completely on your own, IBM ain't gonna help ya there. 
I know we can't do BSD stuff in 100% Asm, it wouldn't be portable, but I'm
trying to relate how much mainframe stuff is actually Asm.  IBM is trying to
change how much Asm can be re-written in C, because they *know* there won't
be people they can hire that know 370 Asm.  But now you'll see bugs in what
was once a trusted set of system utilities -- because a bug that originated
in the C libs has propagated out to system-level code, or a bug that was fixed
in Asm doesn't have a C equivalent.  Nasty stuff I tell you, because it's
already happened, yes really. ;)
And don't get me started about endian-ness. ;)

> -- 
> -Chuck

I'll study the links you provided...

Thank you again,

  --  Paul Seniura (in OkC)



