Branch prediction

David Schultz das at FreeBSD.ORG
Tue Feb 17 02:06:13 PST 2004


On Sun, Feb 15, 2004, Trent Nelson wrote:
>     For as long as I've been programming, I've always been under the
>     impression that CPUs will always predict a branch as being false 
>     the first time they come across it.
> 
>     Many, many years ago, I came across a DEC programming guide that
>     said the same thing.  It suggested using 'do { } while()' over
>     'while() {}' for loops as this ensured that the loop block was 
>     correctly pre-fetched (rather than whatever followed the loop) 
>     the first time the branch was encountered.
> 
>     To this day, I still write all my code in this fashion.  I was
>     wondering, have either CPU architectures or compilers improved 
>     over time such that this isn't an issue anymore?  Are CPUs able
>     to intelligently predict a branch the first time they encounter
>     it?  Do compilers re-order blocks to optimise pre-fetching and
>     minimise branch mis-prediction?
> 
>     With CPUs like the P4 -- with its 20-something stage pipeline
>     -- branch mis-prediction is expensive.  I realise certain CPUs
>     optimise their branch prediction units by maintaining branch
>     prediction histories, which would help when a branch is
>     encountered repeatedly, but does the old adage of "always
>     predict false" still hold true the first time a branch is
>     encountered?
> 
>     So, writing code such that the "true" branch is intended to be
>     executed far less than what comes after it; good practice or 
>     pointless habit?

Both CPUs and compilers have evolved considerably since then,
largely because modern processors such as the Pentium 4 have such
long pipelines.  Typically processors have branch history tables
with accuracy in the 95% to 99% range for many applications,
although there's a wide range of ways to implement these.  Some
processors rely on per-branch patterns, some use global patterns,
and some use combinations of the two.  Here are two good papers
on the topic:

	Marius Evers, Sanjay J. Patel, Robert S. Chappell, and Yale N. Patt,
	"Analysis of Correlation and Predictability:
	What Makes Two-Level Branch Predictors Work,"
	Proceedings of the International Symposium on Computer Architecture
	(ISCA-25), June 1998.

	Tse-Yu Yeh and Yale N. Patt,
	"Alternative Implementations of Two-Level Adaptive Branch Prediction,"
	Proceedings of the International Symposium on Computer Architecture
	(ISCA), 1992.
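
To make the per-branch versus global distinction a bit more
concrete, here is a toy sketch of mine (far simpler than real
hardware, and not something from the papers above) of a two-level,
gshare-style predictor: a global history register XORed with the
branch address indexes a table of 2-bit saturating counters.  Drop
the XOR and index by address alone and you get a purely per-branch
predictor; real designs also combine the two.

#include <stdio.h>

#define HIST_BITS  8                   /* length of the global history register */
#define TABLE_SIZE (1 << HIST_BITS)

static unsigned char counters[TABLE_SIZE]; /* 2-bit counters, 0..3; >= 2 means "taken" */
static unsigned int  ghist;                /* one bit of outcome per recent branch */

/* Predict by XORing the branch address with the global history (gshare). */
static int predict(unsigned int pc)
{
    unsigned int idx = (pc ^ ghist) & (TABLE_SIZE - 1);

    return counters[idx] >= 2;
}

/* Once the real outcome is known, nudge the counter toward it and
 * shift the outcome into the global history. */
static void update(unsigned int pc, int taken)
{
    unsigned int idx = (pc ^ ghist) & (TABLE_SIZE - 1);

    if (taken && counters[idx] < 3)
        counters[idx]++;
    else if (!taken && counters[idx] > 0)
        counters[idx]--;
    ghist = ((ghist << 1) | (taken ? 1 : 0)) & (TABLE_SIZE - 1);
}

int main(void)
{
    /* A branch that is taken 9 times out of 10: after a short warm-up
     * the predictor learns the pattern and rarely mispredicts. */
    int i, misses = 0;

    for (i = 0; i < 1000; i++) {
        int taken = (i % 10) != 9;

        if (predict(0x400123) != taken)
            misses++;
        update(0x400123, taken);
    }
    printf("%d mispredictions out of 1000\n", misses);
    return 0;
}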

Heuristics such as ``always predict false'' are uncommon.  The
first time a branch is encountered, many processors statically
predict taken for backward branches (because they are usually loop
branches) and not taken for forward branches.  However, branches
that are not yet in the branch history table make up a very small
proportion of all branches executed, so this static guess has only
a modest impact on overall performance.
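
As an illustration (my own toy example, not something from a
particular CPU manual), that static heuristic rewards the usual way
of laying out a loop with a rare error check: with optimization the
compiler typically puts the loop's conditional branch at the bottom,
where it jumps backward, and moves the rare return out of line
behind a forward branch.

#include <stdio.h>

/* The backward branch closing the loop is statically predicted taken,
 * which is right on every iteration but the last; the error check is
 * a forward branch statically predicted not taken, which is right in
 * the common case. */
static long sum_nonnegative(const int *a, int n)
{
    long s = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (a[i] < 0)           /* forward branch: predicted not taken */
            return -1;          /* rare error path sits out of line */
        s += a[i];
    }                           /* backward branch: predicted taken */
    return s;
}

int main(void)
{
    int a[4] = { 1, 2, 3, 4 };

    printf("%ld\n", sum_nonnegative(a, 4));
    return 0;
}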

Compilers are also pretty smart these days, and have been for
quite some time.  A good compiler will do the loop unrolling,
software pipelining, blocking, prefetching, and instruction
scheduling necessary to get good performance regardless of how you
write your loops.  Of course, in the specific case of 'while ()
{}' versus 'do { } while ()' the two loops do not have exactly the
same semantics; the former *requires* the compiler to perform the
test before entering the loop for the first time unless it can
prove the test unnecessary.  But chances are that this will be a
forward branch that is correctly predicted, so the cost will be
only a cycle or two.
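
For what it's worth, here is a minimal sketch of that difference and
of the ``loop rotation'' an optimizing compiler commonly performs;
the function names are just placeholders of mine.

#include <stdio.h>

/* Hypothetical payload, only here to make the sketch self-contained. */
static long work(long acc, int i)
{
    return acc + i;
}

/* 'while': the test must run before the first iteration, so the
 * generated code needs a branch around the body for the n == 0 case. */
static long sum_while(int n)
{
    long acc = 0;
    int i = 0;

    while (i < n) {
        acc = work(acc, i);
        i++;
    }
    return acc;
}

/* What an optimizing compiler typically turns the while loop into:
 * a guarded do-while ("loop rotation"), leaving a single backward
 * branch at the bottom that is taken on every iteration but the last. */
static long sum_rotated(int n)
{
    long acc = 0;
    int i = 0;

    if (i < n) {
        do {
            acc = work(acc, i);
            i++;
        } while (i < n);
    }
    return acc;
}

int main(void)
{
    printf("%ld %ld\n", sum_while(10), sum_rotated(10));
    return 0;
}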

Of course, the compiler doesn't always know what your code is
going to do at runtime, so it can't always make the best
optimizations.  Some commercial compilers can use profiling data
generated at runtime to perform better optimizations.  In gcc,
you can annotate performance-critical branches with
__builtin_expect(test, 1) or __builtin_expect(test, 0) to tell the
compiler whether you expect the test to be true or false.
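
A minimal sketch of how those builtins are typically wrapped; the
likely/unlikely macro names are just my shorthand, not something gcc
defines for you.

#include <stddef.h>
#include <stdio.h>

/* __builtin_expect(expr, c) returns expr and tells gcc that expr is
 * expected to equal c, so the cold path can be laid out away from the
 * fall-through path. */
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)

static int checksum(const unsigned char *buf, size_t len, long *out)
{
    long sum = 0;
    size_t i;

    if (unlikely(buf == NULL))      /* error path is expected to be cold */
        return -1;

    for (i = 0; i < len; i++)
        sum += buf[i];

    *out = sum;
    return 0;
}

int main(void)
{
    long sum;

    if (likely(checksum((const unsigned char *)"abc", 3, &sum) == 0))
        printf("%ld\n", sum);
    return 0;
}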

