svn commit: r227812 - head/lib/libc/string

Bruce Evans brde at optusnet.com.au
Tue Nov 22 11:21:14 UTC 2011


On Mon, 21 Nov 2011 mdf at FreeBSD.org wrote:

> On Mon, Nov 21, 2011 at 6:50 PM, Eitan Adler <eadler at freebsd.org> wrote:
>> Author: eadler (ports committer)
>> Date: Tue Nov 22 02:50:24 2011
>> New Revision: 227812
>> URL: http://svn.freebsd.org/changeset/base/227812
>>
>> Log:
>>  - fix some style(9) nits with my last commit
>>  - add a comment explaining why I used '|' instead of '||'
>>
>>  Submitted by: danfe@
>>  Approved by:  emaste@
>>
>> Modified:
>>  head/lib/libc/string/strcasecmp.c
>>  head/lib/libc/string/strncmp.c

Hard 0xa0's make the quotes almost unreadable :-(.

>> Modified: head/lib/libc/string/strcasecmp.c
>> ==============================================================================
>> --- head/lib/libc/string/strcasecmp.c   Tue Nov 22 02:27:59 2011        (r227811)
>> +++ head/lib/libc/string/strcasecmp.c   Tue Nov 22 02:50:24 2011        (r227812)
>> @@ -49,7 +49,7 @@ strcasecmp_l(const char *s1, const char
>>                        *us1 = (const u_char *)s1,
>>                        *us2 = (const u_char *)s2;
>>        if (s1 == s2)
>> -           return (0);
>> +               return (0);
>>
>>        FIX_LOCALE(locale);
>>
>> @@ -73,8 +73,9 @@ strncasecmp_l(const char *s1, const char
>>                        *us1 = (const u_char *)s1,
>>                        *us2 = (const u_char *)s2;
>>
>> -       if (( s1 == s2) | (n == 0))
>> -           return (0);
>> +       /* use a bitwise or to avoid an additional branch instruction */
>> +       if ((s1 == s2) | (n == 0))
>> +               return (0);
>
> I guess I'm a little confused.  Do we really have profiling
> information at this level that suggests the overhead of the branch is
> significant?  I thought most hardware had pretty good
> branch-prediction, particularly with speculative execution.

Every time I have profiled the string functions, changes like this always
have an insignificant effect, even in benchmarks.  Normal code spends
perhaps 0.01% of its time calling string functions, so the improvements
from optimizing string functions are 0.01% of an insignificant amount.
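
A sketch of the kind of throwaway microbenchmark that shows this, in case
anyone wants to measure it themselves -- the strings, the iteration count
and the use of clock_gettime() are arbitrary choices, not anything from
the tree, and the varying length argument is only there to stop the
compiler from hoisting the call out of the loop:

	#include <stdio.h>
	#include <strings.h>
	#include <time.h>

	#define	NITER	10000000L

	int
	main(void)
	{
		struct timespec t0, t1;
		const char *a = "The quick brown fox";
		const char *b = "THE QUICK BROWN FOX";
		double dt;
		long i;
		int r;

		/*
		 * Time strncasecmp() calls; a saved branch is ~1 cycle
		 * against the per-call cost reported here.
		 */
		r = 0;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < NITER; i++)
			r += strncasecmp(a, b, (size_t)(i & 15) + 1);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("%ld calls in %.3f s (%.1f ns/call), r = %d\n",
		    NITER, dt, dt * 1e9 / NITER, r);
		return (0);
	}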

If this optimization were any good, then the compiler would already do
it.  In fact, gcc-4.2.1 already does it, just in the reverse direction -- it
rewrites:

 	"if ((i == 0) | (j == 0)) return; test();"

into:

 	"if (i == 0 || j == 0) return; test();"

with most optimization flags that I tried.  This happens even at -O
with -march=i386, although since a plain i386 doesn't have branch
target prediction and has slow branches, the first version is probably
actually faster for a plain i386.  But gcc-3.3.3 doesn't do this
rewrite, even with -O3 -march=athlon-xp.  For i386, the generated code
is quite slow:

gcc-3.3.3 -O -march=i386:
% 	cmpl	$0, i
% 	sete	%dl
% 	cmpl	$0, j
% 	sete	%al
% 	orl	%edx, %eax
% 	testb	$1, %al
% 	jne	.L3
% 	...			# call to test() not shown here or below
% .L3:
% 	leave
% 	ret

`sete' is quite slow on plain i386 too.  For athlon-xp:

gcc-3.3.3 -O3 -march=athlon-xp:
% 	movl	j, %edx
% 	movl	i, %eax
% 	testl	%eax, %eax
% 	sete	%cl
% 	testl	%edx, %edx
% 	sete	%al
% 	orl	%ecx, %eax
% 	testb	$1, %al
% 	je	.L4
% .L3:
% 	leave
% 	ret

This is essentially the same, except it uses load-test instead of cmp-mem,
and the branch is taken in the opposite case.

Weirdly, gcc42 does almost exactly the opposite with i386 and athlon-xp:

gcc-4.2.1 -O -march=i386:
% 	cmpl	$0, i
% 	je	.L5
% 	cmpl	$0, j
% 	je	.L5
% 	...
% .L5:
% 	leave
% 	ret

gcc-4.2.1 -O3 -march=athlon-xp:
% 	movl	i, %edx
% 	movl	j, %eax
% 	movl	%esp, %ebp
% 	testl	%edx, %edx
% 	sete	%dl
% 	testl	%eax, %eax
% 	sete	%al
% 	orb	%al, %dl
% 	je	.L7
% 	leave
% 	ret

I guess this is because it knows that 'sete' is slow on i386.  But for
nocona, it goes back to rewriting the expression to use || so that there
are 2 branches:

gcc-4.2.1 -O3 -march=nocona:
% 	movl	i, %edx
% 	testl	%edx, %edx
% 	je	.L5
% 	movl	j, %eax
% 	testl	%eax, %eax
% 	jne	.L7
% .L5:
% 	popl	%ebp
% 	ret

This is the same as for plain i386 and plain -O, except it uses load-test
instead of cmp-mem, the same as gcc-3.3.3 does for athlon-xp.  It doesn't
take the branch in the opposite case like gcc-3.3.3 does for athlon-xp.
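
For reference, the listings above were presumably produced by compiling
(cc -S with the flags shown) a test file along these lines; the file and
function names aren't visible in the output, so this is only a minimal
reconstruction that matches the generated code, with i, j and test()
external so the comparisons can't be optimized away:

	int i, j;
	void test(void);

	void
	check(void)
	{
		if ((i == 0) | (j == 0))
			return;
		test();
	}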

> Wouldn't something like __predict_false() have more value for
> performance,

Unlikely.  Both writing | vs ||, and writing __predict_foo(), are only
hints to the compiler, since the compiler knows how to rewrite the
former but probably shouldn't unless it is sure, and it can ignore
__predict_foo() but probably shouldn't unless it is sure.  Or perhaps
it trusts the programmer's directives for branches as much as ones for
`register'.  Similarly for `if (foo) ...; else ...;' vs `if (!foo)
etc' - the programmer may have chosen the order carefully, but who
knows?  gcc-3.3.3 and gcc-4.2.1 -O3 should know about the behaviour
of branches on athlon-xp, but the above shows that they disagree about
whether the branch should be taken.  IIRC, with a cold branch target
cache on athlon-xp, forward branches are predicted as not taken and
backwards branches are predicted as taken (so that loops work right).
Since returning is the unusual case, we want to branch to it.  This
is what happens on i386 and nocona, but not on athlon-xp (with both
compilers).  Perhaps I remember the rule backwards but gcc knows it.
On plain i386, taken branches are slower, so branching to the return
is best, and is done.  OTOH, if forward branches are predicted as being
not taken, and the programmer tries to optimize for this without using
__predict_foo(), then the program order should probably be:

 	if (foo)
 		usual_case();
 	else
 		unusual_case();

This conflicts with early returns probably being the unusual case,
since

 	if (foo)
 		return;
 	otherstuff();
 	if (bar)
 		return;
 	morestuff();

doesn't require excessive nesting like:

 	if (!foo) {
 		otherstuff();
 		if (!bar)
 			morestuff();
 	}
 	return;

so hopefully the compiler generates branches to returns.  But why does
gcc do the opposite on athlon-xp?

I think using __predict_foo() is only useful in much larger functions
than the ones here.  Most modern CPUs have very good branch predictors
so an extra branch or two doesn't matter (each predicted branch takes
~1 cycle, the same as an arithmetic OR operator).  I believe the wins
from using __predict_foo() are mainly when it allows moving large code
for the unusual case far away, leaving only small code for the usual
case nearby; the code for the unusual case then doesn't usually interfere
with caches.
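
For completeness: on FreeBSD, __predict_true() and __predict_false() are
thin wrappers around gcc's __builtin_expect() in <sys/cdefs.h> (roughly,
__predict_false(exp) is __builtin_expect((exp), 0)).  A hinted version of
the early return would look something like the sketch below; this is a
simplified, locale-less strncasecmp written only to show where the hint
goes, not the libc implementation:

	#include <sys/cdefs.h>		/* __predict_false() on FreeBSD */
	#include <sys/types.h>		/* u_char, size_t */
	#include <ctype.h>

	int
	my_strncasecmp(const char *s1, const char *s2, size_t n)
	{
		const u_char *us1 = (const u_char *)s1;
		const u_char *us2 = (const u_char *)s2;

		/* Hint that the early return is the unusual case. */
		if (__predict_false(s1 == s2 || n == 0))
			return (0);
		do {
			if (tolower(*us1) != tolower(*us2))
				return (tolower(*us1) - tolower(*us2));
			if (*us1 == '\0')
				return (0);
			us1++, us2++;
		} while (--n != 0);
		return (0);
	}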

> or is all this just guess-work?  I would much rather have
> the code say what it means unless there's real, measurable performance
> differences from doing otherwise.

The best optimization is extremely machine-dependent.  I usually hope
the compiler knows how to do it and don't try to optimize stuff like
this.  I usually find that the compiler barely knows what to do, but
it can do better than me across a range of CPUs.

Bruce

