cvs commit: src/include _ctype.h

Wed Oct 31 15:30:34 PDT 2007

On Mon, 29 Oct 2007, Christoph Mallon wrote:

> Andrey A. Chernov wrote:
>> ache        2007-10-27 22:32:28 UTC
>> 
>>   FreeBSD src repository
>> 
>>   Modified files:
>>     include              _ctype.h
>>   Log:
>>   Micro-optimization of prev. commit, change
>>   (_c < 0 || _c >= 128) to (_c & ~0x7F)
>>     Revision  Changes    Path
>>   1.33      +1 -1      src/include/_ctype.h
>
> Actually this is rather a micro-pessimisation. Every compiler worth its money 
> transforms the range check into single unsigned comparison. The latter test 
> on the other hand on x86 gets probably transformed into a test instruction. 
> This instruction has no form with sign extended 8bit immediate, but only with 
> 32bit immediate. This results in a significantly longer opcode (three bytes 
> more) than a single (unsigned)_c > 127, which a sane compiler produces. I 
> suspect some RISC machines need one more instruction for the 
> "micro-optimised" code, too.
> In theory GCC could transform the _c & ~0x7F back into a (unsigned)_c > 127, 
> but it does not do this (the only compiler I found, which does this 
> transformation, is LLVM).
> Further IMO it is hard to decipher what _c & ~0x7F is supposed to do.

Indeed.

In fact, one of the cleanups/optimizations in rev.1.5 and 1.6 by ache
and me was to get rid of the mask.  There was already a check for _c
< 0, so the mask cost even more.  The top limit was 256 instead of
128, so the point about 8bit immediates didn't apply, but I don't know
of any machines where the mask is faster (didn't look hard :-).  OTOH,
_c is often a char or a u_char (it is declared as mumble_rune_t, but
the functions are inline so the compiler can see the original type).
If _c is u_char and u_char is uint8_t, then (_c < 0 || _c >= 256) is
always false, so the compiler should generate no code for it.  The top
limit of 256 was preferred so that this optimization is possible.  A
top limit of 128 doesn't work so well, since a u_char can still hold
128..255 and the comparison has to stay.
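
A rough sketch of the point about the top limit, assuming 8-bit chars.
The function names are made up for illustration and are not the real
_ctype.h code.  It only counts, at run time, how many u_char values
each check rejects; a count of zero is what lets the compiler drop the
check once it can see that the argument is a u_char.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins for the two range checks; not the _ctype.h code. */
static inline int out_of_range_256(int c) { return (c < 0 || c >= 256); }
static inline int out_of_range_128(int c) { return (c < 0 || c >= 128); }

/* Count the unsigned char values rejected with a top limit of 256. */
static int rejected_256(void)
{
    int n = 0;

    for (int v = 0; v <= UCHAR_MAX; v++)
        n += out_of_range_256((unsigned char)v);
    return (n);
}

/* Count the unsigned char values rejected with a top limit of 128. */
static int rejected_128(void)
{
    int n = 0;

    for (int v = 0; v <= UCHAR_MAX; v++)
        n += out_of_range_128((unsigned char)v);
    return (n);
}

/* Usage: with 8-bit chars the first check never fires, the second does. */
static void demo(void)
{
    assert(UCHAR_MAX == 255);       /* assumption: 8-bit chars */
    assert(rejected_256() == 0);    /* always false for u_char */
    assert(rejected_128() == 128);  /* 128..255 still rejected */
}
```

With the limit at 256 there is nothing left to test for a u_char
argument; with 128 half of the values still need the comparison.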

I would have worried about the 1's complement case.  I think a mask
without a check for _c < 0 is plain broken in the 1's complement case,
but this case is too hard to think about.  Just do a range comparison,
which will always work, and let the compiler reduce it using 2's
complement or 1's complement tricks if possible.  Since 1's complement
machines are rare, write the code so that it is easy for the compiler
to optimize in the 2's complement case.
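
For the 2's complement case, the three spellings of the test can be
compared directly.  This is only a sketch with made-up function names.
It assumes a 2's complement int and says nothing about 1's complement
machines or about which form compiles to the shortest code.

```c
#include <assert.h>

/* The plain range comparison: portable, and what the compiler can reduce. */
static inline int by_range(int c) { return (c < 0 || c >= 128); }

/* The single unsigned comparison a compiler may reduce it to. */
static inline int by_unsigned(int c) { return ((unsigned)c > 127); }

/* The mask form from the commit; relies on the 2's complement layout. */
static inline int by_mask(int c) { return ((c & ~0x7F) != 0); }

/*
 * Return 1 if all three forms agree for every value in [lo, hi].
 * hi must be less than INT_MAX so that the loop counter cannot overflow.
 */
static int forms_agree(int lo, int hi)
{
    for (int c = lo; c <= hi; c++) {
        int r = by_range(c);

        if (by_unsigned(c) != r || by_mask(c) != r)
            return (0);
    }
    return (1);
}
```

Calling forms_agree(-70000, 70000) on a 2's complement machine should
return 1, so the range comparison loses nothing there and stays correct
elsewhere.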

Pipelining might make the old optimizations in ctype uninteresting.  Maybe
everything is almost free except for the table lookup (although that is
cached, it will sometimes miss).  I haven't timed this lately.

Bruce