svn commit: r232266 - in head/sys: amd64/include i386/include pc98/include x86/include

Jilles Tjoelker jilles at stack.nl
Fri Mar 2 13:41:51 UTC 2012


On Thu, Mar 01, 2012 at 11:46:20PM +0100, Tijl Coosemans wrote:
> On Wednesday 29 February 2012 08:44:54 Bruce Evans wrote:
> > On Wed, 29 Feb 2012, Bruce Evans wrote:
> >> I cleaned this up a bit according to ideas in my previous mails, and
> >> added a comment about the confusing use of __bswap64_const() (now
> >> named __bswap64_gen()) in __bswap64_var():

> > A minor problem with only having a macro version for __bswap16() turned
> > up:

> >> % -#define	__bswap16_const(_x)	(__uint16_t)((_x) << 8 | (_x) >> 8)
> >> % -
> >> % -#define	__bswap16(_x)			\
> >> % -	(__builtin_constant_p(_x) ?	\
> >> % -	    __bswap16_const((__uint16_t)(_x)) : __bswap16_var(_x))
> >> ...
> >> % +#define	___bswap16(x)	(__uint16_t)((x) << 8 | (x) >> 8)
> >> % +#define	__bswap16(x)	(___bswap16((__uint16_t)(x)))

> > When x is a non-volatile variable, gcc and clang produce the good code
> > "rolw $8,x" for "x = __bswap16(x);" on short x.  But when x is a volatile
> > variable, gcc and clang produce fairly horrible code, with 2 loads of
> > x corresponding to the 2 accesses to x.  This is probably required by
> > volatile semantics, and is a problem for all unsafe macros, especially
> > when their name says that they are safe (oops).  When __bswap16 is
> > implemented as an inline function for the var case like it used to be,
> > it only loads x once and there are no problems with volatile variables.
> > Optimizing to "rolw $8,x" might still be possible iff x is not volatile,
> > but load-modify-store is probably better anyway.
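
To make the double evaluation concrete, here is a minimal standalone
sketch (with <stdint.h> names instead of the header-private __uint16_t)
of what the unsafe macro does to a volatile operand:

#include <stdint.h>

/* The unsafe macro: its argument appears twice in the expansion. */
#define	___bswap16(x)	(uint16_t)((x) << 8 | (x) >> 8)
#define	__bswap16(x)	(___bswap16((uint16_t)(x)))

volatile uint16_t v;

void
f(void)
{
	/* Expands to two reads of v; volatile semantics force the
	 * compiler to keep both loads. */
	v = __bswap16(v);
}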

Assuming there is no use for keeping the value in a register, the instruction
with a memory operand tends to be slightly faster on modern CPUs and
(more importantly) is shorter than the three instructions necessary
otherwise.
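
To illustrate (a sketch; the mnemonics are the two i386/amd64 sequences
being compared, not guaranteed compiler output), both alternatives
implement the same source statement:

#include <stdint.h>

uint16_t x;

void
swap_x(void)
{
	x = (uint16_t)(x << 8 | x >> 8);
	/*
	 * Read-modify-write on the memory operand, one instruction:
	 *	rolw	$8, x
	 *
	 * Load-modify-store, three instructions and more bytes:
	 *	movzwl	x, %eax
	 *	rolw	$8, %ax
	 *	movw	%ax, x
	 */
}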

> > So any macro version must use gcc features to be safe.  The following
> > seems to work:

> > #define	__bswap16(x)	__extension__ ({ __uint16_t __x = (x);	\
> > 	(__uint16_t)(__x << 8 | __x >> 8); })

> > clang now produces "rolw $8,x" when x is volatile.  This seems to
> > violate volatile semantics.  gcc produces load-rolw-store then.  Both
> > produce "rolw $8,x" when x is not volatile.

I don't think volatile requires a read-modify-write to be split into
separate load and store instructions, even when there is only one core
or only one thread. Whether load-rolw-store or rolw on a memory operand
is used, the memory location is read once and written once, as volatile
semantics require.
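
A minimal standalone sketch of the statement-expression form quoted
above, applied to a volatile object (again with <stdint.h> names):

#include <stdint.h>

#define	__bswap16(x)	__extension__ ({	\
	uint16_t __x = (x);			\
	(uint16_t)(__x << 8 | __x >> 8);	\
})

volatile uint16_t port;

void
g(void)
{
	/* The argument is evaluated once: one load and one store,
	 * whether the compiler emits load-rolw-store or a single
	 * rolw on memory. */
	port = __bswap16(port);
}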

> I'll have a closer look at the patch tomorrow, but the Intel
> documentation for the bswap instruction recommends using xchg for
> bswap16.

So does the AMD documentation. This seems strange, because rolw $8 is
both more flexible (it allows memory operands and registers other than
%ax/%bx/%cx/%dx) and faster on almost all CPUs (per Agner Fog's
instruction tables; on CPUs that split partial registers, xchg also pays
a penalty for recombining the high and low bytes).

The xchg instruction is two bytes shorter, but that is probably not
worth it unless you are writing a boot loader or similar.
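
For reference, a sketch of the two forms as GNU inline assembly (byte
counts are for the %ax encodings):

#include <stdint.h>

static inline uint16_t
bswap16_rolw(uint16_t v)
{
	/* 4 bytes (66 C1 C0 08 for %ax); works on any 16-bit
	 * register or memory operand. */
	__asm__("rolw $8, %0" : "+r" (v));
	return (v);
}

static inline uint16_t
bswap16_xchg(uint16_t v)
{
	/* 2 bytes (86 E0 for %al/%ah); limited to %ax/%bx/%cx/%dx,
	 * whose high bytes are addressable, and may pay a
	 * partial-register penalty when the halves are recombined. */
	__asm__("xchgb %b0, %h0" : "+Q" (v));
	return (v);
}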

Do take into account that documentation like the "Instruction Set
Reference" is intended to describe how the instructions work, not how to
use them to obtain the very best performance on a particular chip.

-- 
Jilles Tjoelker

