svn commit: r222908 - head/lib/libstand

Fri Jun 10 07:08:57 UTC 2011

On Fri, 10 Jun 2011, Craig Rodrigues wrote:

> Log:
>  Obtain bswap64() function implementation from
>  version 1.3 of src/common/lib/libc/gen/bswap64.c in NetBSD.
>
>  Obtained from: NetBSD

Why not use the better (actually, differently bad, but in the correct
place to fix) implementation already in FreeBSD (in <machine/endian.h>?
libstand isn't so standalone that it can't use basic system headers.

> Modified: head/lib/libstand/bswap.c
> ==============================================================================
> --- head/lib/libstand/bswap.c	Thu Jun  9 23:12:23 2011	(r222907)
> +++ head/lib/libstand/bswap.c	Fri Jun 10 01:13:15 2011	(r222908)
> ...
> @@ -30,12 +30,28 @@ bswap32(u_int32_t x)
>
> u_int64_t
> bswap64(u_int64_t x)
> -{
> -	u_int32_t *p = (u_int32_t*)&x;
> -	u_int32_t t;
> -	t = bswap32(p[0]);
> -	p[0] = bswap32(p[1]);
> -	p[1] = t;
> -	return x;
> -}

This was adequate on some arches, but has style bugs and bad type puns
which might break it on other arches.

> -
> +{
> +#ifdef _LP64
> +	/*
> +	 * Assume we have wide enough registers to do it without touching
> +	 * memory.
> +	 */

There are negative reasons to optimize this for libstand, but
<machine/endian.h> already does similar optimizations, with messy ifdefs
for each of about 10 different arches.

> +	return  ( (x << 56) & 0xff00000000000000UL ) |
> +		( (x << 40) & 0x00ff000000000000UL ) |
> +		( (x << 24) & 0x0000ff0000000000UL ) |
> +		( (x <<  8) & 0x000000ff00000000UL ) |
> +		( (x >>  8) & 0x00000000ff000000UL ) |
> +		( (x >> 24) & 0x0000000000ff0000UL ) |
> +		( (x >> 40) & 0x000000000000ff00UL ) |
> +		( (x >> 56) & 0x00000000000000ffUL );

This has different style bugs (excessive whitespace), but it otherwise
formatted better than the similar expression which is spammed into too
many <machine/endian.h>'s.  This expression is MI except for the constants
in it.  But it is far from optimal with some compilers, so too many of
<machine/endian.h>'s only use it when a compile-time constant expression
is needed -- they use asms with bswap instructions in them in other cases.
OTOH, this expression is optimal for clang, both for amd64 and i386 IIRC
-- clang turns it into 1 bswap for amd64, and (after adjusting the constants)
into 2 bswaps for i386.  gcc3.x and gcc-4.2.1 generate large slow code for
this expression.

> +#else
> +	/*
> +	 * Split the operation in two 32bit steps.
> +	 */
> +	u_int32_t tl, th;
> +
> +	th = bswap32((u_int32_t)(x & 0x00000000ffffffffULL));
> +	tl = bswap32((u_int32_t)((x >> 32) & 0x00000000ffffffffULL));
> +	return ((u_int64_t)th << 32) | tl;

This is a good way to avoid the type puns in the original version,
except for the style bugs in it (use of the long long abomination, in
a way that doesn't even have any effect).  On i386, this results in
essentially the same good code that clang generates for the big MI
expression (one (virtual) swap of a register pair to swap 2 32-bit
words, then 2 bswaps to swap the words internally. then more virtual
instructions to recombine the words), provided bswap32 is inline and
reduces to the bswap instruction, either via an asm or by optimization
(but in libstand it is neither).  (At least if bswap32() in the above
is replaced by _bswap32() from <machine/endian.h>, clang optimizes the
above to exactly the same 2 bswaps as it does for the big MI expression;
gcc is not so good, and generates about 8 movl instructions instead
of the virtual swapping.)  The above is not so optimal for LP64
(even clang doesn't manage to convert the 2 32-bit bswaps to 1 64-bit
one, at least if bswap32() is in asm), but is adequate in libstand.

> +#endif
> +}

Anyway, the messy ifdefs and micro-optimizations for this should not be
duplicated here.  When gcc optimization catches up with where clang
already is, all the messy ifdefs and MD asms in <machine/endian.h>'s
can be replaced by single MI expressions, and everything that doesn't
roll its own endian functions will benefit automatically.

The different badness now in i386/include/endian.h is now:
- the big MI expressions for bswap32 and bswap64 (named __bswapNN_const)
   are spammed into this and all (?) other MD files
- these expressions are missing some casts.  For example, if the _x arg
   is either signed (with the sign bit set), or of a larger type (with
   higher bits set), then (_x) >> 56 gives a wrong value.
- these expressions are poorly formatted
- the big MI expression for bswap64 is used for both the "const" and the
   "var" cases.  This is good for the "const" case, but for the "var" case
   it pessimal for gcc.  This badness is sort of the opposite of the above
   for libstand -- i386/include/endian.h should optimize and use the !_LP64
   code above, while libstand doesn't need this optimization but has it.
   amd64 doesn't have this problem since it uses a MD asm for bswap64_var.

There is a PR about bswap64 not being optimized for i386.  Apparently,
the optimization actually matters for some cases (for something like
swab(3) on large data?).  See the PR for more details.  My followup
explains why fancy asms are not needed to fix this -- as above, the
big MI expression works perfectly for clang, but for gcc an optimization
like the above is good (generate 2 bswap instructions using the asms
for bswap32, but don't use asms to combine these.  But use a better C
expression than the above to combine them, so that gcc doesn't generate
so many movls).

Bruce