svn commit: r280279 - head/sys/sys

Bruce Evans brde at optusnet.com.au
Sat Mar 21 22:41:58 UTC 2015


On Sat, 21 Mar 2015, John Baldwin wrote:

> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>> Author: jhb
>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>> New Revision: 280279
>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>
>>>>> Log:
>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>>   and create a "hidden" API that can be used in other system headers without
>>>>>   adding namespace pollution.
>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>>     to software implementations.
>>>> Are you aware of the Haswell erratum HSD146?  I see the described behaviour
>>>> on machines back to SandyBridge, but not on Nehalems.
>>>> HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
>>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>>> delayed until previous non-dependent instructions have executed.
>>>>
>>>> Jilles noted that gcc head and 4.9.2 already provide a workaround by
>>>> xoring the dst register.  I have some patch for amd64 pmap, see the end
>>>> of the message.
>>>
>>> No, I was not aware, but I think it's hard to fix this anywhere but the
>>> compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
>>> uses POPCOUNT for this function from ACPI-CA:
>>>
>>> static UINT8
>>> AcpiRsCountSetBits (
>>>     UINT16                  BitField)
>>> {
>>>     UINT8                   BitsSet;
>>>
>>>
>>>     ACPI_FUNCTION_ENTRY ();
>>>
>>>
>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>     {
>>>         /* Zero the least significant bit that is set */
>>>
>>>         BitField &= (UINT16) (BitField - 1);
>>>     }
>>>
>>>     return (BitsSet);
>>> }
>>>
>>> (I ran into this accidentally: a kernel built on my system failed to
>>> boot in older qemu because the kernel panicked with an illegal instruction
>>> fault in this function.)

Does it do the same for the similar home-made popcount in pmap?

X /*
X  * Returns the number of one bits within the given PV chunk map element.
X  */
X static int
X popcnt_pc_map_elem(uint64_t elem)
X {
X 	int count;
X 
X 	/*
X 	 * This simple method of counting the one bits performs well because
X 	 * the given element typically contains more zero bits than one bits.
X 	 */
X 	count = 0;
X 	for (; elem != 0; elem &= elem - 1)
X 		count++;
X 	return (count);
X }

This is actually the same method as in ACPI.  It is just missing the
ACPI version's mounds of style bugs (mostly Windows spellings; also a
bogus cast to UINT16 that has no effect).

Perhaps this appeared to perform well not for the reason stated in its
comment, but because it was tested with excessive CFLAGS, and the compiler
actually reduced it to popcntq, and the test machine didn't have the
bug.

In a game program that I wrote, efficiency of popcount() was actually
important since it was used in inner loops.  I used lookup tables for
popcount() and a couple of application-specific functions and even for
fls().  This seemed to be better than the existing bsf instruction,
so it can't be bad for popcount().  The lookup was limited to bitsets
of length 13 in the low bits of an int, so the table sizes were only
2**13 = 8192.  Bitsets of length 64 would need multiple steps.  However,
if the above performs well without compiler optimizations, then it must
be because usually only the lowest few bits are set.  Then it would
perform even better with even smaller (8-bit) tables combined with an
early exit:

 	count = 0;
 	for (; elem != 0; elem >>= 8)
 		count += lookup[elem & 0xff];
 	return (count);
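
A self-contained version of that sketch might look as follows (the
names lookup8, init_lookup8 and popcount64_table are illustrative,
not from any tree; the table is built once at startup):

#include <stdint.h>

/* lookup8[i] is the number of one bits in i, for 0 <= i < 256. */
static uint8_t lookup8[256];

static void
init_lookup8(void)
{
	int i;

	/* lookup8[] is zero-initialized, so lookup8[0] stays 0. */
	for (i = 1; i < 256; i++)
		lookup8[i] = (i & 1) + lookup8[i >> 1];
}

/* Count one bits 8 at a time, with an early exit once the rest is 0. */
static int
popcount64_table(uint64_t elem)
{
	int count;

	count = 0;
	for (; elem != 0; elem >>= 8)
		count += lookup8[elem & 0xff];
	return (count);
}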

Many variations on this are possible.  E.g., use fls() to decide where to
start; or avoid all branches, and add up the results in parallel.  For
bitcount64, the latter is:

 	bitcount64(x) := bitcount32(x & 0xffffffff) + bitcount32(x >> 32);
 	bitcount32(x) := bitcount16(x & 0xffff) + bitcount16(x >> 16);
 	bitcount16(x) := bitcount8(x & 0xff) + bitcount8(x >> 8);
 	bitcount8(x) := lookup[x];
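
In C, reusing the illustrative lookup8[] table from the sketch above
(the bitcount* names here mirror the pseudocode, not the <sys/types.h>
definitions):

#include <stdint.h>

static uint8_t lookup8[256];	/* filled as in the previous sketch */

/* Branch-free: split the word, count both halves, add the counts. */
static int
bitcount8(uint8_t x)
{
	return (lookup8[x]);
}

static int
bitcount16(uint16_t x)
{
	return (bitcount8(x & 0xff) + bitcount8(x >> 8));
}

static int
bitcount32(uint32_t x)
{
	return (bitcount16(x & 0xffff) + bitcount16(x >> 16));
}

static int
bitcount64(uint64_t x)
{
	return (bitcount32(x & 0xffffffff) + bitcount32(x >> 32));
}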

Compilers won't be able to optimize the lookup methods.  They might be
able to convert the current bitcount*() to popcnt*.  Last time I looked,
clang but not gcc converted very complicated expressions for bswap*()
to bswap* instructions.
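
For reference, the bitcount*() being discussed is the familiar
mask-and-add (SWAR) reduction; the "large constants" mentioned below
are the masks in something of this shape (a sketch, not the verbatim
<sys/types.h> code, which spells the names with leading underscores):

#include <stdint.h>

static inline int
bitcount64_swar(uint64_t x)
{
	/* Sum adjacent 1-bit, 2-bit and 4-bit fields in parallel. */
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* The multiply adds the eight byte sums into the top byte. */
	return ((x * 0x0101010101010101ULL) >> 56);
}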

>>> There's no way we can preemptively locate every bit of C that clang might
>>> decide to replace with popcount and fix them to employ a workaround.  I think
>>> the only viable solution is to use "-mno-popcount" on all compilers that aren't
>>> known to be fixed.

Why bother?  It is only a short-lived and small(?) efficiency bug.

>> Both you and Bruce and Jilles state this, but I am unable to understand the
>> argument about the compiler.  Can somebody explain this to me, please?

It is also to avoid complications for short-lived optimizations.  The
3 uses of popcntq() in amd64 pmap cost:
- 9 lines of inline asm.
- 19 lines for the C version.
- 9 lines instead of 3 for the 3 uses.
When popcntq is not available, the result is a pessimization, since the
optimizations are not complicated enough to be useful in that case.
They add the runtime overhead of a branch in order to call a C version
that is presumably worse than using the existing bitcount32() (possibly
twice) or the compiler builtin.
   (jhb's cleanups didn't bother to optimize to use the builtin in
   all cases, since it is hard to tell whether the builtin is any good
   when it falls back to software.  If you remove the 19 lines for
   popcnt_pc_map_elem() and replace the calls to it by __builtin_popcountl(),
   then the results would be:
   - compile-time failure for old unsupported compilers that don't have this
     builtin
   - usually, a link-time failure for gcc-4.2 through gcc-4.8.  gcc
     generates a call to __popcountdi2 unless CFLAGS enables the hardware
     instruction.  __popcountdi2 is in libgcc but not in libkern.
   - usually, for clang, 3 inline copies of the same code that would be
     produced if FreeBSD's bitcount64() were used.  clang unrolls things
     excessively, giving enormous code.  Unrolling allows it to load the
     large constants in __bitcount64() only once, but large code is still
     needed to use these constants.
   - runtime failures in misconfigured cases where CFLAGS doesn't match
     the hardware.  Both gcc-4.8 and clang produce popcntq.  The runtime
     check doesn't help since the compilers produce popcntq for the C
     case.  Using __builtin_popcountl() asks for this directly.  Using
     __bitcount64() asks for it indirectly, and gets it for clang.
     Using popcnt_pc_map_elem() may avoid getting it for clang, but
     apparently still gets it.)
When popcntq is available, the test to decide whether to use it is a
small pessimization.  It might take longer than a branch-free software
method, but that is unlikely since one test covers 3 popcounts.
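
For concreteness, the xor-the-destination workaround that Jilles
mentioned looks roughly like this when written as inline asm by hand
(a sketch in kernel style, not the committed pmap code):

static __inline uint64_t
popcntq(uint64_t mask)
{
	uint64_t result;

	/*
	 * Zeroing the destination first breaks the false dependency
	 * on its old contents (erratum HSD146) that makes POPCNT
	 * slower than expected on the affected microarchitectures.
	 */
	__asm __volatile("xorl %k0,%k0; popcntq %1,%0"
	    : "=&r" (result) : "rm" (mask));
	return (result);
}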

> If you compile with -mpopcount (or a march that includes it like -march=corei7)
> then the compiler will assume that popcount is available and will use it without
> doing any runtime checks.  In the case of my sandybridge desktop, I have
> CPUTYPE set to "corei7-avx" in /etc/make.conf which adds "-march=corei7-avx"
> to CFLAGS, so the compiler assumes it can use POPCOUNT (as well as newer
> SSE and AVX).  In this case the compiler recognized the pattern in the C
> function above as doing a population count and replaced the entire function with
> a POPCOUNT instruction even though the C code doesn't explicitly try to use it.

The runtime error for the unsupported instruction is probably not important,
since using misconfigured CFLAGS asks for problems.  In general, any new
instruction may trap, and the hardware/CFLAGS mismatch must be small for
popcntq to be the only instruction that traps.

> However, if we know that a compiler doesn't employ the workaround for this
> errata, we can employ -mno-popcount so that using a custom CPUTYPE does not
> result in use of POPCOUNT except for known-ok compilers.  Either that or
> we use a more elaborate set of #if checks in <sys/types.h> that only uses
> __builtin_popcount() for known-ok compilers (and possibly uses the
> clear-destination workaround for other compilers if -mpopcount is enabled).

I think the runtime slowness from a buggy instruction is also unimportant.

>> Compiler cannot generate popcntq instructions unless it is directed to
>> generate code not for amd64, but for some specific microarchitecture.
>> Any use of bitcount in generic kernel or userspace is going to be a
>> SWAR. In other words, #ifdef __POPCNT__ is for non-default settings of
>> the FreeBSD compilation environment.
>
> Correct.
>
>> And even then, the compiler must issue cpuid to fetch the feature mask,
>> which arguably it cannot do due to the cost of that serializing instruction.
>> The compiler could generate the call to the instruction in init code, but
>> this is impossible for a freestanding environment, since the compiler
>> does not know where to put the init code.
>
> No, the compiler does not have to check cpuid.  If you've told it to compile
> for a given CPU type, it has tables to hardcode the features known to be
> present on those CPUs and just uses them.  If you use the wrong CPU type
> you get illegal instruction faults. :)  (As I did in my qemu test case.)
>
>>> In regards to whether or not this should be a dynamic test instead, it seems a
>>> bit much to require a dynamic check for code that has to work in both userland
>>> and the kernel, and I'm not sure if something like CPU_COUNT() would actually
>>> benefit from that versus just always using the software solution instead.
>>
>> I agree.
>
> Hence why I opted to only use POPCOUNT if someone explicitly enables it via
> -mpopcount or -march flags.  If you've told the compiler it's ok to use it
> without any checks, then it will use POPCOUNT, otherwise it will use the
> SWAR method.

Always using the new API would lose the micro-optimizations given by the
runtime decision for default CFLAGS (used by distributions for portability).
To keep them, it seems best to keep the inline asm but replace
popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
be used to work around slowness in the nominally-software case (where
the compiler turns the C code into the buggy hardware instruction anyway).
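
Concretely (a sketch; pc_map_popcnt is a hypothetical name for the
helper at the call sites, popcntq() is the inline asm sketched earlier,
and cpu_feature2/CPUID2_POPCNT are the kernel's existing feature-bit
names):

#include <sys/types.h>		/* __bitcount64() */
#include <machine/md_var.h>	/* cpu_feature2 */
#include <machine/specialreg.h>	/* CPUID2_POPCNT */

static int
pc_map_popcnt(uint64_t elem)
{
	/*
	 * The feature bit was latched at boot, so this is an ordinary
	 * well-predicted branch, not a CPUID at each call.
	 */
	if ((cpu_feature2 & CPUID2_POPCNT) != 0)
		return (popcntq(elem));		/* hardware */
	return (__bitcount64(elem));		/* SWAR fallback */
}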

Bruce

