svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

Tue Jan 31 06:25:34 UTC 2017

Hi Bruce,

On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans <brde at optusnet.com.au> wrote:
> On Tue, 31 Jan 2017, Conrad E. Meyer wrote:
>
>> Log:
>>  calculate_crc32c: Add SSE4.2 implementation on x86
>
>
> This breaks building with gcc-4.2.1,

gcc-4.2.1 is an ancient compiler.  Good riddance.

>> Added: head/sys/libkern/x86/crc32_sse42.c
>>
>> ==============================================================================
>> --- /dev/null   00:00:00 1970   (empty, because file is newly added)
>> +++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017 (r313006)
>> +
>> +#include <nmmintrin.h>
>
> ...
>
> Inline asm is much less unportable than intrinsics.  kib used the correct
> method of .byte's in asms to avoid depending on assembler support for newer
> instructions.  .byte is still used for clflush on amd64 and i386.  It
> used to be used for invpcid on amd64.  I can't find where it is or was
> used for xsave stuff.

Konstantin predicted this complaint in code review (Phabricator).
Unfortunately, Clang does not automatically unroll loops of inline asm
statements, even with the correct mnemonic.  Unrolling is essential to
performance for buffers below the by-3 block size (768 bytes in this
implementation).  Hand unrolling in C seems to generate less efficient
assembly than the compiler's unrolling.
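
To make the comparison concrete, here is a minimal sketch of the two
forms of the inner loop.  It is not the committed code: the function
names (crc32c_asm, crc32c_intrins, crc32c_bitwise, crc32c) are
hypothetical, it leaves out the by-3 interleaving entirely, and it
assumes a GCC- or Clang-compatible compiler on amd64.  The bitwise
routine is only a portable reference and fallback.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>
#define CRC32C_HAVE_SSE42 1
#endif

/*
 * Portable bit-at-a-time CRC32C update (reflected polynomial 0x82F63B78).
 * Reference and fallback only; not one of the benchmarked variants.
 */
static uint32_t
crc32c_bitwise(uint32_t crc, const unsigned char *buf, size_t len)
{
	for (; len > 0; buf++, len--) {
		crc ^= *buf;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return (crc);
}

#ifdef CRC32C_HAVE_SSE42
/* Form 1: inline asm with the real mnemonic, one statement per word. */
static uint32_t
crc32c_asm(uint32_t crc, const unsigned char *buf, size_t len)
{
	uint64_t c = crc, w;

	for (; len >= 8; buf += 8, len -= 8) {
		memcpy(&w, buf, 8);	/* unaligned-safe 8-byte load */
		__asm__("crc32q %1, %0" : "+r" (c) : "rm" (w));
	}
	for (; len > 0; buf++, len--)
		__asm__("crc32b %1, %k0" : "+r" (c) : "rm" (*buf));
	return ((uint32_t)c);
}

/* Form 2: the same loop written with the SSE4.2 intrinsics. */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_intrins(uint32_t crc, const unsigned char *buf, size_t len)
{
	uint64_t c = crc, w;

	for (; len >= 8; buf += 8, len -= 8) {
		memcpy(&w, buf, 8);
		c = _mm_crc32_u64(c, w);
	}
	for (; len > 0; buf++, len--)
		c = _mm_crc32_u8((uint32_t)c, *buf);
	return ((uint32_t)c);
}
#endif

/* Full CRC32C with the conventional initial and final inversion. */
static uint32_t
crc32c(const unsigned char *buf, size_t len, int use_intrinsics)
{
	uint32_t crc = 0xFFFFFFFFu;

#ifdef CRC32C_HAVE_SSE42
	if (__builtin_cpu_supports("sse4.2"))
		crc = use_intrinsics ? crc32c_intrins(crc, buf, len) :
		    crc32c_asm(crc, buf, len);
	else
#endif
		crc = crc32c_bitwise(crc, buf, len);
	return (crc ^ 0xFFFFFFFFu);
}
```

Both forms issue the same crc32 instructions; the difference discussed
here is only in what the compiler is willing to do with the
surrounding loop.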

The left column below is the buffer size in bytes (hex).  The
measurements are nanoseconds per buffer, timed with CLOCK_VIRTUAL and
averaged over 10^5 loops.  These numbers do not vary by more than
+/- 1ns from run to run on my idle Sandy Bridge laptop.  "asm" uses
__asm__(), "intrins" uses the _mm_crc32 intrinsics that Clang can
unroll, and "multitable" is the older lookup-table implementation
(still used on other architectures).

0x000010: asm:0 intrins:0 multitable:0  (ns per buf)
0x000020: asm:7 intrins:9 multitable:78  (ns per buf)
0x000040: asm:10 intrins:7 multitable:50  (ns per buf)
0x000080: asm:15 intrins:9 multitable:91  (ns per buf)
0x000100: asm:25 intrins:17 multitable:178  (ns per buf)
0x000200: asm:55 intrins:38 multitable:347  (ns per buf)
0x000400: asm:61 intrins:62 multitable:684  (ns per buf)

Both SSE4.2 implementations are at least 5x faster than the multitable
approach from 0x20 bytes up, so it is unreasonable not to make one of
them standard on x86 platforms.

The unrolled intrinsics are consistently faster than the non-unrolled
asm on buffers of 0x40-0x200 bytes.  At 0x400 bytes we pass the first
unroll-by-3 threshold and it stops mattering as much.

At 0x40 bytes, it is the difference between 6.4 GB/s (asm) and 9.1 GB/s
(intrinsics).  At 0x200 bytes, it is the difference between 9.3 GB/s
and 13.5 GB/s.  I think this justifies some minor ugliness.
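
For anyone checking the arithmetic: the GB/s figures are just the
buffer size divided by the ns-per-buf column, since one byte per
nanosecond is 10^9 bytes per second.  A small sketch of that
conversion (gb_per_sec is a hypothetical helper name, and GB here
means decimal gigabytes):

```c
/*
 * Convert a benchmark row to throughput.  bytes / ns is numerically
 * equal to decimal GB/s, because 1 byte/ns == 10^9 bytes/s.
 */
static double
gb_per_sec(unsigned long bytes, double ns_per_buf)
{
	return ((double)bytes / ns_per_buf);
}
```

For example, gb_per_sec(0x40, 10) gives 6.4 and gb_per_sec(0x200, 38)
gives roughly 13.5, matching the asm and intrins rows above.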

Best,
Conrad