svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

Bruce Evans brde at optusnet.com.au
Tue Jan 31 15:37:34 UTC 2017


On Tue, 31 Jan 2017, Bruce Evans wrote:

> On Mon, 30 Jan 2017, Conrad Meyer wrote:
>
>> On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans <brde at optusnet.com.au> wrote:
>>> On Tue, 31 Jan 2017, Conrad E. Meyer wrote:
>>> 
>>>> Log:
>>>>  calculate_crc32c: Add SSE4.2 implementation on x86
>>> 
>>> This breaks building with gcc-4.2.1,
>> 
>> gcc-4.2.1 is an ancient compiler.  Good riddance.
>
> I prefer it.

It also works better on ordinary asms for crc32.

>>>> Added: head/sys/libkern/x86/crc32_sse42.c
>>>> 
>>>> ==============================================================================
>>>> --- /dev/null   00:00:00 1970   (empty, because file is newly added)
>>>> +++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017
>>>> (r313006)
>>>> +
>>>> +#include <nmmintrin.h>
>>> 
>>> ...
>>> 
>>> Inline asm is much less unportable than intrinsics.  kib used the correct
>>> method of .byte's in asms to avoid depending on assembler support for 
>>> newer
>>> instructions.  .byte is still used for clflush on amd64 and i386.  It
>>> used to be used for invpcid on amd64.  I can't find where it is or was
>>> used for xsave stuff.
>> 
>> Konstantin predicted this complaint in code review (phabricator).
>> Unfortunately, Clang does not automatically unroll asms, even with the
>> correct mnemonic.  Unrolling is essential to performance below the
>> by-3 block size (768 bytes in this implementation).  Hand unrolling in
>> C seems to generate less efficient assembly than the compiler's
>> unrolling.
>
> Unrolling is almost completely useless if the instruction takes 3 cycles
> like you said it does.  The loop calculations run much faster than that,
> so they can run in parallel.  However, clang generates horrible code for
> inline asms instead of intrinsics.

The unrolling was worse than useless (see previous reply).  It is the
horrible code in the non-intrinsics case that seems to be the main source
of differences (see below).
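
As an aside on the .byte method mentioned in the quoted text above: for
crc32 it would look roughly like this (just a sketch, with the registers
pinned so that the hand-encoded ModRM byte is right; not any code that
was actually committed):

static __inline uint64_t
crc32q_byte(uint64_t x, uint64_t y)
{
	/* crc32q %rdx,%rax, encoded by hand so old assemblers accept it. */
	__asm(".byte 0xf2, 0x48, 0x0f, 0x38, 0xf1, 0xc2" : "+a" (x) : "d" (y));
	return (x);
}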

>> The left column below is block size.  The measurements are nanoseconds
>> per buf, per CLOCK_VIRTUAL, averaged over 10^5 loops.  These numbers
>> do not vary more than +/- 1ns run to run on my idle Sandy Bridge
>> laptop.  "asm" is using __asm__(), "intrins" using the _mm_crc32
>> intrinsics that Clang can unroll, and multitable is the older
>> lookup-table implementation (still used on other architectures).
>> 
>> 0x000010: asm:0 intrins:0 multitable:0  (ns per buf)
>> 0x000020: asm:7 intrins:9 multitable:78  (ns per buf)
>> 0x000040: asm:10 intrins:7 multitable:50  (ns per buf)
>> 0x000080: asm:15 intrins:9 multitable:91  (ns per buf)
>> 0x000100: asm:25 intrins:17 multitable:178  (ns per buf)
>> 0x000200: asm:55 intrins:38 multitable:347  (ns per buf)
>> 0x000400: asm:61 intrins:62 multitable:684  (ns per buf)
>> 
>> Both implementations are superior to the multitable approach, so it is
>> unreasonable not to make one of them standard on x86 platforms.
>> 
>> The unrolled intrinsics are consistently better than not unrolled on
>> objects 0x40-0x200 bytes large.  At 0x400 bytes we pass the first
>> unroll-by-3 threshold and it stops mattering as much.
>> 
>> At 0x40 bytes, it is the difference between 6.4 GB/s and 9.1 GB/s.  At
>> 0x200 bytes, it is the difference between 9.3 GB/s and 13.5 GB/s.  I
>> think this justifies some minor ugliness.
>
> If it matters, which I doubt, then the compiler shouldn't be trusted
> for unrolling.  It can only do better than handwritten unrolling if
> it has tables for the correct amount of unrolling for every instruction
> for thousands of CPUs (since such tables are too hard to write manually).

It is the 3-way unrolling that makes the main difference.  clang is
clueless about unrolling in this function, but this 3-way unrolling is
in the C code.  It is to reduce dependencies.  The timing seems to be
as I suspected -- the crc32 instruction has a latency of 3 and a throughput
of 3 per 3 cycles, so with 3 dependent instructions it runs 3 times slower
than with 3 independent instructions.  Naive unrolling gives more dependent
instructions, so it is just as slow as no unrolling.

Test program for this:

X #include <stdint.h>
X #include <string.h>
X 
X #ifndef __clang__

Change __clang__ here to __clang__X (any undefined name) to kill the
intrinsics for clang and force the inline asm versions.

X static __inline uint32_t
X _mm_crc32_u8(uint32_t x, uint8_t y)
X {
X 	__asm("crc32b %1,%0" : "+r" (x) : "rm" (y));
X 	return (x);
X }
X 
X static __inline uint32_t __unused
X _mm_crc32_u32(uint32_t x, int32_t y)
X {
X 	__asm("crc32l %1,%0" : "+r" (x) : "rm" (y));
X 	return (x);
X }
X 
X #ifdef __amd64__
X static __inline uint64_t
X _mm_crc32_u64(uint64_t x, int64_t y)
X {
X 	__asm("crc32q %1,%0" : "+r" (x) : "rm" (y));
X 	return (x);
X }
X #endif
X #else
X #include <nmmintrin.h>
X #endif
X 
X #define MISALIGN	0

MISALIGN 1 costs about 4% (0.36 seconds instead of 0.34 seconds on
freefall's Xeon).

X #define SIZE 		1000000

The large size of 1M is to bust at least the L1 cache.  This makes little
difference, since the speed is CPU-bound (0.34 seconds @ 3.3 GHz = ~34/33
cycles per crc32 instruction).  Actually, I'm not sure about the speeds.
freefall's machdep.tsc_freq is 3.3GHz, but the speed could be anything
due to turbo mode and frequency control (boot messages give similarly
useless values for the actual speed).  The test accesses 10g bytes, or
1.25g crc32 instructions, so if the instructions take 1 cycle each
that is 0.38 seconds.  I think I'm measuring the CPU speed and not L2
cache speed since SIZE 1000 fits in L1 and gives the same speed.

X 
X uint8_t buf[MISALIGN + SIZE];
X 
X int
X main(void)
X {
X 	uint64_t crc1, crc2, crc3, crc4, crc5, crc6, crc7, crc8;
X 	int i, j;
X 
X 	crc1 = 0;
X 	crc2 = 0;
X 	crc3 = 0;
X 	for (i = 0; i < 10000; i++)
X 		for (j = MISALIGN; j < MISALIGN + SIZE; j += 24) {
X 			crc1 = _mm_crc32_u64(crc1, *(uint64_t *)&buf[j + 0]);
X 			crc2 = _mm_crc32_u64(crc2, *(uint64_t *)&buf[j + 8]);
X 			crc3 = _mm_crc32_u64(crc3, *(uint64_t *)&buf[j + 16]);

I tried up to 8-way unrolling, with up to 8 crcN's or 8 crc steps into
crc1.  8 crcNs are no better than 3, and the common crc1 is always 3 times
slower.  (A sketch of that dependent variant follows the program.)

X 		}
X 	return (crc1 + crc2 + crc3 == 0 ? 0 : 1);

Sloppy combination to get the result used.

X }
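
For contrast, the naive dependent unrolling mentioned above (every crc32
step feeding crc1) is the variant that runs 3 times slower.  A sketch of
its inner loop, as a drop-in replacement in the test program above:

		for (j = MISALIGN; j < MISALIGN + SIZE; j += 24) {
			crc1 = _mm_crc32_u64(crc1, *(uint64_t *)&buf[j + 0]);
			crc1 = _mm_crc32_u64(crc1, *(uint64_t *)&buf[j + 8]);
			crc1 = _mm_crc32_u64(crc1, *(uint64_t *)&buf[j + 16]);
		}

Here every crc32q depends on the previous one, so the 3-cycle latency is
fully exposed instead of being hidden by 3 independent chains.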

gcc generates the natural code for the inner loop and runs in 0.34 seconds:

G 	.p2align 4,,7
G .L3:
G #APP
G 	crc32q buf(%rdx),%rsi
G 	crc32q buf+8(%rdx),%rcx
G 	crc32q buf+16(%rdx),%rax
G #NO_APP
G 	addq	$24, %rdx
G 	cmpq	$1008, %rdx
G 	jne	.L3

clang with intrinsics unrolls twice and runs in 0.34 seconds:

C 	.p2align	4, 0x90
C .LBB0_2:                                # %for.body3
C                                         #   Parent Loop BB0_1 Depth=1
C                                         # =>  This Inner Loop Header: Depth=2
C 	crc32q	buf(%rdi), %rcx
C 	crc32q	buf+8(%rdi), %rsi
C 	crc32q	buf+16(%rdi), %rdx
C 	crc32q	buf+24(%rdi), %rcx
C 	crc32q	buf+32(%rdi), %rsi
C 	crc32q	buf+40(%rdi), %rdx
C 	addq	$48, %rdi
C 	cmpq	$1000, %rdi             # imm = 0x3E8
C 	jl	.LBB0_2

clang without intrinsics generates a mess and runs in 0.43 seconds:

C 	.p2align	4, 0x90
C .LBB0_2:                                # %for.body3
C                                         #   Parent Loop BB0_1 Depth=1
C                                         # =>  This Inner Loop Header: Depth=2
C 	movq	buf(%rdi), %rax
C 	movq	%rax, -24(%rbp)
C 	#APP
C 	crc32q	-24(%rbp), %rcx
C 	#NO_APP
C 	movq	buf+8(%rdi), %rax
C 	movq	%rax, -16(%rbp)
C 	#APP
C 	crc32q	-16(%rbp), %rsi
C 	#NO_APP
C 	movq	buf+16(%rdi), %rax
C 	movq	%rax, -8(%rbp)
C 	#APP
C 	crc32q	-8(%rbp), %rdx
C 	#NO_APP
C 	addq	$24, %rdi
C 	cmpq	$1000, %rdi             # imm = 0x3E8
C 	jl	.LBB0_2

Editing just this inner loop to the natural code gives a speed of 0.34 seconds.

freefall's Xeon is just short of resources to execute the extra instructions.
Using 4 independent crcN's didn't help, but using 8 gave a time of 0.36
seconds.

This is very MD, but my Haswell runs at the same speed in cycles as
freefall's Xeon.

I just noticed that the kernel code has plenty of comments about the 3/1
latency/throughput and related optimizations.

Unrolling (or not) may be helpful or harmful for entry and exit code.  I
think there should be no alignment on entry -- just assume the buffer is
aligned in the usual case, and only run 4% slower when it is misaligned.

The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
bytes or up to 3 times that like simpler algorithms.  768 is quite
large, and the exit code is quite slow.  It reduces 8 or 4 bytes at a
time without any dependency reduction, and then 1 byte at a time.  I
don't understand the algorithm for joining crcs -- why doesn't it work
to reduce to 12 or 24 bytes in the main loop?
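
For concreteness, the shape of exit code being described -- 8 bytes at a
time on a single dependent chain, then 1 byte at a time -- is roughly the
following on amd64, using the same _mm_crc32_* wrappers as the test program
above (only a sketch with made-up names, not the kernel's actual code):

static uint32_t
crc32c_tail(uint32_t crc, const unsigned char *p, size_t len)
{
	/* 8 bytes at a time, all on one dependent chain. */
	for (; len >= 8; len -= 8, p += 8)
		crc = (uint32_t)_mm_crc32_u64(crc, *(const uint64_t *)p);
	/* Then 1 byte at a time. */
	for (; len > 0; len--, p++)
		crc = _mm_crc32_u8(crc, *p);
	return (crc);
}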

Your benchmarks mainly give results for sizes <= 768 bytes, where most of
the manual optimizations don't apply.  Compiler optimizations are more
likely to help there.  So I looked more closely at the last 2 loops.
clang indeed only unrolls the last one, only for the unreachable case
with more than 8 bytes on amd64.  Unrolling the previous one might be
useful, but slight extra complexity apparently prevents this.  In my
test program, the hard-coded sizes allow the compiler to at least know
that the unrolling is of reachable code.

Bruce

