svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

Thu Mar 2 13:54:47 UTC 2017

On Wed, 1 Mar 2017, Conrad Meyer wrote:

> On Wed, Mar 1, 2017 at 9:27 PM, Bruce Evans <brde at optusnet.com.au> wrote:
>> On Wed, 1 Mar 2017, Conrad Meyer wrote:
>>
>>> On my laptop (Intel(R) Core(TM) i5-3320M CPU -- Ivy Bridge) I still see
>>> a little worse performance with this patch.  Please excuse the ugly
>>> graphs, I don't have a better graphing tool set up at this time:
>>>
>>> https://people.freebsd.org/~cem/crc32/sse42_bde.png
>>> https://people.freebsd.org/~cem/crc32/sse42_bde_log.png
>>
>> Try doubling the loop sizes.  There shouldn't be any significant difference
>> above size 3*LONG unless LONG is too small.  Apparently it is too small for
>> older CPUs.
>>
>> I now have a Sandybridge i5-2xxx laptop to test on, but don't have it set
>> up for much yet.
>
> Doubling the loop sizes seems to make it slightly worse, actually:
>
> https://people.freebsd.org/~cem/crc32/sse42_bde2.png
> https://people.freebsd.org/~cem/crc32/sse42_bde_log2.png
>
> I haven't made any attempt to inspect the generated assembly.  This is
> Clang 3.9.1 with -O2.

I tested on Sandybridge (i5-2540M) and get exactly the opposite results
with clang-3.9.0.  It is much slower with intrinsics.  Much slower than
gcc-4.2.1.  Perhaps a bug in one of the test programs (mine is enclosed).
Minimum times with low variance (+-10 msec) for "./z2 size 10" (100G total)
in seconds on idle system:

      buf_size:     512  3*512  4096  3*4096
--------------   -----  -----  ----  ------
./z2-bde-clang   10.57   8.36  6.85    6.58
./z2-bde-gcc     10.99   8.96  7.08    6.58
./z2-cur-clang   17.23  11.19  6.97    6.75

Oops, that was with MISALIGN = 1.  Also, I forgot to force alignment of buf,
but checked it was at 0x...40 in all cases.  Now with proper alignment:

      buf size:     512  3*512  4096  3*4096
--------------   -----  -----  ----  ------
./z2-bde-clang    8.96   6.56  6.62    6.42
./z2-bde-gcc      8.81   6.51  6.63    6.30
./z2-cur-clang   14.70   6.22  6.66    6.13
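
Forcing the alignment of buf, rather than checking its address by hand,
could look roughly like the following.  This is only a sketch: it assumes
a C11 compiler and a 64-byte cache line, and ALIGNMENT and misalign_of()
are names made up for illustration, not part of the enclosed test program.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define MISALIGN	0		/* 0 for aligned runs, 1 for misaligned */
#define SIZE		(1024 * 1024)
#define ALIGNMENT	64		/* assumed cache line size */

/* Force the start of buf onto an ALIGNMENT-byte boundary. */
static alignas(ALIGNMENT) uint8_t buf[MISALIGN + SIZE];

/* Offset of p from the previous ALIGNMENT-byte boundary (0 if aligned). */
static size_t
misalign_of(const void *p)
{
	return ((size_t)((uintptr_t)p % ALIGNMENT));
}
```

With this, misalign_of(&buf[MISALIGN]) equals MISALIGN, so the offset
passed to the function under test is known exactly instead of depending
on where the linker happened to place buf.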

The number of iterations is adjusted so that buf_size * num_iter = 100G.
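
The arithmetic can be sketched as below.  The 10G-per-pass constant and
the default of 10 passes are taken from the enclosed test program; the
helper names iters_per_pass() and total_bytes() are hypothetical.

```c
#include <stdint.h>

#define TOTAL_PER_PASS	10000000000LL	/* 10G bytes per pass */

/* Number of calls per pass for a given buffer size (integer division). */
static int64_t
iters_per_pass(int64_t buf_size)
{
	return (TOTAL_PER_PASS / buf_size);
}

/* Bytes actually processed over `repeat` passes. */
static int64_t
total_bytes(int64_t buf_size, int repeat)
{
	return (iters_per_pass(buf_size) * buf_size * repeat);
}
```

For buf_size 512 the division is exact (19531250 calls per pass), so 10
passes process exactly 100G.  For 3*512 it truncates to 6510416 calls,
which falls 1024 bytes short per pass; that is far below the +-10 msec
noise in the tables.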

This shows that clang-3.9.0 with intrinsics is doing lots of
rearrangement, which is very bad for the misaligned case and otherwise
helps for the multiple-of-3 cases (when the SHORT loop is null), and
otherwise is a small pessimization relative to no intrinsics, but beats
gcc, while gcc does almost none.  (I mostly tested with gcc -O3 and it
seemed equally good then.)  The function doesn't use __predict_ugly(),
and clang apparently uses this to optimize the alignment code at
great cost to the main loops when the alignment code executes (perhaps
it removes the alignment code?).  clang also does poorly with buf_size
512 in the aligned case.
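
For reference, a branch hint on an alignment prologue would look roughly
like this.  It is a sketch only, not the committed function: it uses a
bit-at-a-time software CRC-32C instead of the crc32 instruction so that it
runs anywhere, PREDICT_FALSE() is a local stand-in that assumes the
GCC/clang __builtin_expect() builtin, and crc32c_byte() and
crc32c_sketch() are made-up names.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for a "this is unlikely" hint; assumes GCC or clang. */
#define PREDICT_FALSE(exp)	__builtin_expect(!!(exp), 0)

/* One byte of CRC-32C, bit at a time (reflected polynomial 0x82f63b78). */
static uint32_t
crc32c_byte(uint32_t crc, uint8_t b)
{
	int k;

	crc ^= b;
	for (k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (0x82f63b78u & -(crc & 1));
	return (crc);
}

/* Alignment prologue, 8-byte main loop, then tail. */
static uint32_t
crc32c_sketch(uint32_t crc, const unsigned char *buf, size_t len)
{
	int i;

	/* Prologue: hinted as rare so the main loop gets the good layout. */
	while (PREDICT_FALSE(len > 0 && ((uintptr_t)buf & 7) != 0)) {
		crc = crc32c_byte(crc, *buf++);
		len--;
	}
	/* Main loop: buf is 8-byte aligned here. */
	while (len >= 8) {
		for (i = 0; i < 8; i++)
			crc = crc32c_byte(crc, buf[i]);
		buf += 8;
		len -= 8;
	}
	/* Tail: fewer than 8 bytes left. */
	while (len > 0) {
		crc = crc32c_byte(crc, *buf++);
		len--;
	}
	return (crc);
}
```

The hint changes only code layout and branch weights, not the result.
Whether it would stop clang from favoring the prologue in the real
function is untested here.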

Indeed, gcc is much better with -O3 (other flags -static [-msse4 for
intrins]).  clang does excessive optimizations by default, and -O3
makes no difference for it:

      buf size:     512  3*512  4096  3*4096
--------------   -----  -----  ----  ------
./z2-bde-clangO3  8.96   6.56  6.62    6.42
./z2-bde-gccO3    8.95   6.06  6.11    5.80
./z2-cur-clangO3 14.70   6.22  6.66    6.13

So we seem to be mainly testing uninteresting compiler pessimizations.
Eventually compilers will understand the code better and not rearrange
it very much (except for the alignment part).

I did a quick test with LONG = SHORT = 128 and gcc -O2.  This was just
slower, even for the ideal loop size of 4096*3 (up from 6.30 to 6.67
seconds).  This change just removes the LONG loop after renaming the
SHORT loop to LONG.  gcc apparently thinks it understands this simpler
version, and pessimizes it.  While testing, I did notice a pessimization
that is not the compiler's fault: when the crc32 instructions are optimized
at the expense of the crc update at the end of the loop, the loop gets out
of sync with the update and the wrong thing can stall.  The code has
subtleties to try to prevent this, but compilers don't really understand
this.  Compiler membars to control the ordering precisely were just
pessimizations.
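
A compiler-only barrier of the kind meant here can be sketched as follows.
This is an assumption about the general technique, not the exact
experiment: COMPILER_MEMBAR() and sum_blocks() are illustrative names,
and the empty asm with a "memory" clobber is a GCC/clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Compiler barrier: emits no instruction, assumes GCC or clang. */
#define COMPILER_MEMBAR()	__asm__ __volatile__("" : : : "memory")

/*
 * Inner loop accumulates one block; the update after it folds the block
 * result into a running total.  The barrier sits between the two.
 */
static uint64_t
sum_blocks(const uint64_t *p, size_t nblocks, size_t blocklen)
{
	uint64_t acc, total;
	size_t i, j;

	total = 0;
	for (i = 0; i < nblocks; i++) {
		acc = 0;
		for (j = 0; j < blocklen; j++)
			acc += p[i * blocklen + j];
		/*
		 * Stops the compiler from moving memory accesses across
		 * this point.  Values already in registers (acc, total)
		 * are not constrained by it.
		 */
		COMPILER_MEMBAR();
		total = total * 31 + acc;
	}
	return (total);
}
```

The barrier constrains only memory accesses and still limits what the
optimizer may do around it, so it is a blunt tool for ordering
register-only work such as the crc32 chains.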

X #include <stdint.h>
X #include <stdio.h>
X #include <stdlib.h>
X #include <string.h>
X 
X #define MISALIGN	1
X #define SIZE		(1024 * 1024)
X 
X uint8_t buf[MISALIGN + SIZE];
X 
X uint32_t sse42_crc32c(uint32_t, const unsigned char *, unsigned);
X 
X int
X main(int argc, char **argv)
X {
X 	size_t size;
X 	uint32_t crc;
X 	int i, j, limit, repeat;
X 
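X 	/* Default to the whole buffer; size must not exceed SIZE. */
X 	size = argc == 1 ? SIZE : atoi(argv[1]);
X 	/* 10G bytes per pass; the default 10 passes give the 100G total. */
X 	limit = 10000000000L / size;
X 	repeat = argc < 3 ? 10 : atoi(argv[2]);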
X 	for (i = 0; i < sizeof(buf); i++)
X 		buf[i] = rand();
X 	crc = 0;
X 	for (j = 0; j < repeat; j++)
X 		for (i = 0; i < limit; i++)
X 			crc = sse42_crc32c(crc, &buf[MISALIGN], size);
X 	printf("%#x\n", sse42_crc32c(0, &buf[MISALIGN], size));
X 	return (crc == 0 ? 0 : 1);
X }

Loops like this are not very representative of normal use, but I don't
know a better way.
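
The message does not say how the times were collected.  One in-process
way to report a minimum over several runs is sketched below.  It assumes
CPU time from the standard clock() is good enough on an idle system, and
min_time() and example_work() are hypothetical names.

```c
#include <time.h>

/* CPU seconds used so far, from the standard C clock(). */
static double
now_sec(void)
{
	return ((double)clock() / CLOCKS_PER_SEC);
}

/* Run fn(arg) `runs` times and return the minimum elapsed time. */
static double
min_time(void (*fn)(void *), void *arg, int runs)
{
	double best, dt, t0;
	int r;

	best = -1.0;
	for (r = 0; r < runs; r++) {
		t0 = now_sec();
		fn(arg);
		dt = now_sec() - t0;
		if (best < 0 || dt < best)
			best = dt;
	}
	return (best);
}

/* Placeholder workload: counts how often it was called. */
static void
example_work(void *arg)
{
	(*(int *)arg)++;
}
```

In a real run the workload would be the inner loop over sse42_crc32c()
from the test program.  Taking the minimum rather than the mean discards
runs disturbed by other activity.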

Bruce