svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

Wed Feb 1 10:42:41 UTC 2017

On Tue, 31 Jan 2017, Conrad Meyer wrote:

> On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans <brde at optusnet.com.au> wrote:
>> Another reply to this...
>>
>> On Tue, 31 Jan 2017, Conrad Meyer wrote:
>>
>>> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde at optusnet.com.au> wrote:
>>>>
>>>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>>>> I
>>>> think there should by no alignment on entry -- just assume the buffer is
>>>> aligned in the usual case, and only run 4% slower when it is misaligned.
>>>
>>> Please write such a patch and demonstrate the improvement.
>>
>> It is easy to demonstrate.  I just put #if 0 around the early alignment
>> code.  The result seem too good to be true, so maybe I missed some
>> later dependency on alignment of the addresses:
>> - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
>>   alignment and 1.02 seconds without alignment.
>> This actually makes sense, 128 bytes can be done with 16 8-byte unaligned
>> crc32q's.  The alignment code makes it do 15 * 8-but and (5 + 3) * 1-byte.
>> 7 more 3-cycle instructions and overhead too is far more than the cost
>> of letting the CPU do read-combining.
>> - for 4096-byte buffers, the difference is insignificant (0.47 seconds for
>>   10g.
>
> I believe it, especially for newer amd64.  I seem to recall that older
> x86 machines had a higher misalignment penalty, but it was largely
> reduced in (?)Nehalem.  Why don't you go ahead and commit that change?

Needs more work.

Here are fairly clean patches for the unportabilities.  I can't really
test these in the kernel.  Merge the asms with yours if yours are better
in parts.

X Index: conf/files.amd64
X ===================================================================
X --- conf/files.amd64	(revision 313007)
X +++ conf/files.amd64	(working copy)
X @@ -536,6 +536,9 @@
X  isa/vga_isa.c			optional	vga
X  kern/kern_clocksource.c		standard
X  kern/link_elf_obj.c		standard
X +libkern/x86/crc32_sse42.c	standard
X +libkern/memmove.c		standard
X +libkern/memset.c		standard
X  #
X  # IA32 binary support
X  #
X @@ -593,14 +596,6 @@
X  compat/ndis/subr_usbd.c		optional	ndisapi pci
X  compat/ndis/winx64_wrap.S	optional	ndisapi pci
X  #
X -crc32_sse42.o			standard				\
X -	dependency	"$S/libkern/x86/crc32_sse42.c"			\
X -	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
X -	no-implicit-rule						\
X -	clean		"crc32_sse42.o"
X -libkern/memmove.c		standard
X -libkern/memset.c		standard
X -#
X  # x86 real mode BIOS emulator, required by dpms/pci/vesa
X  #
X  compat/x86bios/x86bios.c	optional x86bios | dpms | pci | vesa

This also fixes unsorting by crc32_sse42.o and nearby unsorting by all other
libkern lines (just 2) in files.amd64.  This file is still grossly unsorted
near the end.

X Index: conf/files.i386
X ===================================================================
X --- conf/files.i386	(revision 313007)
X +++ conf/files.i386	(working copy)
X @@ -554,11 +554,6 @@
X  kern/imgact_aout.c		optional compat_aout
X  kern/imgact_gzip.c		optional gzip
X  kern/subr_sfbuf.c		standard
X -crc32_sse42.o			standard				\
X -	dependency	"$S/libkern/x86/crc32_sse42.c"			\
X -	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
X -	no-implicit-rule						\
X -	clean		"crc32_sse42.o"
X  libkern/divdi3.c		standard
X  libkern/ffsll.c			standard
X  libkern/flsll.c			standard
X @@ -569,6 +564,7 @@
X  libkern/ucmpdi2.c		standard
X  libkern/udivdi3.c		standard
X  libkern/umoddi3.c		standard
X +libkern/x86/crc32_sse42.c	standard
X  i386/xbox/xbox.c		optional xbox
X  i386/xbox/xboxfb.c		optional xboxfb
X  dev/fb/boot_font.c		optional xboxfb

This also fixes unsorting by crc32_sse42.o.  files.i386 is not as unsorted
as files.i386.

The rules for sorting generated files are unclear.  They are not in source
directories so are nor naturally grouped.  Old ones are mostly placed near
the beginning.  Some are sorted on a pathname of the source file(s) in
scattered places in the make rules.  crc32_sse42.o is partly like that.
It was sorted with libkern sources but not by the x86 part.

X Index: libkern/x86/crc32_sse42.c
X ===================================================================
X --- libkern/x86/crc32_sse42.c	(revision 313007)
X +++ libkern/x86/crc32_sse42.c	(working copy)
X @@ -31,15 +31,41 @@
X   */
X  #ifdef USERSPACE_TESTING
X  #include <stdint.h>
X +#include <stdlib.h>

Fix dependency on namespace pollution in <nmmintrin.h> for at least size_t.

X  #else
X  #include <sys/param.h>
X +#include <sys/systm.h>
X  #include <sys/kernel.h>
X -#include <sys/libkern.h>
X -#include <sys/systm.h>
X  #endif

Fix order of systm.h and not using its standard pollution.

X 
X -#include <nmmintrin.h>
X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X +	/*
X +	 * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
X +	 * significantly and "r" (y) a lot by copying y to a different
X +	 * local variable (on the stack or in a register), so only use
X +	 * the latter.  This costs a register and an instruction but
X +	 * not a uop.
X +	 */
X +	__asm("crc32b %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}
X 
X +static __inline uint32_t
X +_mm_crc32_u32(uint32_t x, int32_t y)
X +{
X +	__asm("crc32l %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}
X +
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, int64_t y)
X +{
X +	__asm("crc32q %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY	0x82f63b78
X

>> I hoped that an even shorter SHORT case would work.  I think it now handles
>> 768 bytes (3 * SHORT) in the inner loop.
>
> Right.
>
>> That is 32 sets of 3 crc32q's,
>> and I would have thought that update at the end would take about as long
>> as 1 iteration (3%), but it apparently takes 33%.
>
> The update at the end may be faster with PCLMULQDQ.  There are
> versions of this algorithm that use that in place of the lookup-table
> combine (for example, Linux has a permissively licensed implementation
> here: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> ).
>
> Unfortunately, PCLMULQDQ uses FPU state, which is inappropriate most
> of the time in kernel mode.  It could be used opportunistically if the
> thread is already in FPU-save mode or if the input is "big enough" to
> make it worth it.

That rules it out for most kernel purposes.

I will try to optimize the multiplication.

>>> 0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
>>> 0x000800: asm:132 intrins:133  (ns per buf)
>>> 0x002000: asm:449 intrins:446  (ns per buf)
>>> 0x008000: asm:1501 intrins:1497  (ns per buf)
>>> 0x020000: asm:5618 intrins:5609  (ns per buf)
>>>
>>> (All routines are in a separate compilation unit with no full-program
>>> optimization, as they are in the kernel.)
>>
>> These seem slow.  I modified my program to test the actual kernel code,
>> and get for 10gB on freefall's Xeon (main times in seconds):
>
> Freefall has a Haswell Xeon @ 3.3GHz.  My laptop is a Sandy Bridge
> Core i5 @ 2.6 GHz.  That may help explain the difference.

I thought that Sandy Bridge would be almost as fast here.

ref11-i386's lesser Xeon is 5.5 times slower in cycles for a block size
of 128K and 3.7 times slower for a block size of 128.  i386 is naturally
twice as slow.

ref11-amd64's lesser Xeon has the same speed as freefall.  So there must
be a bug in the i386 code... it is simply that gcc doesn't optimize the
buggy uint64_t type for the crcs in the 32-bit case, but clang does.
After fixing this, the i386 case is only slightly more than twice as
slow.  Untested fix:

Y Index: libkern/x86/crc32_sse42.c
Y ===================================================================
Y --- libkern/x86/crc32_sse42.c	(revision 313007)
Y +++ libkern/x86/crc32_sse42.c	(working copy)
Y @@ -190,7 +216,11 @@
Y  	const size_t align = 4;
Y  #endif
Y  	const unsigned char *next, *end;
Y -	uint64_t crc0, crc1, crc2;      /* need to be 64 bits for crc32q */
Y +#ifdef __amd64__
Y +	uint64_t crc0, crc1, crc2;
Y +#else
Y +	uint32_t crc0, crc1, crc2;
Y +#endif
Y 
Y  	next = buf;
Y  	crc0 = crc;

Bruce