svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Conrad Meyer
cem at freebsd.org
Wed Feb 1 03:48:02 UTC 2017
On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans <brde at optusnet.com.au> wrote:
> Another reply to this...
>
> On Tue, 31 Jan 2017, Conrad Meyer wrote:
>
>> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde at optusnet.com.au> wrote:
>>>
>>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>>> I
>>> think there should be no alignment on entry -- just assume the buffer is
>>> aligned in the usual case, and only run 4% slower when it is misaligned.
>>
>>
>> Please write such a patch and demonstrate the improvement.
>
>
> It is easy to demonstrate. I just put #if 0 around the early alignment
> code. The results seem too good to be true, so maybe I missed some
> later dependency on alignment of the addresses:
> - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
> alignment and 1.02 seconds without alignment.
> This actually makes sense, 128 bytes can be done with 16 8-byte unaligned
> crc32q's. The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
> Seven more 3-cycle instructions, plus the overhead, cost far more than
> letting the CPU do read-combining.
> - for 4096-byte buffers, the difference is insignificant (0.47 seconds for
> 10g).
I believe it, especially for newer amd64. I seem to recall that older
x86 machines had a higher misalignment penalty, but it was largely
reduced in (?)Nehalem. Why don't you go ahead and commit that change?
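To make the two strategies concrete, here is a plain-C sketch (a slow bitwise reference, not the kernel's actual SSE4.2 code, which uses the crc32q instruction for each 8-byte step); the function names are invented for illustration. Both variants produce the same CRC, so the alignment prologue is purely a performance question:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC32C step (reflected Castagnoli polynomial 0x82F63B78);
 * a slow reference standing in for the hardware crc32b/crc32q ops. */
uint32_t crc32c_byte(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
	return (crc);
}

/* No alignment prologue: run straight through the buffer. */
uint32_t crc32c_noalign(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len-- > 0)
		crc = crc32c_byte(crc, *p++);
	return (crc);
}

/* With an alignment prologue: consume single bytes until the pointer
 * is 8-byte aligned, then run the main loop (one crc32q per 8 bytes
 * in the real code) on aligned words.  The result is identical. */
uint32_t crc32c_align(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (((uintptr_t)p & 7) != 0 && len > 0) {
		crc = crc32c_byte(crc, *p++);
		len--;
	}
	while (len >= 8) {		/* would be one crc32q each */
		for (int i = 0; i < 8; i++)
			crc = crc32c_byte(crc, p[i]);
		p += 8;
		len -= 8;
	}
	while (len-- > 0)
		crc = crc32c_byte(crc, *p++);
	return (crc);
}
```

For a 128-byte buffer misaligned by 3, the prologue turns 16 unaligned 8-byte steps into 3 + 5 byte steps plus 15 aligned 8-byte steps, which is the extra work Bruce measured.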
>> perhaps we could just remove the 3x8k table — I'm not sure
>> it adds any benefit over the 3x256 table.
>
>
> Good idea, but the big table is useful. Ifdefing out the LONG case reduces
> the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the
> setup below. Ifdefing out the SHORT case only reduces to ~0.39 seconds.
Interesting.
> I hoped that an even shorter SHORT case would work. I think it now handles
> 768 bytes (3 * SHORT) in the inner loop.
Right.
> That is 32 sets of 3 crc32q's,
> and I would have thought that update at the end would take about as long
> as 1 iteration (3%), but it apparently takes 33%.
The update at the end may be faster with PCLMULQDQ. There are
versions of this algorithm that use it in place of the lookup-table
combine (for example, Linux has a permissively licensed implementation
here: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S ).
Unfortunately, PCLMULQDQ uses FPU state, which is inappropriate most
of the time in kernel mode. It could be used opportunistically if the
thread is already in FPU-save mode or if the input is "big enough" to
make it worth it.
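For what it's worth, the combine step needs neither the lookup table nor PCLMULQDQ: it can also be done with a GF(2) matrix-power trick, the same scheme zlib's crc32_combine() uses for the IEEE polynomial. A hypothetical sketch adapted to the Castagnoli polynomial (crc32c_combine() and crc32c_sw() are names invented here, not kernel API):

```c
#include <stdint.h>
#include <stddef.h>

#define CRC32C_POLY	0x82F63B78u	/* reflected Castagnoli polynomial */

/* Slow bitwise CRC32C with the usual pre/post-inversion. */
uint32_t crc32c_sw(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len-- > 0) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY : 0u);
	}
	return (crc ^ 0xFFFFFFFFu);
}

/* Multiply a GF(2) 32x32 matrix (one column per word) by a vector. */
static uint32_t gf2_matrix_times(const uint32_t *mat, uint32_t vec)
{
	uint32_t sum = 0;

	for (; vec != 0; vec >>= 1, mat++)
		if (vec & 1u)
			sum ^= *mat;
	return (sum);
}

static void gf2_matrix_square(uint32_t *sq, const uint32_t *mat)
{
	for (int n = 0; n < 32; n++)
		sq[n] = gf2_matrix_times(mat, mat[n]);
}

/* crc32c(A || B) from crc1 = crc32c(A), crc2 = crc32c(B), len2 = len(B);
 * the same scheme as zlib's crc32_combine(), with the CRC32C polynomial. */
uint32_t crc32c_combine(uint32_t crc1, uint32_t crc2, size_t len2)
{
	uint32_t even[32], odd[32];

	if (len2 == 0)
		return (crc1);
	odd[0] = CRC32C_POLY;			/* operator for one zero bit */
	for (int n = 1; n < 32; n++)
		odd[n] = 1u << (n - 1);
	gf2_matrix_square(even, odd);		/* two zero bits */
	gf2_matrix_square(odd, even);		/* four zero bits */
	do {					/* append len2 zero bytes to crc1 */
		gf2_matrix_square(even, odd);
		if (len2 & 1)
			crc1 = gf2_matrix_times(even, crc1);
		len2 >>= 1;
		if (len2 == 0)
			break;
		gf2_matrix_square(odd, even);
		if (len2 & 1)
			crc1 = gf2_matrix_times(odd, crc1);
		len2 >>= 1;
	} while (len2 != 0);
	return (crc1 ^ crc2);
}
```

This avoids FPU state entirely, though the repeated matrix squarings make it attractive only when the combine length varies; for the fixed stride of the 3-way loop, the precomputed table is cheaper.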
>>> Your benchmarks mainly give results for the <= 768 bytes where most of
>>> the manual optimizations don't apply.
>>
>>
>> 0x000400: asm:68 intrins:62 multitable:684 (ns per buf)
>> 0x000800: asm:132 intrins:133 (ns per buf)
>> 0x002000: asm:449 intrins:446 (ns per buf)
>> 0x008000: asm:1501 intrins:1497 (ns per buf)
>> 0x020000: asm:5618 intrins:5609 (ns per buf)
>>
>> (All routines are in a separate compilation unit with no full-program
>> optimization, as they are in the kernel.)
>
>
> These seem slow. I modified my program to test the actual kernel code,
> and get for 10gB on freefall's Xeon (main times in seconds):
Freefall has a Haswell Xeon @ 3.3GHz. My laptop is a Sandy Bridge
Core i5 @ 2.6 GHz. That may help explain the difference.
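The harness producing those ns-per-buf numbers isn't shown in the thread; as a rough userland sketch of how such figures can be measured (bench_ns_per_buf() and the slow bitwise stand-in are invented for illustration, not the routines actually benchmarked):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Slow bitwise CRC32C stand-in for the routine under test. */
static uint32_t crc32c_ref(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len-- > 0) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
	}
	return (crc);
}

/* Time `iters` CRCs of a `len`-byte buffer; return ns per buffer. */
double bench_ns_per_buf(size_t len, int iters)
{
	struct timespec t0, t1;
	volatile uint32_t sink = 0;	/* keep the calls from being elided */
	uint8_t *buf;
	double ns;

	if ((buf = malloc(len)) == NULL)
		return (-1.0);
	memset(buf, 0xA5, len);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++)
		sink ^= crc32c_ref(~0u, buf, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	free(buf);
	(void)sink;
	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	return (ns / iters);
}
```

Per-buffer timings like this are dominated by the CPU's clock and microarchitecture, which is why a 3.3 GHz Haswell and a 2.6 GHz Sandy Bridge disagree even on identical code.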
Best,
Conrad
More information about the svn-src-all mailing list