svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

Bruce Evans brde at optusnet.com.au
Wed Feb 1 03:16:54 UTC 2017


Another reply to this...

On Tue, 31 Jan 2017, Conrad Meyer wrote:

> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde at optusnet.com.au> wrote:
>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>> Unrolling (or not) may be helpful or harmful for entry and exit code.
>
> Helpful, per my earlier benchmarks.
>
>> I
>> think there should be no alignment on entry -- just assume the buffer is
>> aligned in the usual case, and only run 4% slower when it is misaligned.
>
> Please write such a patch and demonstrate the improvement.

It is easy to demonstrate.  I just put #if 0 around the early alignment
code.  The results seem too good to be true, so maybe I missed some
later dependency on alignment of the addresses:
- for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
   alignment and 1.02 seconds without alignment.
This actually makes sense: 128 bytes can be done with 16 8-byte unaligned
crc32q's.  The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
That is 7 more 3-cycle instructions, plus overhead, which costs far more
than letting the CPU do read-combining on the unaligned accesses.
- for 4096-byte buffers, the difference is insignificant (0.47 seconds for
   10g).
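
To make "no early alignment" concrete, here is a minimal sketch (not the
committed sys/libkern/x86 code, and ignoring the 3-way striding): just
issue unaligned 8-byte crc32q's and mop up the tail with byte steps, so a
128-byte buffer takes exactly 16 crc32q's:

#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>		/* SSE4.2 intrinsics; build with -msse4.2 */

static uint32_t
crc32c_noalign(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t crc64 = crc;
	uint64_t word;

	/* No 1-byte alignment steps up front; let the CPU combine reads. */
	for (; len >= 8; len -= 8, p += 8) {
		__builtin_memcpy(&word, p, sizeof(word));  /* unaligned load */
		crc64 = _mm_crc32_u64(crc64, word);
	}
	crc = (uint32_t)crc64;
	for (; len > 0; len--, p++)
		crc = _mm_crc32_u8(crc, *p);
	return (crc);
}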

>> I
>> don't understand the algorithm for joining crcs -- why doesn't it work
>> to reduce to 12 or 24 bytes in the main loop?
>
> It would, but I haven't implemented or tested that.  You're welcome to
> do so and demonstrate an improvement.  It does add more lookup table
> bloat, but perhaps we could just remove the 3x8k table — I'm not sure
> it adds any benefit over the 3x256 table.

Good idea, but the big table is useful.  Ifdefing out the LONG case slows
large buffers from ~0.35 seconds to ~0.43 seconds in the setup below.
Ifdefing out the SHORT case only slows them to ~0.39 seconds.  I hoped
that an even shorter SHORT case would work.  I think it now handles
768 bytes (3 * SHORT) in the inner loop.  That is 32 sets of 3 crc32q's,
and I would have thought that the update at the end would take about as
long as 1 iteration (3%), but it apparently takes 33%.
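
For reference, the shape of the 3-way stride and of the update at the end,
as a rough sketch rather than the committed code (same headers as the
sketch above); crc32c_shift() is a deliberately slow stand-in for the
precomputed shift tables -- it advances a raw crc32c state over n zero
bytes, which is what the table lookups compute:

#define	SHORT	256	/* bytes per chain per block, as discussed */

static uint32_t
crc32c_shift(uint32_t crc, size_t nzeros)
{
	while (nzeros-- > 0)
		crc = _mm_crc32_u8(crc, 0);
	return (crc);
}

static uint32_t
crc32c_3way_block(uint32_t crc, const unsigned char *p)
{
	uint64_t c0 = crc, c1 = 0, c2 = 0;
	uint64_t w0, w1, w2;
	int i;

	/*
	 * 32 sets of 3 crc32q's.  The 3 chains are independent, so the
	 * CPU overlaps their 3-cycle latencies.
	 */
	for (i = 0; i < SHORT; i += 8) {
		__builtin_memcpy(&w0, p + i, 8);
		__builtin_memcpy(&w1, p + SHORT + i, 8);
		__builtin_memcpy(&w2, p + 2 * SHORT + i, 8);
		c0 = _mm_crc32_u64(c0, w0);
		c1 = _mm_crc32_u64(c1, w1);
		c2 = _mm_crc32_u64(c2, w2);
	}
	/* The update at the end: shift the leading chains forward and XOR. */
	return (crc32c_shift((uint32_t)c0, 2 * SHORT) ^
	    crc32c_shift((uint32_t)c1, SHORT) ^ (uint32_t)c2);
}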

>> ...
>> Your benchmarks mainly give results for the <= 768 bytes where most of
>> the manual optimizations don't apply.
>
> 0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
> 0x000800: asm:132 intrins:133  (ns per buf)
> 0x002000: asm:449 intrins:446  (ns per buf)
> 0x008000: asm:1501 intrins:1497  (ns per buf)
> 0x020000: asm:5618 intrins:5609  (ns per buf)
>
> (All routines are in a separate compilation unit with no full-program
> optimization, as they are in the kernel.)

These seem slow.  I modified my program to test the actual kernel code,
and got the following for 10GB on freefall's Xeon (main times in seconds):

0x000008: asm(rm):3.41 asm(r):3.07 intrins:6.01 gcc:3.74  (3S = 2.4ns/buf)
0x000010: asm(rm):2.05 asm(r):1.70 intrins:2.92 gcc:2.62  (2S = 3.2ns/buf)
0x000020: asm(rm):1.63 asm(r):1.58 intrins:1.62 gcc:1.61  (1.6S = 5.12ns/buf)
0x000040: asm(rm):1.07 asm(r):1.11 intrins:1.06 gcc:1.14  (1.1S = 7.04ns/buf)
0x000080: asm(rm):1.02 asm(r):1.04 intrins:1.03 gcc:1.04  (1.02S = 13.06ns/buf)
0x000100: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.08  (1.02S = 52.22ns/buf)
0x000200: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.02  (1.02S = 104.45ns/buf)
0x000400: asm(rm):0.58 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 116.43ns/buf)
0x001000: asm(rm):0.62 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 233.44ns/buf)
0x002000: asm(rm):0.48 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 376.83ns/buf)
0x004000: asm(rm):0.49 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 753.66ns/buf)
0x008000: asm(rm):0.49 asm(r):0.38 intrins:0.38 gcc:0.38  (.38S = 1245.18ns/buf)
0x010000: asm(rm):0.47 asm(r):0.38 intrins:0.36 gcc:0.38  (.36S = 2359.30ns/buf)
0x020000: asm(rm):0.43 asm(r):1.05 intrins:0.35 gcc:0.36  (.35S = 4587.52ns/buf)
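
(The harness is roughly of this shape -- a sketch, not the exact program;
sse42_crc32c() is the kernel function under test and its prototype here
is from memory:)

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define	TOTAL	10000000000ULL		/* 10GB pushed through each size */

extern uint32_t sse42_crc32c(uint32_t, const unsigned char *, unsigned);

int
main(void)
{
	static unsigned char buf[0x20000];
	struct timespec t0, t1;
	uint64_t done;
	uint32_t crc = 0;
	unsigned size;
	double secs;

	memset(buf, 0x5a, sizeof(buf));
	for (size = 8; size <= sizeof(buf); size *= 2) {
		crc = ~0u;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (done = 0; done < TOTAL; done += size)
			crc = sse42_crc32c(crc, buf, size);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		secs = (t1.tv_sec - t0.tv_sec) +
		    (t1.tv_nsec - t0.tv_nsec) * 1e-9;
		printf("0x%06x: %.2f s (%.2f ns/buf)\n",
		    size, secs, secs * 1e9 / (TOTAL / (double)size));
	}
	return (crc != 0);	/* keep the result live */
}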

asm(r) is a fix for clang's slowness with inline asm.  Just change the
constraint from "rm" to "r".  This takes an extra register, but no more
uops.

This is for the aligned case with no hacks.
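
The constraint change looks roughly like this (a sketch, not the exact
asm in the code being measured):

static inline uint64_t
crc32q_reg(uint64_t crc, uint64_t data)
{

	/*
	 * "rm" lets clang pick memory, which it does too eagerly; "r"
	 * forces a register.
	 */
	__asm__("crc32q %1, %0" : "+r" (crc) : "r" (data));
	return (crc);
}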

intrins does something bad for small buffers, probably just the branch
over the dead unrolling.  Even twice 2.4ns/buf for 8-byte buffers is still
very fast.  This is 16 cycles: 3 cycles to do 1 crc32q and the rest mainly
for 1 function call and too many branches.

Bruce

