svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern

Bruce Evans brde at optusnet.com.au
Wed Feb 1 00:45:02 UTC 2017


On Tue, 31 Jan 2017, Conrad Meyer wrote:

> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde at optusnet.com.au> wrote:
>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>> Unrolling (or not) may be helpful or harmful for entry and exit code.
>
> Helpful, per my earlier benchmarks.
>
>> I
>> think there should be no alignment on entry -- just assume the buffer is
>> aligned in the usual case, and only run 4% slower when it is misaligned.
>
> Please write such a patch and demonstrate the improvement.

I now understand the algorithm.

The division into 3 has to preserve sequentiality (since the CRC input
is a bit string), so it doesn't work to separate the memory accesses of
the 3 crc32's by 8 bytes as in my simple test -- they have to be
separated by large amounts.  Then recombining requires multiplying the
polynomials associated with the CRCs of the higher 2 blocks by X^N and
reducing again, where N is large and related to the block size.  This is
done using a large table for each N needed, and to keep things
reasonably simple, only 2 N's are used.
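
A minimal sketch of that main loop, assuming a crc32c_shift_long()
helper that applies the "multiply by X^N" step via the precomputed
tables (the helper name and the LONG block size here are illustrative,
not the committed names):

#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>		/* SSE4.2 intrinsics; build with -msse4.2 */

#define LONG	8192		/* bytes per block; 3 blocks per pass */

/* Assumed helper: table-driven multiplication of the polynomial
 * associated with crc by X^(8*LONG), reduced mod the CRC-32C poly. */
extern uint32_t crc32c_shift_long(uint32_t crc);

static uint32_t
crc32c_3way(uint32_t crc, const unsigned char *buf, size_t len)
{
	const unsigned char *next = buf;
	uint64_t crc0 = crc, crc1, crc2;

	while (len >= 3 * LONG) {
		const unsigned char *end = next + LONG;

		crc1 = 0;
		crc2 = 0;
		/* 3 independent dependency chains, one per block. */
		do {
			crc0 = _mm_crc32_u64(crc0,
			    *(const uint64_t *)next);
			crc1 = _mm_crc32_u64(crc1,
			    *(const uint64_t *)(next + LONG));
			crc2 = _mm_crc32_u64(crc2,
			    *(const uint64_t *)(next + 2 * LONG));
			next += 8;
		} while (next < end);
		/* Recombine: shift crc0 past block 1 and fold in crc1,
		 * then shift past block 2 and fold in crc2. */
		crc0 = crc32c_shift_long((uint32_t)crc0) ^ crc1;
		crc0 = crc32c_shift_long((uint32_t)crc0) ^ crc2;
		next += 2 * LONG;
		len -= 3 * LONG;
	}
	/* ... SHORT-block and tail handling elided ... */
	return ((uint32_t)crc0);
}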

>> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
>> bytes or up to 3 times that like simpler algorithms.  768 is quite
>> large, and the exit code is quite slow.  It reduces 8 or 4 bytes at a
>> time without any dependency reduction, and then 1 byte at a time.
>
> Yes, this is the important loop to unroll for small inputs.  Somehow

Not the way clang does it.  Unrolling is useless without the 3-way blocking.

> with the unrolling, it is only ~19% slower than the by-3 algorithm on
> my system — not 66%.  Clang 3.9.1 unrolls both of these trailing
> loops; here is the first:

Maybe 3.9.1 pessimizes the 3-way loop by unrolling it.  This would be
fairly easy to do.  Just replicate the 3 crc32q's a few times, say 8,
and do them in a bad order (3 blocks of 8 dependent ones instead of
8 blocks of 3 independent ones).  With enough replication, the code
would be too large for the hardware to reorder.  Inline asm has
another advantage here -- volatile on it prevents the compiler from
reordering it and might prevent unrolling.
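
For reference, a sketch of such an inline-asm wrapper (modeled on the
usual crc32q wrappers; the volatile is the detail in question):

#include <stdint.h>

static inline uint64_t
crc32q_vol(uint64_t crc, uint64_t data)
{
	/* volatile asks the compiler not to reorder or replicate the
	 * instruction, which blocks the pessimization described above. */
	__asm __volatile("crc32q %1, %0" : "+r" (crc) : "rm" (data));
	return (crc);
}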

Maybe 3.9.1 unpessimizes the inline asms.

But I suspect that the failure to get the full 3-times speedup has
another cause.


>
>   0x0000000000401b88 <+584>:   cmp    $0x38,%rbx
>   0x0000000000401b8c <+588>:   jae    0x401b93 <sse42_crc32c+595>
>   0x0000000000401b8e <+590>:   mov    %rsi,%rdx
>   0x0000000000401b91 <+593>:   jmp    0x401be1 <sse42_crc32c+673>
>   0x0000000000401b93 <+595>:   lea    -0x1(%rdi),%rbx
>   0x0000000000401b97 <+599>:   sub    %rdx,%rbx
>   0x0000000000401b9a <+602>:   mov    %rsi,%rdx
>   0x0000000000401b9d <+605>:   nopl   (%rax)
>   0x0000000000401ba0 <+608>:   crc32q (%rdx),%rax
>   0x0000000000401ba6 <+614>:   crc32q 0x8(%rdx),%rax
>   0x0000000000401bad <+621>:   crc32q 0x10(%rdx),%rax
>   0x0000000000401bb4 <+628>:   crc32q 0x18(%rdx),%rax
>   0x0000000000401bbb <+635>:   crc32q 0x20(%rdx),%rax
>   0x0000000000401bc2 <+642>:   crc32q 0x28(%rdx),%rax
>   0x0000000000401bc9 <+649>:   crc32q 0x30(%rdx),%rax
>   0x0000000000401bd0 <+656>:   crc32q 0x38(%rdx),%rax
>   0x0000000000401bd7 <+663>:   add    $0x40,%rdx
>   0x0000000000401bdb <+667>:   add    $0x8,%rbx
>   0x0000000000401bdf <+671>:   jne    0x401ba0 <sse42_crc32c+608>

No, this unrolling is useless.  The crc32q's are dependent on each other,
so they take 3 cycles each.  There are spare resources to run about 12
instructions during that time; loop control takes only 3 of them.
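
The difference is easy to see side by side (a sketch, assuming the
usual 3-cycle latency and 1-per-cycle throughput for crc32q):

#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>

/* Dependent chain: each crc32q must wait for the previous result, so
 * 8 words cost about 8 * 3 = 24 cycles no matter how far the loop is
 * unrolled. */
static uint64_t
crc_dependent(uint64_t crc, const uint64_t *p)
{
	for (int i = 0; i < 8; i++)
		crc = _mm_crc32_u64(crc, p[i]);
	return (crc);
}

/* Independent chains: the 3 crc32q's per iteration don't depend on
 * each other, so one can issue every cycle and each chain's 3-cycle
 * latency is hidden by the other two. */
static void
crc_independent(uint64_t *c0, uint64_t *c1, uint64_t *c2,
    const uint64_t *p, size_t stride, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++) {
		*c0 = _mm_crc32_u64(*c0, p[i]);
		*c1 = _mm_crc32_u64(*c1, p[i + stride]);
		*c2 = _mm_crc32_u64(*c2, p[i + 2 * stride]);
	}
}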

>> I
>> don't understand the algorithm for joining crcs -- why doesn't it work
>> to reduce to 12 or 24 bytes in the main loop?
>
> It would, but I haven't implemented or tested that.  You're welcome to
> do so and demonstrate an improvement.  It does add more lookup table
> bloat, but perhaps we could just remove the 3x8k table — I'm not sure
> it adds any benefit over the 3x256 table.
>
>> Your benchmarks mainly give results for the <= 768 bytes where most of
>> the manual optimizations don't apply.

Actually, they test only the large-buffer case.  They used buffer sizes
of 1M and 1K and didn't exercise the entry and exit code that usually
dominates for small buffers.

I re-tested with the correct blocking.  This was about 10% slower
(0.34 -> 0.37 seconds for 10GB), except for clang without intrinsics,
where it was 20% slower (0.43 -> 0.51 seconds).

> 0x000400: asm:68 intrins:62 multitable:684  (ns per buf)

I don't see any signs of this in my test:
- a single crc32q in a (C) loop doesn't benefit from unrolling, nor
   does it lose to the extra clang instructions without intrinsics.
   clang-3.9.0 unrolls this 8-way in the simpler environment of my
   test program, but this makes no difference.
- similarly for a single crc32b in a loop, except that when I forgot
   to change the type of the crc accumulator from uint64_t to uint32_t,
   gcc was 1 cycle slower in the loop (4 cycles instead of 3).  gcc
   generates an extra instruction to zero-extend the crc, and this is
   more expensive than usual since it adds another dependency (see the
   sketch after this list).  clang optimizes this away.
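
A sketch of the byte loop in question (crc32c_bytes() is an
illustrative name, not the committed one):

#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>

/* With a uint32_t accumulator, crc32b's implicit zero-extension is
 * enough and gcc's loop body is 3 cycles.  With a uint64_t
 * accumulator, gcc inserts an explicit zero-extend on the critical
 * path, lengthening the dependent chain to 4 cycles; clang optimizes
 * the extension away either way. */
static uint32_t
crc32c_bytes(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
		crc = _mm_crc32_u8(crc, *p++);
	return (crc);
}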

> 0x000800: asm:132 intrins:133  (ns per buf)
> 0x002000: asm:449 intrins:446  (ns per buf)
> 0x008000: asm:1501 intrins:1497  (ns per buf)
> 0x020000: asm:5618 intrins:5609  (ns per buf)

Now the time is spent mostly in the 3-way optimized case and the
differences are in the noise.

> (All routines are in a separate compilation unit with no full-program
> optimization, as they are in the kernel.)
>
>> Compiler optimizations are more
>> likely to help there.  So I looked more closely at the last 2 loops.
>> clang indeed only unrolls the last one,
>
> Not in 3.9.1.

3.9.1 seems only to extend the useless unrolling.

>> only for the unreachable case
>> with more than 8 bytes on amd64.
>
> How is it unreachable?

Because the loop doing 8-byte words at a time reduces the count below 8.
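
That is, the tail looks roughly like this (a sketch), so the byte loop
can never see 8 or more remaining bytes:

#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>

static uint32_t
crc32c_tail(uint64_t crc, const unsigned char *p, size_t len)
{
	/* The word loop leaves len < 8 ... */
	while (len >= 8) {
		crc = _mm_crc32_u64(crc, *(const uint64_t *)p);
		p += 8;
		len -= 8;
	}
	/* ... so an unrolled >= 8-byte path in this byte loop is
	 * unreachable on amd64. */
	while (len--)
		crc = _mm_crc32_u8((uint32_t)crc, *p++);
	return ((uint32_t)crc);
}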

Bruce

