svn commit: r256132 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs

Bruce Evans brde at optusnet.com.au
Wed Nov 6 03:45:18 UTC 2013


On Tue, 5 Nov 2013, John-Mark Gurney wrote:

> Andriy Gapon wrote this message on Tue, Nov 05, 2013 at 12:12 +0200:
>> on 08/10/2013 04:38 Xin LI said the following:
>>> Author: delphij
>>> Date: Tue Oct  8 01:38:24 2013
>>> New Revision: 256132
>>> URL: http://svnweb.freebsd.org/changeset/base/256132
>>>
>>> Log:
>>>   Improve lzjb decompress performance by reorganizing the code
>>>   to tighten the copy loop.
>>>
>>>   Submitted by:	Denis Ahrens <denis h3q com>
>>>   MFC after:	2 weeks
>>>   Approved by:	re (gjb)
>>>
>>> Modified:
>>>   head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c
>>>
>>> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c
>>> ==============================================================================
>>> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c	Mon Oct  7 22:30:03 2013	(r256131)
>>> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c	Tue Oct  8 01:38:24 2013	(r256132)
>>> @@ -117,7 +117,9 @@ lzjb_decompress(void *s_start, void *d_s
>>>  			src += 2;
>>>  			if ((cpy = dst - offset) < (uchar_t *)d_start)
>>>  				return (-1);
>>> -			while (--mlen >= 0 && dst < d_end)
>>> +			if (mlen > (d_end - dst))
>>> +				mlen = d_end - dst;
>>> +			while (--mlen >= 0)
>>>  				*dst++ = *cpy++;
>>>  		} else {
>>>  			*dst++ = *src++;
>>>
>>
>> Intuitively it might seem that this change is indeed an improvement.
>> But given how "not dumb" (not sure if that always amounts to smart) the modern
>> compilers are, has anyone actually measured if this change indeed improves the
>> performance?
>>
>> Judging from the conversations on the ZFS mailing lists, this change even hurt
>> performance in some environments.
>> Looks like upstream is not going to take this change.
>>
>> So does it make any sense for us to make the code more obscure and different
>> from upstream?  Unless performance benefits could indeed be demonstrated.
>
> The old code compiles to:
>   62c13:       49 f7 de                neg    %r14
>   62c16:       44 8d 58 ff             lea    -0x1(%rax),%r11d
>   62c1a:       4c 89 d3                mov    %r10,%rbx
>   62c1d:       0f 1f 00                nopl   (%rax)
>   62c20:       42 8a 14 33             mov    (%rbx,%r14,1),%dl
>   62c24:       88 13                   mov    %dl,(%rbx)
>   62c26:       48 ff c3                inc    %rbx
>   62c29:       ff c8                   dec    %eax
>   62c2b:       85 c0                   test   %eax,%eax
>   62c2d:       7f f1                   jg     62c20 <lzjb_decompress+0xa0>
>
> The load/store line is addresses 62c20-62c29...  So it looks like clang
> is already doing the optimization that was added...

It is unlikely to make a difference, but this is fairly bad source and
object code:
- byte-at-a-time accesses.  The loop just copies data.  Even calling
   memcpy might be faster.  Compilers used to inline memcpy too much,
   but they now know that they don't understand memory, so the call
   would probably be external, though it should be inlined iff the size
   of the copy is small.
- the source code asks for, and gets, a slow signed comparison.  The
   compiler is apparently not able to optimize it away, and the extra
   instruction at 62c2b is needed.
- rewrites to change the loop termination condition are often to change it
   from (--v >= 0) to (--v == 0).  This isn't done here, so there is still
   the extra instruction at 62c2b.  I don't like signed variables; this
   optimization often occurs automatically because size_t is used for
   byte counts and it is unsigned.  mlen is apparently signed and not
   size_t, else the optimization would occur automatically (if you write
   ">= 0" for an unsigned variable it makes no sense, and the compiler
   will translate the test to "== 0"); see the C sketch after this list.
- oops, the extra instruction at 62c2b is not needed.  The compiler could
   have used "subl $1,%eax" instead of "decl %eax" to set the sign flags.
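
To put the termination-condition point in C terms, here is a sketch of my
own (not the lzjb.c code): with an unsigned (size_t) count the natural
continue condition is "len != 0", which the flags set by the decrement
already answer, so no separate compare instruction is needed.

	/*
	 * Sketch only: bytewise copy with an unsigned count.  The loop
	 * test is against 0, so on x86 the sub/dec that counts down can
	 * double as the test.
	 */
	#include <stddef.h>

	static void
	copy_bytes(unsigned char *dst, const unsigned char *src, size_t len)
	{
		while (len != 0) {
			*dst++ = *src++;
			len--;
		}
	}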

A minimal bytewise copying loop on 64-bit x86 looks like:

1:
 	movb	(%rsi,%rcx,1),%al	# from top down
 	movb	%al,(%rdi,%rcx,1)
 	decq	%rcx
 	jne	1b			# end with last byte not copied
 	...	# copy last byte now, or modify end condition

4 instructions instead of 6.  But this probably makes little difference on
out-of-order execution (OOE) x86.

This can use subl too, to allow using jg so that no extra step is needed
to copy the last byte.  Using decl is a good space optimization on 32-bit
values, but IIRC decq is not a single-byte instruction, so it is not such
a good space optimization and switching to subl loses less space.

This already copies from the top down so as to allow an end condition
of nearly %rcx == 0.  Perhaps the pointers and counters can be adjusted
a bit more to get to exactly %rcx == 0, but this probably has more
setup overheads.  Getting the end condition to be exactly '--v == 0'
tends to take more C code too.  Compilers know about this problem, and
sometimes generate an extra step outside of the loop even if you
structure the loop unnaturally to try to avoid this.
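
In C, that pointer/counter adjustment looks roughly like this (my own
sketch with made-up names, assuming a plain non-overlapping copy): index
from -len up to 0, so the loop ends on exactly "i == 0" and no fixup step
for the last byte is needed.

	#include <stddef.h>

	/* Sketch only: the end condition is exactly "i == 0". */
	static void
	copy_bytes_neg(unsigned char *dst, const unsigned char *src, size_t len)
	{
		/* Point both pointers one past the end, walk a negative index up. */
		dst += len;
		src += len;
		for (ptrdiff_t i = -(ptrdiff_t)len; i != 0; i++)
			dst[i] = src[i];
	}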

The loop could also be unrolled a bit.  ISTR noticing clang doing automatic
unrolling.  Perhaps this is the main thing affected by manual optimizations.
They might be harmful because they confuse the compiler into not doing
automatic unrolling, or harmful because they help the compiler do
automatic unrolling when it shouldn't, or helpful...
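
For concreteness, minor manual unrolling of the same bytewise copy might
look like the sketch below (again my own illustration, and exactly the
kind of rewrite that may either help or just fight the compiler's own
unroller):

	#include <stddef.h>

	/* Sketch only: copy 4 bytes per iteration, then mop up the tail. */
	static void
	copy_bytes_unrolled(unsigned char *dst, const unsigned char *src, size_t len)
	{
		while (len >= 4) {
			dst[0] = src[0];
			dst[1] = src[1];
			dst[2] = src[2];
			dst[3] = src[3];
			dst += 4;
			src += 4;
			len -= 4;
		}
		while (len != 0) {
			*dst++ = *src++;
			len--;
		}
	}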

In my copy benchmarks, I didn't find reducing the loop overhead by
adjusting only the count register to be very useful, since loop overhead
is easy to hide in the src/dst access overhead.  When there are cache
misses for src/dst, loop overhead is in the noise, so all of this is
only interesting for the L1-cached case.  It usually takes only minor
(but often some) unrolling to amortize the loop overhead in the
L1-cached case.

Bruce

