bin/106734: [patch] SSE2 optimization for bzip2/libbz2
mi+kde at aldan.algebra.com
Tue Jan 9 16:00:34 PST 2007
The following reply was made to PR bin/106734; it has been noted by GNATS.
From: Mikhail Teterin <mi+kde at aldan.algebra.com>
To: Julian Seward <jseward at acm.org>
Cc: bug-followup at freebsd.org
Subject: Re: bin/106734: [patch] SSE2 optimization for bzip2/libbz2
Date: Tue, 9 Jan 2007 18:34:36 -0500
On Sunday 07 January 2007 00:08, Julian Seward wrote:
= > /* Load the bytes: */
= > n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
= > n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));
= > read beyond the end of the defined area of block. block is
= > defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think
= > you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2],
= > hence loading 15 bytes of garbage.
I don't think that's quite right... Instead of processing 8 bytes at a time,
as the non-SSE code is doing, I'm comparing 16 at a time. Thus it is possible
for me to be over by exactly 8 sometimes...
Anyway, the problem stemmed from my bumping i1 and i2 by 16 instead of 8
after the _initial check_ (which, in the quadrant-less case, should not need
to be separate at all, actually). Sometimes _that_ would bring them over... I
think the solution is either to bump up BZ_N_OVERSHOOT even further or to
check and adjust i1 and i2:
	if (i1 >= nblock)
		i1 -= nblock;
	if (i2 >= nblock)
		i2 -= nblock;
at the beginning, rather than at the end, of the loop. Having done that, I no
longer peek beyond the end of the block (according to gdb's conditional
breakpoints, at least).
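
To illustrate the idea (this is only a sketch, not the code from the patch):
the loop below compares two rotations of the block 16 bytes at a time and
wraps i1 and i2 at the top of each iteration, so every 16-byte read starts
below nblock. The names chunked_gt, CHUNK and OVERSHOOT are made up for this
example, OVERSHOOT merely stands in for BZ_N_OVERSHOOT, and memcmp() stands in
for the SSE2 load-and-compare so that it builds anywhere. It assumes the
overshoot area repeats the first bytes of the block.

```c
#include <assert.h>
#include <string.h>

#define CHUNK     16  /* bytes compared per step, like one SSE2 register */
#define OVERSHOOT 16  /* hypothetical stand-in for BZ_N_OVERSHOOT; must be >= CHUNK */

/*
 * Returns 1 if the rotation of block starting at i1 sorts after the one
 * starting at i2, and 0 otherwise.
 * block must hold nblock + OVERSHOOT bytes, with block[nblock + k] equal
 * to block[k] for every k < OVERSHOOT.
 */
static int chunked_gt(const unsigned char *block, unsigned nblock,
                      unsigned i1, unsigned i2)
{
    int budget = (int)nblock + CHUNK;  /* enough steps to go around once */

    assert(nblock >= CHUNK);           /* one subtraction must suffice to wrap */

    while (budget > 0) {
        /* Wrap first, so that the reads below never start past the block. */
        if (i1 >= nblock)
            i1 -= nblock;
        if (i2 >= nblock)
            i2 -= nblock;

        /* With the indices wrapped, a CHUNK-byte read stays in bounds. */
        assert(i1 + CHUNK <= nblock + OVERSHOOT);
        assert(i2 + CHUNK <= nblock + OVERSHOOT);

        int c = memcmp(block + i1, block + i2, CHUNK);
        if (c != 0)
            return c > 0;

        i1 += CHUNK;
        i2 += CHUNK;
        budget -= CHUNK;
    }
    return 0;  /* the rotations are identical */
}
```

With the wrap done at the end of the loop instead, indices that arrive
already past nblock (from the initial check, for example) would be used for
one read before being corrected.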
Please check the new patch: http://aldan.algebra.com/~mi/bz/blocksort-SSE2-patch-2
P.S. The following gdb-script is what I used. Run as:
	gdb -x x.txt bzip2

	cond 1 (i1 > nblock) || (i2 > nblock)
	run -9 < /tmp/PLIST > /dev/null

Adjust the compression level and the input's location, and be sure to have
blocksort.o compiled with debug information, of course...