bcopy/memmove optimization broken ? [looks like you are correct to me, I give supporting detail]

Alexandre Martins alexandre.martins at stormshield.eu
Mon Feb 13 08:27:41 UTC 2017


Le vendredi 10 février 2017, 18:11:23 Mark Millard a écrit :
> On 2017-Feb-10, at 7:51 AM, Alexandre Martins <alexandre.martins at 
stormshield.eu> wrote:
> > Hi all
> > 
> > I see in the kernel code of bcopy/memmove an optimization if the copied
> > region don't overlap:
> > 
> > https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=annotate#l
> > 403
> > 
> > I'm a newbie in ARM assembly, so, sorry if i'm wrong, but
> > - R0 is the dst (1st parameter)
> > - R1 is the src (2nd parameter)
> > 
> > So :
> > subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */
> > subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */
> > cmp r3, r2 /* if (r3 < len) we have an overlap */
> > bcc PIC_SYM(_C_LABEL(memcpy), PLT)
> > 
> > Seems to be inverted. Should it be that ? :
> > subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */
> > subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */
> > cmp r3, r2 /* if (r3 < len) we have an overlap */
> > bcs PIC_SYM(_C_LABEL(memcpy), PLT)
> > 
> > 
> > Best regards
> 
> I did not expect something so central that has been
> around so long to look wrong but. . .
> 
> Going through the supporting details of the original
> code based on my looking around here is what I found:
> 
> #include	<string.h>
> void *memmove(void *dest, const void *src, size_t n);
> 
> So I'd expect (c'ish notation):
> r0==dest
> r1==src
> r2==n
> 
> Then for (the comments vs. code is being challenged):
> (The comments do seem to give the intent correctly.)
> 
> cmp     r0, r1
> RETeq           /* Bail now if src/dst are the same */
> subcc   r3, r0, r1      /* if (dst > src) r3 = dst - src */
> subcs   r3, r1, r0      /* if (src > dsr) r3 = src - dst */
> cmp     r3, r2          /* if (r3 < len) we have an overlap */
> bcc     PIC_SYM(_C_LABEL(memcpy), PLT)
> . . .
> 
> cmp r0,r1 should result in condition code (c'ish notation):
> 
> eq=(r0==r1)
> cc=(r0<r1)
> cs=(r0>=r1)
> 
> (Only the r0 position has no immediate-value alternative.)
> 
> 
> subcc r3,r0,r1 is: if (cc) r3=r0-r1 // no condition code change
> In other words: if (dst<src) r3=dst-src
> 
> So it does not match the test listed in the comment as
> far as I can see. And in (mathematical) integer arithmetic
> the r3 result would be negative for dst-src. For
> non-negative arithmetic (natural or whole): undefined.
> 
> 
> subcs r3,r1,r0 is: if (cs) r3=r1-r0 // no condition code change
> In other words: if (dst>=src) r3=src-dst
> 
> So it does not match the test listed in the comment as
> far as I can see. And in (mathematical) integer arithmetic
> the r3 result would be nonpositive for src-dst. But the
> earlier RETeq prevents the zero case, so: negative. For
> non-negative arithmetic (natural or whole): undefined.
> 
> 
> If it was only a normal mathemetical context r3=-abs(dst-src)
> would be a summary of the two sub instruction sequence as it
> is from what I can tell.
> 
> For the purpose the summary should be: r3=abs(dst-src), given
> dst!=src . There is no need to wonder outside normal
> mathematical non-negative arithmetic in the process either.
> 
> Your code change would have the right summary and use only
> normal mathematical rules from what I can tell:
> 
> cmp     r0, r1
> RETeq           /* Bail now if src/dst are the same */
> subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */
> subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */
> cmp r3, r2 /* if (r3 < len) we have an overlap */
> bcs PIC_SYM(_C_LABEL(memcpy), PLT)
> . . .
> 
> subcs r3,r0,r1 is: if (cs) r3=r0-r1 // no condition code change
> In other words: if (dst>=src) r3=dst-src.
> Given the prior RETeq, that is effectively: if (dst>src) r3=dst-src.
> And that matches the comments and would produce a positive result
> for r3, matching the normal mathematical result.
> 
> subcc r3,r1,r0 is: if (cc) r3=r1-r0 // no condition code change
> In other words: if (dst<src) r3=src-dst
> And that matches the comments and would produce a positive result
> for r3, matching the normal mathematical result.
> 
> Overall summary of the two updated sub?? instructions:
> r3=abs(dst-src), given dst!=src.
> 
> And that would make for an appropriate comparison to n (i.e., to r2).
> 
> It appears to have been as it is now since -r143175 when memmove was
> added to sys/arm/arm/support.S (2005-Apr-12).
> 
> 
> ===
> Mark Millard
> markmi at dsl-only.net



Thank you for this deep anaysis !

I also made some benchmark. It seems that the "Xscale" version of 
memcpy/memmove is slower that the standard "ARM" on my platform (armada388).

I do the change by undefine the _ARM_ARCH_5E.
#define _ARM_ARCH_5E => Xscale version
#undef _ARM_ARCH_5E => "ARM" version

There is my results:

Block size: 2048
memcpy (Kernel ARM)                  :   1028.7 MB/s
memmove (Kernel ARM)               :    616.5 MB/s
memcpy (Kernel xscale)               :    920.1 MB/s
memmove (Kernel xscale)            :    618.8 MB/s

Block size: 128
memcpy (Kernel ARM)                  :   1018.5 MB/s
memmove (Kernel ARM)               :    668.4 MB/s
memcpy (Kernel xscale)               :    825.9 MB/s
memmove (Kernel xscale)            :    668.6 MB/s

Block size: 64
memcpy (Kernel ARM)                  :    892.9 MB/s
memmove (Kernel ARM)               :    667.2 MB/s
memcpy (Kernel xscale)               :    721.2 MB/s
memmove (Kernel xscale)            :    668.2 MB/s

Block size: 32
memcpy (Kernel ARM)                  :    620.6 MB/s
memmove (Kernel ARM)               :    634.6 MB/s
memcpy (Kernel xscale)               :    504.9 MB/s
memmove (Kernel xscale)            :    634.5 MB/s

Block size; 16
memcpy (Kernel ARM)                  :    471.8 MB/s
memmove (Kernel ARM)               :    464.5 MB/s
memcpy (Kernel xscale)               :    254.5 MB/s
memmove (Kernel xscale)            :    464.7 MB/s

As you can see, more the size of the memcpy is small, more the standard (ARM) 
version of memcpy is faster.

In addition, the libc version suffer the same problem, but is 15% more efficiant.

What can I do to help you on this point ?

Best regards
-- 
Alexandre Martins
STORMSHIELD

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2874 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-arm/attachments/20170213/501fa93e/attachment.bin>


More information about the freebsd-arm mailing list