state of kernel core primitives in aarch64

Mateusz Guzik mjguzik at gmail.com
Mon Feb 22 19:20:46 UTC 2021


I took a quick look and it appears there is performance left on the
table. Similar remarks probably apply to userspace.

First some cleanup: bzero, bcmp and bcopy are all defined as builtins
mapping to memset, memcmp and memmove, but the kernel still provides
them. Arguably both bzero and bcmp can be a little faster than memset
(by knowing upfront the target is to be zeroed) and memcmp (by only
having to note that a difference exists instead of computing what it
is). If such optimizations are significant on arm, the builtins should
be changed at least on that arch. As it happens, clang provides
__builtin_bzero, which falls back to calling the relevant routine if
necessary.
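
For concreteness, switching the builtins could look roughly like this
(a sketch of the idea, not a tested patch; the current macros live in
sys/sys/systm.h if memory serves):

/* before (what is described above): */
#define	bzero(buf, len)		__builtin_memset((buf), 0, (len))
#define	bcmp(b1, b2, len)	__builtin_memcmp((b1), (b2), (len))

/* after: let the compiler emit calls to the dedicated routines */
#define	bzero(buf, len)		__builtin_bzero((buf), (len))
#define	bcmp(b1, b2, len)	__builtin_bcmp((b1), (b2), (len))

With that in place a dedicated arm64 bzero/bcmp would actually get
used instead of everything being funneled into memset/memcmp.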

Regardless of the above, all routines seem to be slower than they need
to be, at least when I compare them to the non-SIMD code in
https://github.com/ARM-software/optimized-routines/tree/master/string/aarch64

As a simple test I called access(2) in a loop on
/usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c, running on an ARM
Neoverse-N1 r3p1. This copies the string from userspace with
copyinstr and compares each path component (usr, obj and so on) with
memcmp. According to dtrace[1], both copying and comparing are at the
top of the profile.
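
The loop is roughly of the following shape (just a sketch to show the
shape of the test):

#include <unistd.h>

int
main(void)
{
	const char *p =
	    "/usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c";

	/* hammer path lookup; each call walks all the components */
	for (;;)
		(void)access(p, F_OK);
}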

You can prod me on irc regarding hardware and benchmark code.

[1] dtrace seems to return a bogus result where sampling on
instructions reports the return address instead; the conclusion above
was drawn with that in mind.

-- 
Mateusz Guzik <mjguzik gmail.com>

