SSE in libthr

Tomoaki AOKI junchoon at
Sat Mar 28 06:15:48 UTC 2015

If SIMD instructions are used for string processing, and FPU (AVX)
contexts are NOT saved/restored properly on process (thread) switching,
the string being processed could be corrupted by another process (thread).
Couldn't that be a security risk? (Broken string parameters for syscalls, etc.)

If so, FPU (AVX) contexts should be saved/restored at least on process
(thread) switching.

 *If SIMD instructions are NOT used in kernel and kernel modules at all,
  there would be no need for saving/restoring FPU contexts on

This is not limited to system libraries. As Alan noted, third-party
applications can use their own string processing code using SIMD.
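
To make the concern concrete, here is a minimal sketch of the kind of SIMD
string code an application might carry itself. It is an illustration only,
not the libc or libthr code: the function name sse2_strlen_sketch is made
up, it assumes an x86 CPU with SSE2, and it relies on the GCC/Clang builtin
__builtin_ctz. While the loop runs, the string bytes sit in an XMM register,
which is exactly the state that has to survive a thread switch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example, not FreeBSD's implementation.
 * Returns the length of the NUL-terminated string s.
 */
size_t sse2_strlen_sketch(const char *s)
{
    const char *p = s;

    /* Scalar prologue: step byte by byte until p is 16-byte aligned. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    const __m128i zero = _mm_setzero_si128();

    for (;;) {
        /*
         * An aligned 16-byte load cannot cross a page boundary, so
         * reading a few bytes past the terminator does not fault on
         * real hardware (strictly speaking the C standard does not
         * bless this over-read; it is the usual SIMD strlen trick).
         */
        __m128i chunk = _mm_load_si128((const __m128i *)p);

        /* One bit per byte, set where the byte equals zero. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));

        if (mask != 0) {
            /* Lowest set bit is the offset of the first NUL byte. */
            return (size_t)(p - s) + (size_t)__builtin_ctz((unsigned)mask);
        }
        p += 16;
    }
}
```

Calling sse2_strlen_sketch("hello") should return the same value as
strlen("hello"). If the XMM register holding chunk were overwritten by
another thread between the load and the compare, the mask, and so the
returned length, could be wrong.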

On Fri, 27 Mar 2015 17:43:14 -0700
Adrian Chadd <adrian at> wrote:

> On 27 March 2015 at 16:03, Alan Somers <asomers at> wrote:
> > On Fri, Mar 27, 2015 at 4:36 PM, Adrian Chadd <adrian at> wrote:
> >> hi,
> >>
> >> please don't try to microoptimise crap like strlen().
> >>
> >> The TL;DR for performant high-throughput code is: if strlen() or
> >> memcpy() is the thing that's costing you the most, you're doing it
> >> wrong.
> >>
> >>
> >>
> >> -adrian
> >
> > I respectfully disagree.  A well-optimized libc will benefit
> > _every_single_program_ that uses strlen.  That includes Apache, Samba,
> > Memcached, Quake, and basically every single program that every single
> > FreeBSD user uses.  There's no reason that 3rd party software
> > maintainers should have to rewrite basic libc functions in order to
> > get decent performance on FreeBSD.  And the downsides are so small!
> > In 2015, we should assume by default that most userland software is
> > using SIMD instructions.  As Eric noticed, Clang emits them freely.
> > What's the point to lazily saving the SSE registers on context
> > switches if essentially all programs compiled from Ports will be using
> > those registers anyway?  I agree with Jilles; I think we should always
> > save the SSE registers for userland programs.
>
> That's fine, but those benchmarks and improvements also have to take
> into account the environment that these programs are running in, and
> all of the other things that are going on with it.
>
> Fixing strlen() to use SSE2 is great, but if the gains are offset by
> fpu save/restore when doing fine grain locking that's blocking under
> real world workloads, what's the benefit? What about if the system is
> context switching over a million times a second? These are real life
> things I see servers running all of the above software /do/.
>
> One only knows with benchmarking, not microbenchmarking.
>
> Microbenchmarks are great. They serve a purpose, which is "how the
> heck does the current silicon I'm running on run some code that I've
> cleverly crafted to hopefully run well."
>
> I'm totally for saving/restoring SSE registers for userland programs.
> But that's not where that kind of "make stuff fast" work should stop.
> If it does, and that's where your benchmarking for the real world
> stops, then you're doing it wrong.
>
> Everything is a toss-up. For this userland based netmap packet pushing
> app, SSE may be nice for some instructions, but know what else screws
> things? The fact that the default scheduler policy is terrible and
> crap gets scheduled /everywhere/ under any appreciable amount of load.
> That the context switch rate is high, the interrupt rate is also high,
> and with a little locking going on, I see fpu save/restore occur for a
> non-insignificant fraction of CPU. Optimising strlen() or memcpy() is
> great, but when my system context switches a million times a second,
> we're never going to reach the steady state that these CPUs can really
> crank out real work at under those conditions.
>
> So, cool. Please keep poking at that stuff. But if you stop short of
> making the system actually /be able to take advantage of them under
> load/, I respectfully ask for a nice knob I can use to turn them off.
> :)
>
> -adrian
>
> (Know where the slowdowns for memcached are? Hint - not strlen or
> memcpy. Yes, I've been down that rabbit hole recently. Know what /i/
> have? 1 million UDP transactions a second working on 16 core
> sandybridge systems. Know what I didn't optimise? memcpy or strlen.
> The network stack locking and pthreads overhead is what sucks.)
> _______________________________________________
> freebsd-current at mailing list
> To unsubscribe, send any mail to "freebsd-current-unsubscribe at"

青木 知明  [Tomoaki AOKI]
    junchoon at
    MXE02273 at
