SSE in libthr

Sat Mar 28 06:15:48 UTC 2015

If SIMD instructions are used for string processing, and the FPU (AVX)
contexts are NOT saved/restored properly on a process (thread) switch,
the string being processed could be corrupted by another process (thread).
Couldn't that be a security risk? (A corrupted string parameter passed to
a syscall, etc.)

If so, FPU (AVX) contexts should be saved/restored at least on process
(thread) switching.

 *If SIMD instructions are NOT used in the kernel or in kernel modules
  at all, there would be no need to save/restore FPU contexts on
  interrupts.

This is not limited to system libraries. As Alan noted, third-party
applications can ship their own string processing code that uses SIMD.
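
To make the concern concrete, here is a rough sketch of the kind of SSE2
string routine an application might carry. The name simd_strlen is
hypothetical, and this is NOT the libc or libthr code under discussion; it
only assumes an x86 compiler with SSE2 intrinsics (GCC or Clang) and falls
back to strlen() elsewhere. While the loop runs, part of the string lives
in an XMM register, so it depends on that register surviving a context
switch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical illustration only; not the FreeBSD libc implementation. */
static size_t
simd_strlen(const char *s)
{
#if defined(__SSE2__)
	const char *p = s;
	const __m128i zero = _mm_setzero_si128();

	/* Step byte by byte until p is 16-byte aligned. */
	while (((uintptr_t)p & 15) != 0) {
		if (*p == '\0')
			return ((size_t)(p - s));
		p++;
	}

	for (;;) {
		/*
		 * An aligned 16-byte load stays within one page, so it
		 * cannot fault past the terminator, although it may read
		 * bytes beyond the end of the string (sanitizers may
		 * complain about that).
		 */
		__m128i chunk = _mm_load_si128((const __m128i *)p);
		/* One mask bit per byte; a set bit marks a NUL byte. */
		int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));

		if (mask != 0)
			return ((size_t)(p - s) +
			    (size_t)__builtin_ctz((unsigned int)mask));
		p += 16;
	}
#else
	/* No SSE2 available: plain libc fallback. */
	return (strlen(s));
#endif
}
```

If the XMM register holding the chunk were clobbered by another thread
between the load and the compare, the computed length would simply be
wrong, which is the kind of corrupted result I am worried about.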

On Fri, 27 Mar 2015 17:43:14 -0700
Adrian Chadd <adrian at freebsd.org> wrote:

> On 27 March 2015 at 16:03, Alan Somers <asomers at freebsd.org> wrote:
> > On Fri, Mar 27, 2015 at 4:36 PM, Adrian Chadd <adrian at freebsd.org> wrote:
> >> hi,
> >>
> >> please don't try to microoptimise crap like strlen().
> >>
> >> The TL;DR for performant high-throughput code is: if strlen() or
> >> memcpy() is the thing that's costing you the most, you're doing it
> >> wrong.
> >>
> >>
> >>
> >> -adrian
> >
> > I respectfully disagree.  A well-optimized libc will benefit
> > _every_single_program_ that uses strlen.  That includes Apache, Samba,
> > Memcached, Quake, and basically every single program that every single
> > FreeBSD user uses.  There's no reason that 3rd party software
> > maintainers should have to rewrite basic libc functions in order to
> > get decent performance on FreeBSD.  And the downsides are so small!
> > In 2015, we should assume by default that most userland software is
> > using SIMD instructions.  As Eric noticed, Clang emits them freely.
> > What's the point to lazily saving the SSE registers on context
> > switches if essentially all programs compiled from Ports will be using
> > those registers anyway?  I agree with Jilles; I think we should always
> > save the SSE registers for userland programs.
> 
> That's fine, but those benchmarks and improvements also have to take
> into account the environment that these programs are running in, and
> all of the other things that are going on with it.
> 
> Fixing strlen() to use SSE2 is great, but if the gains are offset by
> fpu save/restore when doing fine grain locking that's blocking under
> real world workloads, what's the benefit? What about if the system is
> context switching over a million times a second? These are real life
> things I see servers running all of the above software /do/.
> 
> One only knows with benchmarking, not microbenchmarking.
> 
> Microbenchmarks are great. They serve a purpose, which is "how the
> heck is the current silicon I'm running on run some code that I've
> cleverly crafted to hopefully run well."
> 
> I'm totally for saving/restoring SSE registers for userland programs.
> But that's not where that kind of "make stuff fast" work should stop.
> If it does, and that's where your benchmarking for the real world
> stops, then you're doing it wrong.
> 
> Everything is a toss-up. For this userland based netmap packet pushing
> app, SSE may be nice for some instructions, but know what else screws
> things? The fact that the default scheduler policy is terrible and
> crap gets scheduled /everywhere/ under any appreciable amount of load.
> That the context switch rate is high, the interrupt rate is also high,
> and with a little locking going on, I see fpu save/restore occur for a
> non-insignificant fraction of CPU. Optimising strlen() or memcpy() is
> great, but when my system context switches a million times a second,
> we're never going to reach the steady state that these CPUs can really
> crank out real work at under those conditions.
> 
> So, cool. Please keep poking at that stuff. But if you stop short of
> making the system actually /be able to take advantage of them under
> load/, I respectfully ask for a nice knob I can use to turn them off.
> :)
> 
> 
> 
> -adrian
> 
> (Know where the slowdowns for memcached are? Hint - not strlen or
> memcpy. Yes, I've been down that rabbit hole recently. Know what /i/
> have? 1 million UDP transactions a second working on 16 core
> sandybridge systems. Know what I didn't optimise? memcpy or strlen.
> The network stack locking and pthreads overhead is what sucks.)

-- 
青木 知明  [Tomoaki AOKI]
    junchoon at dec.sakura.ne.jp
    MXE02273 at nifty.com