SSE in libthr

Sat Mar 28 07:34:18 UTC 2015

On Fri, Mar 27, 2015 at 10:40:57PM +0100, Jilles Tjoelker wrote:
> On Fri, Mar 27, 2015 at 03:26:17PM -0400, Eric van Gyzen wrote:
> > In a nutshell:
> 
> > Clang emits SSE instructions on amd64 in the common path of
> > pthread_mutex_unlock.  This reduces performance by a non-trivial
> > amount.  I'd like to disable SSE in libthr.
> 
> How about saving and restoring the FPU/SSE state eagerly instead of the
> current CR0.TS-based lazy method? There is overhead associated with #NM
> exception handling (fpudna) which is not worth it if FPU/SSE are used
> often. This would apply to userland threads only; kernel threads
> normally do not use FPU/SSE and handle the FPU/SSE state manually if
> they do.
First, we have no choice but to save the FPU context when a thread is
switched out.  It is not practical to try to keep the state in the
hardware, since fetching it from another core is too troublesome.

Second, the biggest overhead of #NM is reading the FPU context from
memory (or cache), not the handler itself.  The save area on SSE-capable
machines, i.e. all amd64, is ~400 bytes, and XSAVEOPT does not help
much with reading the legacy FPU + XMM state.  It does help for YMM.
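
To make the ~400 bytes figure concrete, here is a rough sketch of the
legacy FXSAVE image in 64-bit mode, following the layout in the Intel
SDM.  The struct and function names are made up for illustration and
are not the kernel's own definitions (the kernel has its own savefpu
structures).  The full image is 512 bytes; the part that carries x87
and XMM state ends where the reserved tail begins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the legacy FXSAVE image in 64-bit mode. */
struct fxsave_area {
	uint16_t fcw;		/* x87 control word */
	uint16_t fsw;		/* x87 status word */
	uint8_t  ftw;		/* abridged x87 tag word */
	uint8_t  rsvd0;
	uint16_t fop;		/* last x87 opcode */
	uint64_t fip;		/* last x87 instruction pointer */
	uint64_t fdp;		/* last x87 data pointer */
	uint32_t mxcsr;
	uint32_t mxcsr_mask;
	uint8_t  st[8][16];	/* ST0-ST7, 10 bytes used per 16-byte slot */
	uint8_t  xmm[16][16];	/* XMM0-XMM15 */
	uint8_t  rsvd1[96];	/* reserved / available to software */
};

static_assert(sizeof(struct fxsave_area) == 512,
    "FXSAVE image is 512 bytes");

/* Bytes that actually carry x87 + XMM state (everything before rsvd1). */
size_t
fxsave_live_bytes(void)
{
	return (offsetof(struct fxsave_area, rsvd1));
}
```

fxsave_live_bytes() evaluates to 416 here: 32 bytes of header, 128 for
the x87 registers and 256 for XMM0-XMM15.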

That said, your proposal would force all threads to pay a higher cost
at context switch time, increasing latency.

> 
> There is performance improvement potential in using SSE for optimizing
> string functions, for example. Even a simple SSE2 strlen easily
> outperforms the already optimized lib/libc/string/strlen.c in a
> microbenchmark, and many other string functions are slow byte-at-a-time
> implementations.

If the program does a lot of work with the FPU between switches, the
cost is obviously amortized.  Note that even for the worst case
of the reported microbenchmark, the measured overhead is ~10-15%.
So if string ops indeed take a significant share of the program time,
the FPU #NM handling cost should be very low even with the current
scheme.
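
For reference, the kind of simple SSE2 strlen mentioned in the quoted
text could look roughly like the sketch below.  This is an illustration
only, not code from libc or from any posted patch; the name sse2_strlen
is made up, and __builtin_ctz assumes GCC or Clang.  It relies on the
usual trick that an aligned 16-byte load never crosses a page boundary,
so reading a few bytes outside the string within the same aligned block
does not fault (it is still outside what ISO C guarantees and would
upset tools like ASan).

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

size_t
sse2_strlen(const char *s)
{
	assert(s != NULL);

	const __m128i zero = _mm_setzero_si128();
	/* Round down to a 16-byte boundary; 'skip' bytes precede s. */
	uintptr_t p = (uintptr_t)s & ~(uintptr_t)15;
	unsigned skip = (unsigned)((uintptr_t)s & 15);

	/* One bit per byte of the block, set where the byte is NUL. */
	__m128i chunk = _mm_load_si128((const __m128i *)p);
	unsigned mask =
	    (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
	mask &= ~0u << skip;	/* ignore bytes before the string start */

	while (mask == 0) {
		p += 16;
		chunk = _mm_load_si128((const __m128i *)p);
		mask = (unsigned)_mm_movemask_epi8(
		    _mm_cmpeq_epi8(chunk, zero));
	}

	/* Lowest set bit is the first NUL within the current block. */
	return ((size_t)(p + (unsigned)__builtin_ctz(mask) - (uintptr_t)s));
}
```

Every call touches XMM registers, so a thread using such a routine
takes the #NM path once after each switch under the current lazy
scheme, and the rest of the calls in that time slice run at full speed.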