sem_post() performance

Wed Sep 24 14:12:05 UTC 2014

On Tuesday, September 23, 2014 11:20:00 PM Jilles Tjoelker wrote:
> On Mon, Sep 22, 2014 at 03:53:13PM -0400, John Baldwin wrote:
> > On Sunday, September 21, 2014 11:37:42 PM Jilles Tjoelker wrote:
> > > It has been reported that POSIX semaphores are slow, in contexts such as
> > > Python. Note that POSIX semaphores are the only synchronization objects
> > > that support use by different processes in shared memory; this does not
> > > work for mutexes and condition variables because they are pointers to
> > > the actual data structure.
> > > 
> > > In fact, sem_post() unconditionally performs an umtx system call.
> > 
> > *sigh*  I was worried that that might be the case.
> > 
> > > To avoid both lost wakeups and possible writes to a destroyed semaphore,
> > > an uncontested sem_post() must check the _has_waiters flag atomically
> > > with incrementing _count.
> > > 
> > > The proper way to do this would be to take one bit from _count and
> > > use it for the _has_waiters flag; the definition of SEM_VALUE_MAX
> > > permits this. However, this would require a new set of umtx
> > > semaphore operations and will break ABI of process-shared semaphores
> > > (things may break if an old and a new libc access the same semaphore
> > > over shared memory).
> > > 
> > > This diff only affects 32-bit aligned but 64-bit misaligned
> > > semaphores on 64-bit systems, and changes _count and _has_waiters
> > > atomically using a 64-bit atomic operation. It probably needs a
> > > may_alias attribute for correctness, but <sys/cdefs.h> does not have
> > > a wrapper for that.
> > 
> > It wasn't clear on first reading, but you are using aliasing to get
> > around the need for new umtx calls by using a 64-bit atomic op to
> > adjust two ints at the same time, yes?  Note that since a failing
> > semaphore op calls into the kernel for the "hard" case, you might in
> > fact be able to change the ABI without breaking process-shared
> > semaphores.  That is, suppose you left 'has_waiters' as always true
> > and reused the high bit of count for has_waiters.
> > 
> > Would old binaries always trap into the kernel?  (Not sure they will,
> > especially the case where an old binary creates the semaphore, a new
> > binary would have to force has_waiters to true in every sem op, but
> > even that might not be enough.)
> 
> I think that everything will break when a binary linked to old and new
> libcs use the same semaphore. If the new contested bit is set, the old
> sem_getvalue() will return garbage, the old sem_trywait() will fail even
> if the real count is greater than 0, the old sem_wait() and
> sem_timedwait() may spin if the real count is greater than 0 and the old
> sem_post() will fail with [EOVERFLOW].

Well, keep in mind you can't have a binary linked to both libc's generally 
(way too many things break if you do that, like passing the result of strdup() 
from one libc to free() of the other, etc.).  The real problem case is 
multiple binaries.  However, barring truly bizarre cases with people using 
funky LD_LIBRARY_PATH, at least one binary would have to be static to really 
get into trouble as in all but the bizarre cases all binaries will be sharing 
the same libc.so.7.

However, it was more wishful thinking on my part that we could arrange things
to force old binaries to still work (just slowly) while fixing things cleanly.

> > I think this looks ok, but it's a shame we can't fix it more cleanly.
> 
> I don't like at all that you have to either force an exotic alignment or
> use unaligned atomic ops.
> 
> I saw another interesting sem_ implementation in musl libc. It also uses
> a separate waiters word (but an int instead of a boolean), and is based
> on the futex calls (which we already have as wait_uint/wake umtx calls).
> It looks rather complicated but the musl libc people are usually good at
> getting this kind of stuff right. I suppose this will also break subtly
> when interoperating with old libc, and adding new umtx ops will simplify
> the algorithm.
> 
> Consideration: just declare mixing process-shared semaphores with
> sufficiently different libc unsupported, and change SEM_MAGIC to enforce
> that? (This does not prevent running old binaries, as long as they're
> dynamically linked to libc and you use a new libc.so.)

This only breaks old static binaries (note that we only recently got pshared
semaphores in 9.x or so).  You could even allow partial compat for those if 
needed by recognizing the old SEM_MAGIC when opening an existing shared 
semaphore (so that compat would work so long as an old binary created the 
semaphore).  That compat can even be added retroactively to libc if it turns
out we need it.  I probably think that is ok, but it would be an ABI breakage
that would need to be documented.  One thing that makes this slightly less
painful is that new-style semaphores in 9.0+ are per-chroot, so if you have
a jail that runs only old binaries, they will continue to work fine.

I'm cc'ing Konstantin to see what he thinks.

> Consideration 2: use the new implementation only for process-private
> semaphores and keep using the old slow one for process-shared
> semaphores?

This is certainly safe, just requires extra work in the implementation.

-- 
John Baldwin