sem_post() performance

John Baldwin jhb at freebsd.org
Mon Sep 22 20:36:13 UTC 2014


On Sunday, September 21, 2014 11:37:42 PM Jilles Tjoelker wrote:
> It has been reported that POSIX semaphores are slow, in contexts such as
> Python. Note that POSIX semaphores are the only synchronization objects
> that support use by different processes in shared memory; this does not
> work for mutexes and condition variables because they are pointers to
> the actual data structure.
> 
> In fact, sem_post() unconditionally performs an umtx system call.

*sigh*  I was worried that that might be the case.

> To avoid both lost wakeups and possible writes to a destroyed semaphore,
> an uncontested sem_post() must check the _has_waiters flag atomically
> with incrementing _count.
> 
> The proper way to do this would be to take one bit from _count and use
> it for the _has_waiters flag; the definition of SEM_VALUE_MAX permits
> this. However, this would require a new set of umtx semaphore operations
> and will break ABI of process-shared semaphores (things may break if an
> old and a new libc access the same semaphore over shared memory).
> 
> This diff only affects 32-bit aligned but 64-bit misaligned semaphores
> on 64-bit systems, and changes _count and _has_waiters atomically using
> a 64-bit atomic operation. It probably needs a may_alias attribute for
> correctness, but <sys/cdefs.h> does not have a wrapper for that.

It wasn't clear on first reading, but you are using aliasing to get around
the need for new umtx calls by using a 64-bit atomic op to adjust two
ints at the same time, yes?  Note that since a failing semaphore op calls
into the kernel for the "hard" case, you might in fact be able to change
the ABI without breaking process-shared semaphores.  That is, suppose you left
'has_waiters' as always true and reused the high bit of count for has_waiters.
Would old binaries then always trap into the kernel?  (I'm not sure they will,
especially in the case where an old binary creates the semaphore; a new binary
would have to force has_waiters to true in every sem op, and even that might
not be enough.)
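
FWIW, a minimal sketch of that layout (purely illustrative, not the posted
patch; SEM_WAITERS, sem_kern and sem_post_fastpath are made-up names, and it
uses C11 atomics rather than our <machine/atomic.h>) might look like:

#include <stdatomic.h>
#include <stdint.h>
#include <errno.h>

#define	SEM_WAITERS	0x80000000u	/* high bit of _count = has_waiters */
#define	SEM_COUNT_MAX	0x7fffffffu	/* what SEM_VALUE_MAX would become */

struct sem_kern {
	_Atomic uint32_t	word;	/* count | SEM_WAITERS */
};

static int
sem_post_fastpath(struct sem_kern *k)
{
	uint32_t old, new;

	old = atomic_load_explicit(&k->word, memory_order_relaxed);
	do {
		if ((old & SEM_WAITERS) != 0)
			return (1);	/* contested: fall back to the umtx syscall */
		if ((old & ~SEM_WAITERS) >= SEM_COUNT_MAX) {
			errno = EOVERFLOW;
			return (-1);
		}
		new = old + 1;
	} while (!atomic_compare_exchange_weak_explicit(&k->word, &old, new,
	    memory_order_release, memory_order_relaxed));
	return (0);			/* uncontested post, no syscall */
}

A new libc would also have to keep the old _has_waiters field pinned to 1 so
that old binaries always fall into the kernel, which is exactly the part I'm
not sure covers every case.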

> Some x86 CPUs may cope with misaligned atomic ops without destroying
> performance (especially if they do not cross a cache line), so the
> alignment restriction could be relaxed to make the patch more practical.
> 
> Many CPUs in the i386 architecture have a 64-bit atomic op (cmpxchg8b)
> which could be used here.
> 
> This appears to restore performance of 10-stable uncontested semaphores
> with the strange alignment to 9-stable levels (a tight loop with
> sem_wait and sem_post). I have not tested it in any real workload.
> 
> Index: lib/libc/gen/sem_new.c
> ===================================================================
> --- lib/libc/gen/sem_new.c	(revision 269952)
> +++ lib/libc/gen/sem_new.c	(working copy)
> @@ -437,6 +437,32 @@ _sem_post(sem_t *sem)
>  	if (sem_check_validity(sem) != 0)
>  		return (-1);
> 
> +#ifdef __LP64__
> +	if (((uintptr_t)&sem->_kern._count & 7) == 0) {
> +		uint64_t oldval, newval;
> +
> +		while (!sem->_kern._has_waiters) {
> +			count = sem->_kern._count;
> +			if (count + 1 > SEM_VALUE_MAX)
> +				return (EOVERFLOW);
> +			/*
> +			 * Expect _count == count and _has_waiters == 0.
> +			 */
> +#if BYTE_ORDER == LITTLE_ENDIAN
> +			oldval = (uint64_t)count << 32;
> +			newval = (uint64_t)(count + 1) << 32;
> +#elif BYTE_ORDER == BIG_ENDIAN
> +			oldval = (uint64_t)count;
> +			newval = (uint64_t)(count + 1);
> +#else
> +#error Unknown byte order
> +#endif
> +			if (atomic_cmpset_rel_64((uint64_t *)&sem->_kern._count,
> +			    oldval, newval))
> +				return (0);
> +		}
> +	}
> +#endif
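
As an aside, the may_alias wrapper mentioned above would presumably be
something along these lines (the names here are made up, not an existing
<sys/cdefs.h> macro):

#if defined(__GNUC__) || defined(__clang__)
#define	__sem_may_alias	__attribute__((__may_alias__))
#else
#define	__sem_may_alias
#endif

typedef uint64_t __sem_may_alias sem_u64_alias_t;

/*
 * The 64-bit cmpset would then go through the aliasing-safe type, e.g.
 * atomic_cmpset_rel_64((sem_u64_alias_t *)&sem->_kern._count, oldval, newval),
 * so the combined access to the two adjacent 32-bit fields is exempt from
 * strict-aliasing assumptions.
 */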

I think this looks ok, but it's a shame we can't fix it more cleanly.

-- 
John Baldwin

