sem_post() performance

Jilles Tjoelker jilles at stack.nl
Sun Sep 21 21:37:45 UTC 2014


It has been reported that POSIX semaphores are slow, in contexts such as
Python. Note that POSIX semaphores are the only synchronization objects
that support use by different processes in shared memory; this does not
work for mutexes and condition variables because pthread_mutex_t and
pthread_cond_t are merely pointers to the actual data structures.
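
For reference, this is the kind of process-shared use in question: an
unnamed semaphore placed in MAP_SHARED memory and operated on from both
sides of a fork(). A minimal sketch (error handling mostly omitted, not
taken from any particular consumer):

#include <semaphore.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	sem_t *sem;

	/* Place an unnamed semaphore in anonymous shared memory. */
	sem = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_SHARED, -1, 0);
	if (sem == MAP_FAILED)
		return (1);
	sem_init(sem, 1, 0);		/* pshared = 1 */

	if (fork() == 0) {
		sem_post(sem);		/* child posts */
		_exit(0);
	}
	sem_wait(sem);			/* parent waits on the same object */
	return (0);
}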

In fact, sem_post() unconditionally performs an umtx system call.

To avoid both lost wakeups and possible writes to a destroyed semaphore,
an uncontested sem_post() must check the _has_waiters flag atomically
with incrementing _count.
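
For illustration, a fast path that checks the flag and bumps the count
with two separate operations (hypothetical sketch, not proposed code) is
unsafe:

	/* Hypothetical, broken fast path -- for illustration only. */
	if (!sem->_kern._has_waiters) {
		/* <-- a waiter may set _has_waiters and go to sleep here */
		atomic_add_rel_int(&sem->_kern._count, 1);
		return (0);	/* wakeup lost: the umtx wake is never issued */
	}

Doing the increment before the flag check does not help either: once the
count is visible, the last waiter may consume it and the application may
legitimately destroy and free the semaphore while sem_post() still has
accesses to it outstanding.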

The proper way to do this would be to take one bit from _count and use
it for the _has_waiters flag; the definition of SEM_VALUE_MAX permits
this. However, this would require a new set of umtx semaphore operations
and would break the ABI of process-shared semaphores (things may break
if an old and a new libc access the same semaphore over shared memory).
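
Such a packed layout would look roughly like this (hypothetical fragment
in the style of _sem_post(); the matching kernel-side umtx operations
are exactly what does not exist today):

#define	SEM_HAS_WAITERS	0x80000000u	/* ok since SEM_VALUE_MAX <= 0x7fffffff */

	/* Hypothetical: waiters flag packed into the top bit of _count. */
	count = sem->_kern._count;
	if ((count & ~SEM_HAS_WAITERS) + 1 > SEM_VALUE_MAX)
		return (EOVERFLOW);
	if ((count & SEM_HAS_WAITERS) == 0 &&
	    atomic_cmpset_rel_int(&sem->_kern._count, count, count + 1))
		return (0);	/* uncontested post, no system call */
	/*
	 * Waiters present or the cmpset raced: fall back to a new umtx
	 * operation that understands the packed word.
	 */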

Instead, this diff only affects semaphores that are 32-bit aligned but
64-bit misaligned on 64-bit systems, because that is exactly the
placement where _has_waiters and _count together form a naturally
aligned 64-bit word; both fields are then changed with a single 64-bit
atomic operation. It probably needs a may_alias attribute for
correctness, but <sys/cdefs.h> does not have a wrapper for that.

Some x86 CPUs may cope with misaligned atomic ops without destroying
performance (especially if they do not cross a cache line), so the
alignment restriction could be relaxed to make the patch more practical.
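
For example (amd64-only, hypothetical), the gate on the fast path could
merely require that the 64-bit word holding _has_waiters and _count not
cross a cache line, assuming 64-byte lines; whether the misaligned
locked cmpxchg is then actually cheap would have to be measured:

	/*
	 * Hypothetical relaxed check: the _has_waiters/_count pair only
	 * has to fit within one 64-byte cache line instead of being
	 * 8-byte aligned; (p & 63) <= 56 means the 8-byte access does
	 * not cross a line boundary.
	 */
	if (((uintptr_t)&sem->_kern._has_waiters & 63) <= 56) {
		/* ... 64-bit cmpset fast path as in the diff below ... */
	}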

Many CPUs in the i386 architecture have a 64-bit atomic op (cmpxchg8b)
which could be used here.

This appears to restore the performance of 10-stable uncontested
semaphores with the strange alignment to 9-stable levels, measured with
a tight loop doing sem_wait() and sem_post(). I have not tested this in
any real workload.
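
A microbenchmark of this shape exercises the uncontested case (sketch;
not necessarily the exact program that was measured):

#include <semaphore.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec start, end;
	sem_t sem;
	int i, n = 10000000;

	sem_init(&sem, 0, 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < n; i++) {
		sem_post(&sem);		/* uncontested: no waiters exist */
		sem_wait(&sem);		/* never blocks: count is now 1 */
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	printf("%g ns per post/wait pair\n",
	    ((end.tv_sec - start.tv_sec) * 1e9 +
	    (end.tv_nsec - start.tv_nsec)) / n);
	return (0);
}

Note that the new fast path is only taken when the sem_t happens to land
at the strange 32-bit-but-not-64-bit alignment, so the placement of the
semaphore matters for what is being measured.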

Index: lib/libc/gen/sem_new.c
===================================================================
--- lib/libc/gen/sem_new.c	(revision 269952)
+++ lib/libc/gen/sem_new.c	(working copy)
@@ -437,6 +437,32 @@ _sem_post(sem_t *sem)
 	if (sem_check_validity(sem) != 0)
 		return (-1);
 
+#ifdef __LP64__
+	if (((uintptr_t)&sem->_kern._has_waiters & 7) == 0) {
+		uint64_t oldval, newval;
+
+		while (!sem->_kern._has_waiters) {
+			count = sem->_kern._count;
+			if (count + 1 > SEM_VALUE_MAX)
+				return (EOVERFLOW);
+			/*
+			 * Expect _count == count and _has_waiters == 0.
+			 */
+#if BYTE_ORDER == LITTLE_ENDIAN
+			oldval = (uint64_t)count << 32;
+			newval = (uint64_t)(count + 1) << 32;
+#elif BYTE_ORDER == BIG_ENDIAN
+			oldval = (uint64_t)count;
+			newval = (uint64_t)(count + 1);
+#else
+#error Unknown byte order
+#endif
+			if (atomic_cmpset_rel_64(
+			    (uint64_t *)&sem->_kern._has_waiters, oldval, newval))
+				return (0);
+		}
+	}
+#endif
 	do {
 		count = sem->_kern._count;
 		if (count + 1 > SEM_VALUE_MAX)

-- 
Jilles Tjoelker

