libthr shared locks

Konstantin Belousov kostikbel at
Wed Dec 23 17:25:41 UTC 2015

A well-known limitation of the FreeBSD libthr implementation of the
pthread locking objects is the missing support for process-shared
locks.  The hardest part of implementing the support is the necessity
of providing ABI compatibility with the current implementation.

Right now, the ABI-visible handle to the locks is a single pointer
word.  As a concrete example, to make the description less vague,
let us consider pthread_mutex_t.  It is defined in
sys/sys/_pthreadtypes.h as
typedef struct	pthread_mutex		*pthread_mutex_t;
After pthread_mutex_init(3) is called, the pointer points to the
following structure:
struct pthread_mutex {
	struct umutex			m_lock;
	int				m_flags;
	struct pthread			*m_owner;
	int				m_count;
	int				m_spinloops;
	int				m_yieldloops;
	TAILQ_ENTRY(pthread_mutex)	m_qe;
};
struct umutex {
	volatile __lwpid_t	m_owner;	/* Owner of the mutex */
	__uint32_t		m_flags;	/* Flags of the mutex */
	__uint32_t		m_ceilings[2];	/* Priority protect ceiling */
	__uint32_t		m_spare[4];
};

Were the ABI modified to make pthread_mutex_t large enough to
hold struct pthread_mutex, the rest of the implementation of the
shared mutex would be relatively trivial, if not already done.

Changing this ABI is very hard.  libthr provides symbol
versioning, which makes it possible to provide compatible shims for
the previous ABI variant.  But since userspace tends to embed the
pthread objects in the layouts of library objects, this causes
serious ABI issues when mixing libraries built against different
default versions of the ABI.
My idea for providing the shared locks, while not changing the ABI for
libthr, is to use marker pointers to indicate the shared objects.  The
real struct pthread_mutex, which carries the locking information, is
allocated at the off-page, backed by anonymous POSIX shared memory.
The marker is defined as
#define	THR_PSHARED_PTR						\
    ((void *)(uintptr_t)((1ULL << (NBBY * sizeof(long) - 1)) | 1))
The bit-pattern is 1000....0001.  There are two tricks used:
1. All correctly allocated objects in all supported ABIs are at least
   word-aligned, so the least-significant bit cannot be set.  This
   should make the THR_PSHARED_PTR pattern unique against non-shared
   objects.
2. The high bit is set, which makes the address non-canonical on
   amd64, so attempts to dereference the pointer are guaranteed to
   segfault, instead of relying on the corresponding page not being
   mapped, as on arches with weaker alignment requirements.

The majority of the libthr modifications follow an easy pattern: the
library stores THR_PSHARED_PTR into the handle upon initialization of
a shared object, allocates the off-page, and initializes the lock
there.  If a call assumes that the object is already initialized, then
we must not instantiate the off-page.  To speed up the lookup, a
cache is kept in userspace which translates the address of a lock to
its off-page.  Note that we can safely ignore possible unmapping of
the locks, since correct use of the pthread_* API assumes a call to
pthread_*_destroy() at the end of the object lifecycle.  If the lock
is remapped in usermode, the userspace off-page translation cache
misses, but the kernel returns the same shm on lookup, and we end up
with two off-page mappings, which is acceptable.

The kernel holds a lookup table which translates the (vm_object,
offset) pair, obtained by dereferencing the user-space address, into
the POSIX shared memory object.  The lifecycle of the shm objects is
bound to the existence of the corresponding vm object.

Note that the lifecycle of the kernel objects does not correspond well
to the lifecycle of the vnode vm object.  A closed vnode could be
recycled by VFS for whatever reason, and then we would lose the entry
in the registry.  I am not sure this is a very serious issue, since I
suppose that the typical use case assumes anonymous shared memory
backing.  Right now the kernel drops the off-page shm object on the
last vnode unmap.

Due to the backing by kernel objects, the implementation imposes
per-uid limits on the number of shared objects created.  An issue
is that no such limits exist in other implementations.

The overhead of the implementation, compared with non-process-shared
locks, is due to the mandatory off-page lookup, which is mostly
amortized by the (read-locked) userspace cache.  Also, each shared
lock consumes an additional page of memory, which works fine assuming
applications use a limited number of shared locks.  The cost for
non-shared locks is a single additional memory load per pthread_*
call.

Below are the timing results of my implementation on the 4-core sandy
machine against the Fedora 22 glibc, done with the same program on the
same hardware.

# time /root/pthread_shared_mtx1-64
iter1 10000000 aiter1 10000000 iter2 10000000 aiter2 10000000
./pthread_shared_mtx1-64  2.47s user 3.27s system 166% cpu 3.443 total

[kostik at sandy tests]$ /usr/bin/time ./pthread_shared_mtx1-linux64 
iter1 10000000 aiter1 10000000 iter2 10000000 aiter2 10000000
1.38user 2.46system 0:01.95elapsed 196%CPU (0avgtext+0avgdata 1576maxresident)k
0inputs+0outputs (0major+142minor)pagefaults 0swaps

The implementation in the patch
provides shared mutexes, condvars, rwlocks and barriers. I did some
smoke-testing, only on amd64. Robust mutexes are not implemented.
I want to finalize this part of the work before implementing
robustness, but some restructuring in the patch which may seem
arbitrary, like the rework of the normal/pp queues to live in arrays,
is preparation for the robustness feature.

The work was sponsored by The FreeBSD Foundation; previous and current
versions of the idea and the previous patch were discussed with John
Baldwin and Ed Maste.
