Robust mutexes implementation

Fri May 6 23:30:15 UTC 2016

On Thu, May 05, 2016 at 04:10:29PM +0300, Konstantin Belousov wrote:
> I implemented robust mutexes for our libthr.  A robust mutex is
> guaranteed to be cleared by the system upon either thread or process
> owner termination while the mutex is held.  The next mutex locker is
> then notified about inconsistent mutex state and can execute (or
> abandon) corrective actions.

> The patch mostly consists of small changes here and there, adding
> neccessary checks for the inconsistent and abandoned conditions into
> existing paths.  Additionally, the thread exit handler was extended
> to iterate over the userspace-maintained list of owned robust mutexes,
> unlocking and marking as terminated each of them.

> The list of owned robust mutexes cannot be maintained atomically
> synchronous with the mutex lock state (it is possible in kernel, but
> is too expensive).  Instead, for the duration of lock or unlock
> operation, the current mutex is remembered in a special slot that is
> also checked by the kernel at thread termination.

> Kernel must be aware about the per-thread location of the heads of
> robust mutex lists and the current active mutex slot.  Initially I tried
> to extend TCBs with this data, so only a single syscall at the
> threading library initialization would be needed: for any thread the
> location of TCB is known by kernel, and the syscall would pass
> offsets.  Unfortunately, on some architectures the size of TCB is part
> of the fixed ABI and cannot be changed.  Instead, when a thread touches
> a robust mutex for the first time, a new umtx op syscall is issued which
> informs about location of lists heads.

> The umtx sleep queues for PP and PI mutexes are split between
> non-robust and robust.  I do not understand the reasoning behind this
> POSIX requirement.

> Patch passes all glibc tests for robust mutexes I found in the nptl/
> directory.  See https://github.com/kostikbel/glibc-robust-tests .

> Patch is available at https://kib.kiev.ua/kib/pshared/robust.1.patch
> (beware of self-signed root certificate in the chain). Work was
> sponsored by The FreeBSD Foundation.

> Unrelated things in the patch:

> 1. Style.  Since I had to re-read whole sys/kern/kern_umtx.c,
>    lib/libthr/thread/thr_umtx.h and lib/libthr/thread/thr_umtx.c, I
>    started fixing the numerous style violations in these files, which
>    actually made my eyes bleed.

> 2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
>    pi mutexes.

> 3. Removal of the struct pthread_mutex m_owner field.  I cannot see
>    why it is useful.  The only optimization it provides is the
>    possibility to avoid clearing UMUTEX_CONTESTED bit when reading
>    m_lock.m_owner.  The disadvantages of having this duplicated field
>    is that kernel does not know about pthread_mutex, so cannot fix the
>    dup value.  Overall it is less work to clear UMUTEX_CONTESTED when
>    checking owner, then to try and handle inconsistencies.

>    I added the PMUTEX_OWNER_ID() macro to simplify code.

> 4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
>    the lifetime of the shared mutex associated with a vnode' page.
>    Apparently, there is real code around which expects the following
>    to work:
>    - mmap a file, create a shared mutex in the mapping;
>    - the process exits;
>    - another process starts, mmaps the same file and expects that the
>      previously initialized mutex is still usable.

>    The knob changes the lifetime of such shared off-page from the
>    'destroy on last unmap' to either 'until vnode is reclaimed' or
>    until 'pthread_mutex_destroy' called, whatever comes first.

The 'until vnode is reclaimed' bit sounds like a recipe for hard to
reproduce bugs.

I do think it is related to the robust mutex patch, though.

Without robust mutexes and assuming threads do not unmap the memory
while having a mutex locked or while waiting on a condition variable, it
is sufficient to create the off-page mutex/condvar automatically in its
initial state when pthread_mutex_lock() or pthread_cond_*wait() are
called and no off-page object exists.

With robust mutexes, we need to store somewhere whether the next thread
should receive [EOWNERDEAD] or not, and this should persist even if no
processes have the memory mapped. This might be done by replacing
THR_PSHARED_PTR with a different value in the pthread_mutex_t. I'm not
sure I like that memory write being done from the kernel though.

The below is a review of https://kib.kiev.ua/kib/pshared/robust.2.patch

> diff --git a/lib/libc/gen/Symbol.map b/lib/libc/gen/Symbol.map

It is not necessary to export _pthread_mutex_consistent,
_pthread_mutexattr_getrobust and _pthread_mutexattr_setrobust (under
FBSDprivate_1.0 symbol version). They are not used by name outside the
DSO, only via the jump table.

The same thing is true of many other FBSDprivate_1.0 symbols but there
is a difference between adding new unnecessary exports and removing
existing unnecessary exports.

> +		if ((error2 == 0 || error2 == EOWNERDEAD) && cancel)
>  			_thr_testcancel(curthread);

I don't think [EOWNERDEAD] should be swept under the carpet like this.
The cancellation cleanup handler will use the protected state and unlock
the mutex without making the state consistent and the state will be
unrecoverable.

> +void
> +_mutex_leave_robust(struct pthread *curthread, struct pthread_mutex *m)
> +{
> +
> +	if (!is_robust_mutex(m))
> +		return;

This accesses the mutex after writing a value to the lock
word which allows other threads to lock it. A use after free may result,
since it is valid for another thread to lock, unlock and destroy the
mutex (assuming the mutex is not used otherwise later).

Memory ordering may permit the load of m->m_lock.m_flags to be moved
before the actual unlock, so this issue may not actually appear.

Given that the overhead of a system call on every robust mutex unlock is
not desired, the kernel's unlock of a terminated thread's mutexes will
unavoidably have this use after free. However, for non-robust mutexes
the previous guarantees should be kept.

> +	int defered, error;

Typo, should be "deferred".

> +.It Bq Er EOWNERDEAD
> +The argument
> +.Fa mutex
> +points to the robust mutex and the previous owning thread terminated
> +while holding the mutex lock.
> +The lock was granted to the caller and it is up to the new owner
> +to make the state consistent.

"points to a robust mutex".

Perhaps add a See .Xr pthread_mutex_consistent 3 here.

> diff --git a/share/man/man3/pthread_mutex_consistent.3 b/share/man/man3/pthread_mutex_consistent.3

This man page should mention that pthread_mutex_consistent() can be
called when a mutex lock or condition variable wait failed with
[EOWNERDEAD].

> @@ -37,7 +37,11 @@ struct umutex {
> +	__uintptr_t		m_rb_lnk;	/* Robust linkage */

Although Linux also stores the robust list nodes in the mutexes like
this, I think it increases the chance of strange memory corruption.
Making the robust list an array in the thread's memory would be more
reliable. If the maximum number of owned robust mutexes can be small,
this can have a fixed size; otherwise, it needs to grow as needed (which
does add an allocation that may fail to the pthread_mutex_lock path,
bad).

> + * The umutex.m_lock values and bits.  The m_owner is the word which
> + * serves as the lock.  It's high bit is the contention indicator,
> + * rest of bits records the owner TID.  TIDs values start with PID_MAX
> + * + 2 and ends by INT32_MAX, the low range [1..PID_MAX] is guaranteed
> + * to be useable as the special markers.

Typo, "It's" should be "Its" and "ends" should be "end".

Bruce Evans would probably complain about comma splice (two times).

> +#define	UMTX_OP_ROBUST		26

The name is rather vague. Perhaps this can be something like
UMTX_OP_INIT_ROBUST.

-- 
Jilles Tjoelker