Re: pondering pi futexes

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Mon, 28 Jun 2021 11:20:25 UTC
On Sun, Jun 27, 2021 at 10:39:35PM +0300, Dmitry Chagin wrote:
> 
> Hi,
> some time ago I have changed Linuxulator futexes from sx lock to mtx.
> sx was used as it allows copyin/copyout with sx lock held.
> to use mtx I have changed the code like:
> 1. lock mtx;
> 2. disable_pagefaults;
> 3. copyin()
> 4. enable_pagefaults;
> 5. if error
>    - unlock mtx;
>    copyin();
>    if error == 0 goto 1.
> 
> it works (needto replace copyin() by fueword32()), but pondering pi futexes
> imlementation, I see that it is not possible to drop the futex lock on a return
> from msleep() path. 
> 
> below a simplified FUTEX_LOCK_PI operation, where on enter to the kernel current thread:
> 
> 0. acquire futex lock (which is mtx)
> 1. cmpset(0 -> current thread TID), return (0) on success;
> 2. fetch() from futex *uaddr (for TID of owner):
>    - check EDEADLK case (the futex word at *uaddr is already locked by the caller);
>    - check that is no waiters on *uaddr exists which is waiting via FUTEX_WAIT or
>      FUTEX_WAIT_BITSET, return (EINVAL) if so;
>    - cmpset(TID -> (FUTEX_WAITERS|TID));
>       - on error, the futex owner changed in user-space, repeat from 1.
> 3. Here we have: the owner, one waiter (current thread) and 0 or more waiters
>    sleeping on a waiting_proc. FUTEX_WAITERS bit is set, so any new waiters go to
>    the kernel and owner should unlock futex via the FUTEX_UNLOCK_PI op;
> 4. Try to find the thread which is associated with the owner’s TID:
>    - on error, something bad happened, owner died? Clean owner state link?
>      return (ESRCH). Or if no other waiters? Check this...
>    - on success:
>       - save owner state link to the struct futex (save priority);
>       - check the owner's priority, bump it if needed;
>       - put the current thread to the waiters list in descending priority order;
>       - change priority of all waiters if needed;
>       - msleep on a futex waiting_proc; come back with futex lock held;
>          - restore own priority? If last waiter?; [ponders..]
>          - on timeout return (ETIMEDOUT);
>          - the current thread is the new owner:
> bah!!    - store() the owner TID to *uaddr; [check what should I do on error..]
>          - release futex lock;
>          - return (0).
> 
> is it possible to hold *uaddr page to prevent page faults?

I did not followed exact algorithm you trying to describe.  Still, I can make
two points which could be useful for formulation of the working solution.

1. Umtx AKA FreeBSD native implementation of something very similar to
futex, has a concept of the umtx queue chain. The chain owns the mutex
lock used for 'fast' ops, but for situations like accesses to userspace,
we 'busy' the umtxq chain. De-facto busy state is the hand-rolled
sleepable lock, with usual interlocking against chain mutex, and
msleep/wakeup inter-thread notifications.

2. If you insist on accessing userspace pages while owning non-sleepable
locks, or even sleepable locks which do not compose cleanly with whole
VM+VFS+network (the later due to NFS) locking protocols, we do have a
technique that can be used.  Althought it is complicated and sometimes
slow.

You already noted half of it above, namely disable of the page faults and
copyin_nofault-like operations.  Second half is prefaulting, which is
done by vm_fault_quick_hold_pages().  There are two ways you can use it.

Either you hold the target userspace page, then remap it into KVA and
access userspace object through the kernel mapping.  This assumes that
the object fits into page, also it does not interact with parallel forks
and CoW, which you either have to accept to recheck after.  The usual
races, e.g. userspace remapping the held (actually wired) page under us,
is there, of course, but it is a race anyway.

Or, you hold the page but still access it using copyin_nofault.  Then any
reported error from copyin becomes fatal instead of an indicator to retry.
Error can happen for the same reason, because userspace could unmap or
remap this address under us.

Umtx implementation only uses busy on chains, might be because _nofault+
vm_fault_quick_hold() appeared later.