Prefaulting for i/o buffers

Konstantin Belousov kostikbel at gmail.com
Sun Feb 5 16:47:36 UTC 2012


On Sun, Feb 05, 2012 at 12:43:46PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Feb 03, 2012 at 09:37:19PM +0200, Konstantin Belousov wrote:
> > The FreeBSD I/O infrastructure has a well-known deadlock issue caused
> > by a vnode lock order reversal when the buffers supplied to the read(2)
> > or write(2) syscalls are backed by an mmaped file.
> > 
> > I previously published patches to convert the i/o path to use VMIO,
> > based on Jeff Roberson's proposal; see
> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> > deadlock. Since that work is very intrusive and did not get any
> > follow-up, it got stalled.
> > 
> > Below is a very lightweight patch whose only goal is to fix the
> > deadlock in the least intrusive way. This became possible after FreeBSD
> > gained the vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9) KPIs.
> > http://people.freebsd.org/~kib/misc/vm1.3.patch
> 
> +struct rl_q_entry *
> +rlqentry_alloc()
> 
> Missing 'void'.
Yes, it is a style inconsistency. Fixed.

> 
> +static int
> +rangelock_incompatible(const struct rl_q_entry *e1,
> +    const struct rl_q_entry *e2)
> +{
> +
> +	if ((e1->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ &&
> +	    (e2->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ)
> +		return (0);
> +#define	IN_RANGE(a, e) (a >= e->rl_q_start && a < e->rl_q_end)
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
> +		return (1);
> +#undef	IN_RANGE
> +	return (0);
> +}
> 
> The checks here are a bit wrong and incomplete.
> 
> This is correct:
> 
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
> 
> This is not:
> 
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
> 
> After simplifying one of those checks we get:
> 
> 	if (end1 >= start2 && end1 < end2)
> 
> This will be true if end1 == start2, but in this case it should be
> false.
Agreed, my tests are too strict there. But note that the only consequence
is that some lock requests are postponed.

> 
> There are also some checks missing. If the first range covers entire
> second range or vice-versa you will return that the locks are
> compatible, eg.
> 
> 	start1-----start2-----end2-----end1
Doesn't IN_RANGE(e2->rl_q_start, e1) cover this case?

In essence, the checks I wrote verify whether any border of one region
lies inside the other region.
> 
> The following check should cover all the cases:
> 
> 	if (e1->rl_q_start < e2->rl_q_end &&
> 	    e1->rl_q_end > e2->rl_q_start) {
> 		return (1);
> 	}
Your check is much more elegant; I changed the code as you suggested.
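
For reference, a minimal sketch of how the check reads after that change,
assuming the rl_q_entry layout from the quoted patch; ranges are half-open
[rl_q_start, rl_q_end):

static int
rangelock_incompatible(const struct rl_q_entry *e1,
    const struct rl_q_entry *e2)
{

	/* Two read locks never conflict. */
	if ((e1->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ &&
	    (e2->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ)
		return (0);
	/*
	 * Half-open ranges overlap iff each one starts strictly before
	 * the other one ends.
	 */
	if (e1->rl_q_start < e2->rl_q_end && e1->rl_q_end > e2->rl_q_start)
		return (1);
	return (0);
}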

> 
> +static void
> +rangelock_calc_block(struct rangelock *lock)
> +{
> +	struct rl_q_entry *entry, *entry1, *whead;
> +
> +	if (lock->rl_currdep == TAILQ_FIRST(&lock->rl_waiters) &&
> +	    lock->rl_currdep != NULL)
> +		lock->rl_currdep = TAILQ_NEXT(lock->rl_currdep, rl_q_link);
> +	for (entry = lock->rl_currdep; entry;
> +	     entry = TAILQ_NEXT(entry, rl_q_link)) {
> +		TAILQ_FOREACH(entry1, &lock->rl_waiters, rl_q_link) {
> +			if (rangelock_incompatible(entry, entry1))
> +				goto out;
> +			if (entry1 == entry)
> +				break;
> +		}
> +	}
> +out:
> +	lock->rl_currdep = entry;
> +	TAILQ_FOREACH(whead, &lock->rl_waiters, rl_q_link) {
> +		if (whead == lock->rl_currdep)
> +			break;
> +		if (!(whead->rl_q_flags & RL_LOCK_GRANTED)) {
> +			whead->rl_q_flags |= RL_LOCK_GRANTED;
> +			wakeup(whead);
> +		}
> +	}
> +}
> 
> This function doesn't look like it is going to scale. Searching for
> incompatible locks looks very expensive.
Well, it is not. I am aware of the tree structure used for range locks,
but this will probably have to wait for some time.

> 
> This function seems to be responsible for waking up waiters. In case of
> range locking this might be tricky, as unlocking a read lock on a given
> range doesn't mean the range can be exclusively locked by a waiter:
> part of this range might be read-locked multiple times, or a different
> range might be read-locked which is also covered by the waiting
> writer.
> 
> Could you elaborate a bit on how this all works together? Does the
> function look through all the waiters and for each waiter go through
> all the active locks?
> 
> Do you protect somehow against starving writers?
Due to a different review, I have already added comments to the rangelock
implementation.

Basically, the rangelocks in kern_rangelock.c are fully fair. The locks
are granted in the order of the waiters' arrival (the order is defined as
the order of acquiring the interlock; in the latest version it is a sleep
channel spinlock), so a later reader cannot be granted ahead of an earlier
blocked writer.

Freeing a range causes a rescan of the queue of waiters starting from
rl_currdep, which points to the first blocked waiter. If the next waiter
is compatible with all granted regions, it is moved before rl_currdep and
woken up.
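
As an illustration only (the unlock function is not shown in the quoted
hunks, so its name and signature below are assumptions):

/*
 * Suppose A read-locks [0, 100), then writer B requests [50, 60) and
 * blocks, becoming rl_currdep.  A later reader C requesting [0, 10) is
 * compatible with A, but the rescan stops at B, so C waits too; writers
 * are therefore not starved.  Once A unlocks, the rescan grants B, and
 * C as well since it does not overlap B.
 */
static void
locked_io_example(struct rangelock *lock, off_t base, size_t len,
    struct mtx *ilk)
{
	void *cookie;

	cookie = rangelock_rlock(lock, base, len, ilk);
	/* ... perform the i/o on [base, base + len) ... */
	rangelock_unlock(lock, cookie, ilk);	/* assumed signature */
}
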
> 
> +	KASSERT(entry->rl_q_flags & RL_LOCK_GRANTED, ("XXX"));
> +	KASSERT(entry->rl_q_start == base, ("XXX"));
> +	KASSERT(entry->rl_q_end >= base + len, ("XXX"));
> 
> I expect those XXX will be replaced with real messages in the final
> version? We could really use simple ASSERT() without obligatory message
> in the kernel...
> 
We have MPASS() for that. These XXXs are fixed.

> +void *
> +rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_READ;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
> +
> +void *
> +rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_WRITE;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
> 
> Those functions are nearly identical, maybe they can be converted to
> something like this:
> 
> static void *
> rangelock_lock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk,
>     int flags)
> {
> 	struct rl_q_entry *entry;
> 	struct thread *td;
> 
> 	td = curthread;
> 	if (td->td_rlqe != NULL) {
> 		entry = td->td_rlqe;
> 		td->td_rlqe = NULL;
> 	} else
> 		entry = rlqentry_alloc();
> 	entry->rl_q_flags = flags;
> 	entry->rl_q_start = base;
> 	entry->rl_q_end = base + len;
> 	return (rangelock_enqueue(lock, entry, ilk));
> }
> 
> void *
> rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
> 
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_READ));
> }
> 
> void *
> rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
> 
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_WRITE));
> }
> 
> Or even merge rangelock_lock() with rangelock_enqueue()?
I did exactly this.

> 
> @@ -199,6 +200,7 @@ thread_init(void *mem, int size, int flags)
>  
>  	td->td_sleepqueue = sleepq_alloc();
>  	td->td_turnstile = turnstile_alloc();
> +	td->td_rlqe = rlqentry_alloc();
> 
> What is the purpose of the td_rlqe field? From what I see a thread can lock
> more than one range. Could you elaborate on that as well?
td_rlqe is a cached rangelock entry used for the typical case of a thread
not doing recursive i/o. In this case, it is possible to avoid memory
allocation on the i/o hot path by using the entry allocated at thread
creation time. Since thread creation already allocates many chunks of
memory, and a typical thread performs far more i/o operations than it
undergoes creations, this shortens the i/o path.

If td_rlqe is already consumed, a new entry is allocated at lock time.
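
A sketch of the caching idea (the unlock-side refill and rlqentry_free()
are assumptions here; the quoted hunks only show the allocation side):

static struct rl_q_entry *
rlqentry_get(struct thread *td)
{
	struct rl_q_entry *entry;

	if (td->td_rlqe != NULL) {
		/* Fast path: reuse the entry cached at thread creation. */
		entry = td->td_rlqe;
		td->td_rlqe = NULL;
	} else {
		/* Recursive i/o already consumed the cache; allocate. */
		entry = rlqentry_alloc();
	}
	return (entry);
}

static void
rlqentry_put(struct thread *td, struct rl_q_entry *entry)
{

	if (td->td_rlqe == NULL)
		td->td_rlqe = entry;		/* refill the per-thread cache */
	else
		rlqentry_free(entry);		/* assumed helper */
}
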
> 
> Do you assert somehow that the same thread is trying to write-lock the
> range it already read-locked?
No. With the latest patch it indeed can be done, because I started to
record range owners. For now, I added a TODO comment mentioning the
possible assert.

> 
> --- a/sys/kern/subr_syscall.c
> +++ b/sys/kern/subr_syscall.c
> @@ -177,6 +177,12 @@ syscallret(struct thread *td, int error, struct syscall_args *sa __unused)
>  	KASSERT(td->td_locks == 0,
>  	    ("System call %s returning with %d locks held",
>  	     syscallname(p, sa->code), td->td_locks));
> +	KASSERT((td->td_pflags & TDP_NOFAULTING) == 0,
> +	    ("System call %s returning with pagefaults disabled",
> +	     syscallname(p, sa->code)));
> +	KASSERT((td->td_pflags & TDP_NOSLEEPING) == 0,
> +	    ("System call %s returning with sleep disabled",
> +	     syscallname(p, sa->code)));
> 
> We can commit those separately, right?
Definitely. Rangelocks will be committed separately too. The vfs_vnops.c
change is self-contained.

> 
>  #ifdef MAC
>  	if ((ioflg & IO_NOMACCHECK) == 0) {
> -		if (rw == UIO_READ)
> -			error = mac_vnode_check_read(active_cred, file_cred,
> -			    vp);
> -		else
> +		if (rw == UIO_WRITE)
>  			error = mac_vnode_check_write(active_cred, file_cred,
>  			    vp);
>  	}
> 
> I don't see the mac_vnode_check_read() call being moved anywhere.
> Was that intended?
No, thank you for pointing this out. mdf already noted it.

> 
> + * The vn_io_fault() is a wrapper around vn_read() and vn_write() to
> + * prevent the following deadlock:
> + *
> + * Assume that the thread A reads from the vnode vp1 info userspace
> 
> s/info/into/
> 
> + * buffer buf1 backed by the pages of vnode vp2.  If a page in buf1 is
> + * currently not resident, then system ends up with the call chain
> + *   vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
> + *     vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
> + * which establishes lock order vp1->vn_lock, then vp2->vn_lock.
> + * If, at the same time, thread B reads from vnode vp2 into buffer buf2
> + * backed by the pages of vnode vp1, and some page in buf2 is not
> + * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
> + *
> + * To prevent the lock order reversal and deadlock, vn_io_fault() does
> + * not allow page faults to happen during VOP_READ or VOP_WRITE().
> 
> s/VOP_READ/VOP_READ()/
> 
> +		prot = VM_PROT_READ;
> +		if ((fp->f_flag & O_APPEND) || !(flags & FOF_OFFSET))
> +			/* For appenders, punt and lock the whole range. */
> +			rl_cookie = vn_rangelock_wlock(vp, 0, (size_t)-1);
> 
> On 32bit archs size_t is still 32bit, AFAIR, and the file size can be bigger
> than that. I think we should also use off_t for length (GEOM does that).
Agreed, but I have already moved to specifying (start, end) instead of (start, len).
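
For readers following the thread, a rough sketch of the hold-then-copy
idea behind the patch. This is not the patch itself: vnode locking,
splitting of long requests and the EFAULT fallback are all omitted, the
helper name and the IO_HOLD_CNT value are made up, and only the KPIs
named above (vm_fault_quick_hold_pages(9), vm_fault_disable_pagefaults(9))
are real.

#define	IO_HOLD_CNT	16		/* illustrative, not the patch value */

static int
vn_io_fault_sketch(struct vnode *vp, struct uio *uio, int ioflag,
    struct ucred *cred, vm_prot_t prot)
{
	vm_page_t ma[IO_HOLD_CNT + 2];	/* + 2 for partial pages at the ends */
	vm_offset_t addr;
	int cnt, error, save;

	/* Fault in and hold the pages backing the user buffer. */
	addr = (vm_offset_t)uio->uio_iov->iov_base;
	cnt = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    addr, uio->uio_iov->iov_len, prot, ma, IO_HOLD_CNT + 2);
	if (cnt == -1)
		return (EFAULT);

	/*
	 * With the buffer pages held, uiomove() should not need to fault;
	 * disabling page faults turns any unexpected fault into an error
	 * instead of a recursive page fault under the vnode lock.
	 */
	save = vm_fault_disable_pagefaults();
	if (prot == VM_PROT_WRITE)
		error = VOP_READ(vp, uio, ioflag, cred);  /* read(2) stores into buf */
	else
		error = VOP_WRITE(vp, uio, ioflag, cred);
	vm_fault_enable_pagefaults(save);

	vm_page_unhold_pages(ma, cnt);
	return (error);
}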

> 
> +		len = uio->uio_iov->iov_len;
> +		addr = (vm_offset_t)uio->uio_iov->iov_base;
> +		end = round_page(addr + len);
> +		cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +		if (cnt > io_hold_cnt) {
> +			len = io_hold_cnt * PAGE_SIZE;
> +again:
> +			end = round_page(addr + len);
> +			cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +			if (cnt > io_hold_cnt) {
> +				len -= PAGE_SIZE;
> +				goto again;
> +			}
> +		}
> 
> This is a bit hard to understand immediately. Maybe something like the
> following is a bit more readable (I assume this goto could happen only
> once):
> 
> 		len = MIN(uio->uio_iov->iov_len, io_hold_cnt * PAGE_SIZE);
> 		addr = (vm_offset_t)uio->uio_iov->iov_base;
> 		end = round_page(addr + len);
> 		if (howmany(end - trunc_page(addr), PAGE_SIZE) > io_hold_cnt) {
> 			len -= PAGE_SIZE;
> 			end = round_page(addr + len);
> 		}
I think it can happen twice, if both the start and the length are perfectly misaligned.
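
A worked example of the arithmetic with illustrative values (PAGE_SIZE
4096 = 0x1000, io_hold_cnt 16), showing the single extra pass caused by a
misaligned address; whether a second pass can ever be needed is exactly
the question above:

	addr = 0x10200, so trunc_page(addr) = 0x10000

	len = 16 * 4096 = 0x10000
	end = round_page(0x10200 + 0x10000) = 0x21000
	cnt = howmany(0x21000 - 0x10000, 4096) = 17 > 16  => len -= 4096

	len = 0xf000
	end = round_page(0x10200 + 0xf000) = 0x20000
	cnt = howmany(0x20000 - 0x10000, 4096) = 16       => done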

> 
> +vn_truncate(struct file *fp, off_t length, struct ucred *active_cred,
> +    struct thread *td)
> [...]
> +	/*
> +	 * Lock the range where the shortening takes place. Increasing the
> +	 * file size does not need the rangelock, but it is faster to lock
> +	 * the range than to call VOP_GETATTR to get the current size and
> +	 * deal with races.
> +	 */
> +	rl_cookie = vn_rangelock_wlock(vp, length, -1);
> 
> Hmm. In case of increasing file size don't we need the range lock
> because we might be racing with write append?
We need to prevent a split i/o from being performed on disjoint ranges
of the file if a truncation happens between the two parts of the split.

I changed the rangelock obtained in vn_truncate() to cover the whole range.
The bug appeared because I imported the rangelock code from the bigger
patch, which indeed handled expansions.
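
In other words, the call in vn_truncate() now looks roughly like the
following (the exact end constant is an assumption here, given the switch
to (start, end)):

	/*
	 * Lock the whole range so that a truncation cannot slip in between
	 * the two halves of a split vn_io_fault() request.
	 */
	rl_cookie = vn_rangelock_wlock(vp, 0, OFF_MAX);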

Thank you for pointing this out.
> 
> BTW. Comments in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c
> might be a useful read.

Thank you for the review.

http://people.freebsd.org/~kib/misc/vm1.8.patch