Prefaulting for i/o buffers
Konstantin Belousov
kostikbel at gmail.com
Sun Feb 5 16:47:36 UTC 2012
On Sun, Feb 05, 2012 at 12:43:46PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Feb 03, 2012 at 09:37:19PM +0200, Konstantin Belousov wrote:
> > The FreeBSD I/O infrastructure has a well-known deadlock issue, caused
> > by a vnode lock order reversal when the buffers supplied to the read(2)
> > or write(2) syscalls are backed by an mmaped file.
> >
> > I previously published patches to convert the i/o path to use VMIO,
> > based on the Jeff Roberson proposal, see
> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> > deadlock. Since that work is very intrusive and did not get any
> > follow-up, it has stalled.
> >
> > Below is a very lightweight patch whose only goal is to fix the
> > deadlock in the least intrusive way. This became possible after
> > FreeBSD gained the vm_fault_quick_hold_pages(9) and
> > vm_fault_disable_pagefaults(9) KPIs.
> > http://people.freebsd.org/~kib/misc/vm1.3.patch
>
> +struct rl_q_entry *
> +rlqentry_alloc()
>
> Missing 'void'.
Yes, it is a style inconsistency. Fixed.
>
> +static int
> +rangelock_incompatible(const struct rl_q_entry *e1,
> + const struct rl_q_entry *e2)
> +{
> +
> + if ((e1->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ &&
> + (e2->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ)
> + return (0);
> +#define IN_RANGE(a, e) (a >= e->rl_q_start && a < e->rl_q_end)
> + if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
> + IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
> + return (1);
> +#undef IN_RANGE
> + return (0);
> +}
>
> The checks here are a bit wrong and incomplete.
>
> This is correct:
>
> + if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
>
> This is not:
>
> + IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
>
> After simplifying one of those checks we get:
>
> if (end1 >= start2 && end1 < end2)
>
> This will be true if end1 == start2, but in this case it should be
> false.
Agreed, my tests are too strict there. But note that the only consequence
is that some lock requests are postponed.
>
> There are also some checks missing. If the first range covers entire
> second range or vice-versa you will return that the locks are
> compatible, eg.
>
> start1-----start2-----end2-----end1
Doesn't IN_RANGE(e2->rl_q_start, e1) cover this case?
In essence, the checks I wrote verify whether any region border lies
inside the other region.
>
> The following check should cover all the cases:
>
> if (e1->rl_q_start < e2->rl_q_end &&
> e1->rl_q_end > e2->rl_q_start) {
> return (1);
> }
Your check is much more elegant; I changed the code as you suggested.
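For the archive, here is a tiny user-space illustration of the difference
on adjacent ranges (purely illustrative, not part of the patch):

#include <stdio.h>

struct range { long start, end; };	/* half-open: [start, end) */

#define IN_RANGE(a, r) ((a) >= (r).start && (a) < (r).end)

/* The original border-based test. */
static int
borders_incompatible(struct range a, struct range b)
{
	return (IN_RANGE(a.start, b) || IN_RANGE(b.start, a) ||
	    IN_RANGE(a.end, b) || IN_RANGE(b.end, a));
}

/* The suggested overlap test. */
static int
overlap_incompatible(struct range a, struct range b)
{
	return (a.start < b.end && a.end > b.start);
}

int
main(void)
{
	struct range r1 = { 0, 4096 }, r2 = { 4096, 8192 };

	/* Adjacent, non-overlapping ranges: prints "1 0". */
	printf("%d %d\n", borders_incompatible(r1, r2),
	    overlap_incompatible(r1, r2));
	return (0);
}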
>
> +static void
> +rangelock_calc_block(struct rangelock *lock)
> +{
> + struct rl_q_entry *entry, *entry1, *whead;
> +
> + if (lock->rl_currdep == TAILQ_FIRST(&lock->rl_waiters) &&
> + lock->rl_currdep != NULL)
> + lock->rl_currdep = TAILQ_NEXT(lock->rl_currdep, rl_q_link);
> + for (entry = lock->rl_currdep; entry;
> + entry = TAILQ_NEXT(entry, rl_q_link)) {
> + TAILQ_FOREACH(entry1, &lock->rl_waiters, rl_q_link) {
> + if (rangelock_incompatible(entry, entry1))
> + goto out;
> + if (entry1 == entry)
> + break;
> + }
> + }
> +out:
> + lock->rl_currdep = entry;
> + TAILQ_FOREACH(whead, &lock->rl_waiters, rl_q_link) {
> + if (whead == lock->rl_currdep)
> + break;
> + if (!(whead->rl_q_flags & RL_LOCK_GRANTED)) {
> + whead->rl_q_flags |= RL_LOCK_GRANTED;
> + wakeup(whead);
> + }
> + }
> +}
>
> This function doesn't look like it is going to scale. Searching for
> incompatible locks looks very expensive.
Well, in practice it is not. I am aware of the option of using a tree
structure for the range lock, but that will probably have to wait for
some time.
>
> This function seems to be responsible for waking up waiters. In case of
> range locking this might be tricky, as unlocking read lock on the given
> range doesn't mean the range can be exclusively locked by a waiter, as
> part of this range might be read-locked multiple times or different
> range might be read locked, but which is also covered by the waiting
> writer.
>
> Could you elaborate a bit on how this all works together? Does the
> function look through all the waiters and for each waiter goes though
> all the active locks?
>
> Do you protect somehow against starving writers?
Due to a different review, I have already added comments to the rangelock
implementation.
Basically, the rangelocks in kern_rangelock.c are fully fair. The locks
are granted in the order of the waiters' arrival (the order is defined as
the order of acquiring the interlock; in the latest version it is a
sleep-channel spinlock).
Freeing some range causes a rescan of the queue of waiters, starting
from rl_currdep. rl_currdep points to the first blocked waiter.
If the next waiter is compatible with all granted regions, it is moved
before rl_currdep and woken up.
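As a worked example (hypothetical ranges): suppose a read lock on
[0, 100) is granted and the wait queue contains

    W1: write [50, 60)    <- rl_currdep (blocked: it overlaps the reader)
    W2: read  [200, 300)   (conflicts with nothing, but is queued after W1)

W2 is not granted even though it is compatible with every granted range,
because the scan stops at the first blocked waiter; this is what prevents
a stream of later readers from starving the writer. Once the read lock on
[0, 100) is dropped, the rescan starts at W1, finds it compatible with all
earlier (granted) entries, grants and wakes it, then checks W2 against W1,
finds the ranges disjoint, and grants W2 as well.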
>
> + KASSERT(entry->rl_q_flags & RL_LOCK_GRANTED, ("XXX"));
> + KASSERT(entry->rl_q_start == base, ("XXX"));
> + KASSERT(entry->rl_q_end >= base + len, ("XXX"));
>
> I expect those XXX will be replaced with real messages in the final
> version? We could really use simple ASSERT() without obligatory message
> in the kernel...
>
We have MPASS(). These XXXs are fixed.
> +void *
> +rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> + struct rl_q_entry *entry;
> + struct thread *td;
> +
> + td = curthread;
> + if (td->td_rlqe != NULL) {
> + entry = td->td_rlqe;
> + td->td_rlqe = NULL;
> + } else
> + entry = rlqentry_alloc();
> + entry->rl_q_flags = RL_LOCK_READ;
> + entry->rl_q_start = base;
> + entry->rl_q_end = base + len;
> + return (rangelock_enqueue(lock, entry, ilk));
> +}
> +
> +void *
> +rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> + struct rl_q_entry *entry;
> + struct thread *td;
> +
> + td = curthread;
> + if (td->td_rlqe != NULL) {
> + entry = td->td_rlqe;
> + td->td_rlqe = NULL;
> + } else
> + entry = rlqentry_alloc();
> + entry->rl_q_flags = RL_LOCK_WRITE;
> + entry->rl_q_start = base;
> + entry->rl_q_end = base + len;
> + return (rangelock_enqueue(lock, entry, ilk));
> +}
>
> Those functions are nearly identical, maybe they can be converted to
> something like this:
>
> static void *
> rangelock_lock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk,
> int flags)
> {
> struct rl_q_entry *entry;
> struct thread *td;
>
> td = curthread;
> if (td->td_rlqe != NULL) {
> entry = td->td_rlqe;
> td->td_rlqe = NULL;
> } else
> entry = rlqentry_alloc();
> entry->rl_q_flags = flags;
> entry->rl_q_start = base;
> entry->rl_q_end = base + len;
> return (rangelock_enqueue(lock, entry, ilk));
> }
>
> void *
> rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> return (rangelock_lock(lock, base, len, ilk, RL_LOCK_READ));
> }
>
> void *
> rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> return (rangelock_lock(lock, base, len, ilk, RL_LOCK_WRITE));
> }
>
> Or even merge rangelock_lock() with rangelock_enqueue()?
Done; I did exactly that.
>
> @@ -199,6 +200,7 @@ thread_init(void *mem, int size, int flags)
>
> td->td_sleepqueue = sleepq_alloc();
> td->td_turnstile = turnstile_alloc();
> + td->td_rlqe = rlqentry_alloc();
>
> What is the purpose of td_rlqe field? From what I see thread can lock
> more than one range. Could you elaborate on that as well?
td_rlqe is a cached rangelock entry used for the typical case of a
thread not doing recursive i/o. In this case, it is possible to avoid
a memory allocation on the i/o hot path by using the entry allocated
during thread creation. Since thread creation already allocates many
chunks of memory, and a typical thread performs far more i/o operations
than it is ever created, this shortens the i/o path.
If td_rlqe is already consumed, a new entry is allocated at lock time.
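For illustration, the unlock side of that per-thread cache could look
roughly like the following sketch (the helpers rangelock_dequeue() and
rlqentry_free() are assumed names, not taken from the patch):

void
rangelock_unlock(struct rangelock *lock, void *cookie, struct mtx *ilk)
{
	struct rl_q_entry *entry;
	struct thread *td;

	entry = cookie;
	td = curthread;

	/* Remove the range and recompute what can be granted (assumed helper). */
	rangelock_dequeue(lock, entry, ilk);

	/* Refill the per-thread cache instead of freeing, when possible. */
	if (td->td_rlqe == NULL)
		td->td_rlqe = entry;
	else
		rlqentry_free(entry);		/* assumed helper */
}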
>
> Do you assert somehow that the same thread is trying to write-lock the
> range it already read-locked?
No. With the latest patch it can indeed be done, because I started to
record range owners. For now, I added a TODO comment mentioning the
possible assert.
>
> --- a/sys/kern/subr_syscall.c
> +++ b/sys/kern/subr_syscall.c
> @@ -177,6 +177,12 @@ syscallret(struct thread *td, int error, struct syscall_args *sa __unused)
> KASSERT(td->td_locks == 0,
> ("System call %s returning with %d locks held",
> syscallname(p, sa->code), td->td_locks));
> + KASSERT((td->td_pflags & TDP_NOFAULTING) == 0,
> + ("System call %s returning with pagefaults disabled",
> + syscallname(p, sa->code)));
> + KASSERT((td->td_pflags & TDP_NOSLEEPING) == 0,
> + ("System call %s returning with sleep disabled",
> + syscallname(p, sa->code)));
>
> We can commit those separately, right?
Definitely. Rangelocks will be committed separately too. The vfs_vnops.c
change is self-contained.
>
> #ifdef MAC
> if ((ioflg & IO_NOMACCHECK) == 0) {
> - if (rw == UIO_READ)
> - error = mac_vnode_check_read(active_cred, file_cred,
> - vp);
> - else
> + if (rw == UIO_WRITE)
> error = mac_vnode_check_write(active_cred, file_cred,
> vp);
> }
>
> I don't see mac_vnode_check_read() call being moved anywhere.
> Was that intended?
No, thank you for pointing this out. mdf already noted it.
>
> + * The vn_io_fault() is a wrapper around vn_read() and vn_write() to
> + * prevent the following deadlock:
> + *
> + * Assume that the thread A reads from the vnode vp1 info userspace
>
> s/info/into/
>
> + * buffer buf1 backed by the pages of vnode vp2. If a page in buf1 is
> + * currently not resident, then system ends up with the call chain
> + * vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
> + * vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
> + * which establishes lock order vp1->vn_lock, then vp2->vn_lock.
> + * If, at the same time, thread B reads from vnode vp2 into buffer buf2
> + * backed by the pages of vnode vp1, and some page in buf2 is not
> + * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
> + *
> + * To prevent the lock order reversal and deadlock, vn_io_fault() does
> + * not allow page faults to happen during VOP_READ or VOP_WRITE().
>
> s/VOP_READ/VOP_READ()/
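To put the idea from that comment into code form, here is a very rough
sketch (hypothetical, not the actual patch: the usual vm/vfs headers are
assumed, IO_HOLD_CNT stands in for the io_hold_cnt tunable, and the
EFAULT fallback and the chunking of large buffers are omitted):

#define	IO_HOLD_CNT	16

static int
vn_io_fault_sketch(struct vnode *vp, struct uio *uio, int ioflag,
    struct ucred *cred)
{
	vm_page_t ma[IO_HOLD_CNT];
	vm_prot_t prot;
	int cnt, error, save;

	/* read(2) stores into the user buffer, write(2) only reads it. */
	prot = uio->uio_rw == UIO_READ ? VM_PROT_WRITE : VM_PROT_READ;

	/* Make the user pages resident and hold them ... */
	cnt = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    (vm_offset_t)uio->uio_iov->iov_base, uio->uio_iov->iov_len,
	    prot, ma, IO_HOLD_CNT);
	if (cnt == -1)
		return (EFAULT);

	/*
	 * ... so that no page fault, and thus no attempt to lock a second
	 * vnode, can happen while the VOP holds the vnode lock.
	 */
	save = vm_fault_disable_pagefaults();
	error = uio->uio_rw == UIO_READ ?
	    VOP_READ(vp, uio, ioflag, cred) :
	    VOP_WRITE(vp, uio, ioflag, cred);
	vm_fault_enable_pagefaults(save);

	vm_page_unhold_pages(ma, cnt);
	return (error);
}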
>
> + prot = VM_PROT_READ;
> + if ((fp->f_flag & O_APPEND) || !(flags & FOF_OFFSET))
> + /* For appenders, punt and lock the whole range. */
> + rl_cookie = vn_rangelock_wlock(vp, 0, (size_t)-1);
>
> On 32bit archs size_t is still 32bit, AFAIR and file size can be bigger
> than that. I think we should also use off_t for length (GEOM does that).
Agreed, but I have already moved to specifying (start, end) instead of
(start, len).
>
> + len = uio->uio_iov->iov_len;
> + addr = (vm_offset_t)uio->uio_iov->iov_base;
> + end = round_page(addr + len);
> + cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> + if (cnt > io_hold_cnt) {
> + len = io_hold_cnt * PAGE_SIZE;
> +again:
> + end = round_page(addr + len);
> + cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> + if (cnt > io_hold_cnt) {
> + len -= PAGE_SIZE;
> + goto again;
> + }
> + }
>
> This is a bit hard to understand immediately. Maybe something like the
> following is a bit more readable (I assume this goto could happen only
> once):
>
> len = MIN(uio->uio_iov->iov_len, io_hold_cnt * PAGE_SIZE);
> addr = (vm_offset_t)uio->uio_iov->iov_base;
> end = round_page(addr + len);
> if (howmany(end - trunc_page(addr), PAGE_SIZE) > io_hold_cnt) {
> len -= PAGE_SIZE;
> end = round_page(addr + len);
> }
I think it can happen twice, if both the start address and the length are
perfectly misaligned.
>
> +vn_truncate(struct file *fp, off_t length, struct ucred *active_cred,
> + struct thread *td)
> [...]
> + /*
> + * Lock the range where the shortening take place. Increase of
> + * file size does not need rangelock, but it is faster to lock
> + * the range then call VOP_GETATTR to get the current size and
> + * deal with races.
> + */
> + rl_cookie = vn_rangelock_wlock(vp, length, -1);
>
> Hmm. In case of increasing file size don't we need the range lock
> because we might be racing with write append?
We need to prevent a split i/o from being performed on disjoint ranges
of the file if a truncation happens between the parts of the split.
I changed the rangelock obtained in vn_truncate() to cover the whole range.
The bug appeared because I imported the rangelock code from the bigger
patch, which indeed handled the expansions.
Thank you for pointing this out.
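With the (start, end) interface mentioned above, the changed call in
vn_truncate() would be along the lines of (illustrative only):

	rl_cookie = vn_rangelock_wlock(vp, 0, OFF_MAX);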
>
> BTW. Comments in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c
> might be a useful read.
Thank you for the review.
http://people.freebsd.org/~kib/misc/vm1.8.patch