vlrureclaim() race condition patch review request

Sun Aug 21 04:17:25 GMT 2005

vfs_subr.c 1.636 removed LK_NOWAIT from a VOP_LOCK() call in
vlrureclaim(), which worsed an existing race condition in this code.
This allowed the VI_DOOMED flag to be set by another thread after this
thread had checked VI_DOOMED and was waiting in VOP_LOCKED(), allowing
the mutex double unlock panic to be triggered as a side effect.  The
mutex double unlock was fixed in vfs_subr.c 1.639, but the race
condition itself was not fixed.

According to the commit log, LK_NOWAIT was removed so that this thread
might occasionally pause and allow other threads to run.  Even with the
removal of LK_NOWAIT, there is no guarantee that other threads won't be
starved, because there may be an arbitrarily large number of vnodes that
are processed before calling VOP_LOCK(), and there is no guarantee that
any of the VOP_LOCK() calls will block and allow other threads to
proceed.   Conversely, this thread might block for an arbitrarily long
period of time in VOP_LOCK(), which would interfere with vnode
recycling.  A more deterministic solution is to explicitly yield to
other threads on a periodic basis.

There was still a small race condition in the original code.  Even if
VOP_LOCK() does not block because it is called with LK_NOWAIT,
VOP_LOCK() still drops the vnode interlock, possibly allowing another
thread to invalidate the vnode status after it was previously checked.
At present, the only fix for this to relock the vnode interlock and
recheck the vnode status after the VOP_LOCK() call.

I suspect that it would be safe to hold the vnode interlock across the
call VOP_LOCK() call if VOP_LOCK() was called with the LK_NOWAIT flag
but not the LK_INTERLOCK flag, since there should be no danger of
deadlock.  This would avoid the need to relock the vnode interlock and
repeat the check of its state, but the vnode lock assertions complain
about this.

I'd appreciate a review of this patch.  I circulated an earlier version
of this patch, but a potential vnode lock leak was missed, and I had
also not done anything to compensate for the re-addition of LK_NOWAIT.

I'm not especially happy with all the goto nonsense at the end of the
loop.  It is an attempt to optimize out extra mutex operations without
duplicating a lot of code.  If anyone has suggestions on a better way to
write this ...


Index: sys/kern/vfs_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.640
diff -u -r1.640 vfs_subr.c

--- sys/kern/vfs_subr.c	13 Aug 2005 20:07:50 -0000	1.640
+++ sys/kern/vfs_subr.c	21 Aug 2005 01:31:32 -0000
@@ -570,29 +570,59 @@
 		TAILQ_INSERT_TAIL(&mp->mnt_nvnodelist, vp, v_nmntvnodes);
 		--count;
 		if (!VI_TRYLOCK(vp))
-			continue;
+			goto next_iter;
 		/*
 		 * If it's been deconstructed already, it's still
 		 * referenced, or it exceeds the trigger, skip it.
 		 */
-		if ((vp->v_iflag & VI_DOOMED) != 0 || vp->v_usecount ||
-		    !LIST_EMPTY(&(vp)->v_cache_src) || (vp->v_object != NULL && 
+		if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) ||
+		    (vp->v_iflag & VI_DOOMED) != 0 || (vp->v_object != NULL &&
 		    vp->v_object->resident_page_count > trigger)) {
 			VI_UNLOCK(vp);
-			continue;
+			goto next_iter;
 		}
 		MNT_IUNLOCK(mp);
 		vholdl(vp);
-		if (VOP_LOCK(vp, LK_INTERLOCK|LK_EXCLUSIVE, td)) {
+		if (VOP_LOCK(vp, LK_INTERLOCK|LK_EXCLUSIVE|LK_NOWAIT, td)) {
 			vdrop(vp);
-			MNT_ILOCK(mp);
-			continue;
+			goto next_iter_mntunlocked;
 		}
 		VI_LOCK(vp);
+		/*
+		 * v_usecount may have been bumped after VOP_LOCK() dropped
+		 * the vnode interlock and before it was locked again.
+		 *
+		 * It is not necessary to recheck VI_DOOMED because it can
+		 * only be set by another thread that holds both the vnode
+		 * lock and vnode interlock.  If another thread has the
+		 * vnode lock before we get to VOP_LOCK() and obtains the
+		 * vnode interlock after VOP_LOCK() drops the vnode
+		 * interlock, the other thread will be unable to drop the
+		 * vnode lock before our VOP_LOCK() call fails.
+		 */
+		if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) ||
+		    (vp->v_object != NULL && 
+		    vp->v_object->resident_page_count > trigger)) {
+			VOP_UNLOCK(vp, LK_INTERLOCK, td);
+			goto next_iter_mntunlocked;
+		}
+		KASSERT((vp->v_iflag & VI_DOOMED) == 0,
+		    ("VI_DOOMED unexpectedly detected in vlrureclaim()"));
 		vgonel(vp);
 		VOP_UNLOCK(vp, 0, td);
 		vdropl(vp);
 		done++;
+next_iter_mntunlocked:
+		if ((count % 256) != 0)
+			goto relock_mnt;
+		goto yield;
+next_iter:
+		if ((count % 256) != 0)
+			continue;
+		MNT_IUNLOCK(mp);
+yield:
+		uio_yield();
+relock_mnt:
 		MNT_ILOCK(mp);
 	}
 	MNT_IUNLOCK(mp);