LOR between allproc <-> ufs

Konstantin Belousov kostikbel at gmail.com
Thu May 26 15:58:51 UTC 2016


On Thu, May 26, 2016 at 04:13:43PM +0200, lists at yamagi.org wrote:
> Hello,
> having experienced several hard 'freezes' (the system is still running,
> but all file system activity just hangs) on a FreeBSD 10.3-RELEASE
> machine, I've built a kernel with full debugging. Besides the well-known
> bufwait <-> dirhash LOR, I've seen this one between allproc <-> ufs.
> Sadly the output was somewhat garbled, as messages from LSI's upstream
> mps(4) driver interfered:
> 
> May 26 15:24:26 grobi kernel: 1st 0xffffffff81885b28 allproc (allproc) @ 
> /usr/src/sys/vm/vm_meter.c:122
> May 26 15:24:26 grobi kernel: 2nd 0xfffff8022c60b5f0 ufs (ufs) @ 
> /usr/src/sys/kern/vfs_subr.c:2364
> May 26 15:24:26 grobi kernel: KDB: stack backtrace:
> May 26 15:24:26 grobi kernel: db_trace_self_wrapper() at 
> db_trace_self_wrapper+0x2b/frame 0xfffffe104432b2d0
> May 26 15:24:26 grobi kernel: kdb_backtrace() at 
> kdb_backtrace+0x39/frame 0xfffffe104432b380
> May 26 15:24:26 grobi kernel: witness_checkorder() at 
> witness_checkorder+0xe24/frame 0xfffffe104432b410
> May 26 15:24:26 grobi kernel: __lockmgr_args() at 
> __lockmgr_args+0x9ea/frame 0xfffffe104432b550
> May 26 15:24:26 grobi kernel: ffs_lock() at ffs_lock+0x84/frame 
> 0xfffffe104432b5a0
> May 26 15:24:26 grobi kernel: VOP_LOCK1_APV() at 
> VOP_LOCK1_APV+0xfc/frame 0xfffffe104432b5d0
> May 26 15:24:26 grobi kernel: _vn_lock() at _vn_lock+0xaa/frame 
> 0xfffffe104432b640
> May 26 15:24:26 grobi kernel: vputx() at vputx+0x232/frame 
> 0xfffffe104432b6a0
> May 26 15:24:26 grobi kernel: vm_object_deallocate() at 
> vm_object_deallocate+0x5fb/frame 0xfffffe104432b700
> May 26 15:24:26 grobi kernel: vm_map_process_deferred() at 
> vm_map_process_deferred+0x89/frame 0xfffffe104432b730
> May 26 15:24:26 grobi kernel: vm_map_remove() at 
> vm_map_remove+0xc8/frame 0xfffffe104432b760
> May 26 15:24:26 grobi kernel: vmspace_free() at vmspace_free+0x51/frame 
> 0xfffffe104432b780
> May 26 15:24:26 grobi kernel: vmtotal() at vmtotal+0x2a5/frame 
> 0xfffffe104432b800
> May 26 15:24:26 grobi kernel: sysctl_root() at sysctl_root+0x1d2/frame 
> 0xfffffe104432b850
> May 26 15:24:26 grobi kernel: userland_sysctl() at 
> userland_sysctl+0x192/frame 0xfffffe104432b8f0
> May 26 15:24:26 grobi kernel: sys___sysctl() at sys___sysctl+0x74/frame 
> 0xfffffe104432b9a0
> May 26 15:24:26 grobi kernel: amd64_syscall() at 
> amd64_syscall+0x2c4/frame 0xfffffe104432bab0
> May 26 15:24:26 grobi kernel: Xfast_syscall() at 
> Xfast_syscall+0xfb/frame 0xfffffe104432bab0
> May 26 15:24:26 grobi kernel: --- syscall (202, FreeBSD ELF64, 
> sys___sysctl), rip = 0x802ef665a, rsp = 0x7fffffffe758, rbp = 
> 0x7fffffffe790 ---
> 
> Can anybody tell me if this is a known false positive? So far I have
> been unable to reproduce it. If I manage to do so, I'll send a better
> stack trace.

I do not think this is a false positive; the order there is indeed
allproc_lock->vnode lock.  The reversed order can occur in, say, a call
to mountcheckdirs() from unmount, where we own the mount point's root
vnode lock and then iterate over all processes in the system.
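
To make the reversal concrete, here is a minimal userspace analogue
(purely illustrative: pthread mutexes stand in for allproc_lock and a
vnode lock, and the two function names are made up):

	#include <pthread.h>

	/* Stand-ins for the two kernel locks. */
	static pthread_mutex_t allproc_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t vnode_lock = PTHREAD_MUTEX_INITIALIZER;

	/*
	 * vmtotal() path: allproc_lock is taken first, then
	 * vmspace_free() -> vputx() -> vn_lock() takes a vnode lock.
	 */
	static void
	sysctl_path(void)
	{
		pthread_mutex_lock(&allproc_lock);
		pthread_mutex_lock(&vnode_lock);
		pthread_mutex_unlock(&vnode_lock);
		pthread_mutex_unlock(&allproc_lock);
	}

	/*
	 * dounmount() -> mountcheckdirs() path: the covered root vnode
	 * is already locked when all processes are iterated under
	 * allproc_lock.
	 */
	static void
	unmount_path(void)
	{
		pthread_mutex_lock(&vnode_lock);
		pthread_mutex_lock(&allproc_lock);	/* reversed order */
		pthread_mutex_unlock(&allproc_lock);
		pthread_mutex_unlock(&vnode_lock);
	}

	int
	main(void)
	{
		/* One thread in each path can deadlock on the reversal. */
		sysctl_path();
		unmount_path();
		return (0);
	}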

I do not see an easy (or any) way to avoid that lock order on unmount.
IMO we should instead avoid freeing a reference on a vmspace while
owning the allproc_lock, and not unlock the user map to handle the
deferred freeing of map entries.

A completely untested patch is below.  I do not think that this LOR
can be reproduced at will.  E.g., you would first need to record the
order in witness by unmounting some filesystem mounted on a directory
of a UFS mount.  Then it must happen that the dropped reference was the
last reference on the vmspace, or that there was a deferred vnode entry.
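
Each hunk below applies the same pattern; roughly sketched (an
illustration of the idea only, not compilable as-is):

	sx_slock(&allproc_lock);
	FOREACH_PROC_IN_SYSTEM(p) {
		PROC_LOCK(p);
		if ((p->p_flag & P_WEXIT) != 0) {
			PROC_UNLOCK(p);	/* exiting, skip it */
			continue;
		}
		p->p_lock++;		/* hold: keeps p from being freed */
		PROC_UNLOCK(p);
		sx_sunlock(&allproc_lock);

		struct vmspace *vm = vmspace_acquire_ref(p);
		if (vm != NULL) {
			/* ... inspect the map ... */
			/* May lock a vnode; allproc_lock is not held. */
			vmspace_free(vm);
		}

		sx_slock(&allproc_lock);	/* hold kept p's linkage valid */
		PRELE(p);			/* drop the hold */
	}
	sx_sunlock(&allproc_lock);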

diff --git a/sys/vm/vm_glue.c b/sys/vm/vm_glue.c
index 83dc01a..c815c33 100644
--- a/sys/vm/vm_glue.c
+++ b/sys/vm/vm_glue.c
@@ -863,22 +863,32 @@ retry:
 		struct vmspace *vm;
 		int minslptime = 100000;
 		int slptime;
-		
+
+		PROC_LOCK(p);
 		/*
 		 * Watch out for a process in
 		 * creation.  It may have no
 		 * address space or lock yet.
 		 */
-		if (p->p_state == PRS_NEW)
+		if (p->p_state == PRS_NEW) {
+			PROC_UNLOCK(p);
 			continue;
+		}
 		/*
 		 * An aio daemon switches its
 		 * address space while running.
 		 * Perform a quick check whether
 		 * a process has P_SYSTEM.
+		 * Filter out exiting processes.
 		 */
-		if ((p->p_flag & P_SYSTEM) != 0)
+		if ((p->p_flag & (P_SYSTEM | P_WEXIT)) != 0) {
+			PROC_UNLOCK(p);
 			continue;
+		}
+		p->p_lock++;
+		PROC_UNLOCK(p);
+		sx_sunlock(&allproc_lock);
+
 		/*
 		 * Do not swapout a process that
 		 * is waiting for VM data
@@ -893,16 +903,15 @@ retry:
 		 */
 		vm = vmspace_acquire_ref(p);
 		if (vm == NULL)
-			continue;
+			goto nextproc2;
 		if (!vm_map_trylock(&vm->vm_map))
 			goto nextproc1;
 
 		PROC_LOCK(p);
-		if (p->p_lock != 0 ||
-		    (p->p_flag & (P_STOPPED_SINGLE|P_TRACED|P_SYSTEM|P_WEXIT)
-		    ) != 0) {
+		if (p->p_lock != 1 || (p->p_flag & (P_STOPPED_SINGLE |
+		    P_TRACED | P_SYSTEM)) != 0)
 			goto nextproc;
-		}
+
 		/*
 		 * only aiod changes vmspace, however it will be
 		 * skipped because of the if statement above checking 
@@ -977,12 +986,12 @@ retry:
 			if ((action & VM_SWAP_NORMAL) ||
 				((action & VM_SWAP_IDLE) &&
 				 (minslptime > swap_idle_threshold2))) {
+				_PRELE(p);
 				if (swapout(p) == 0)
 					didswap++;
 				PROC_UNLOCK(p);
 				vm_map_unlock(&vm->vm_map);
 				vmspace_free(vm);
-				sx_sunlock(&allproc_lock);
 				goto retry;
 			}
 		}
@@ -991,7 +1000,9 @@ nextproc:
 		vm_map_unlock(&vm->vm_map);
 nextproc1:
 		vmspace_free(vm);
-		continue;
+nextproc2:
+		sx_slock(&allproc_lock);
+		PRELE(p);
 	}
 	sx_sunlock(&allproc_lock);
 	/*
diff --git a/sys/vm/vm_meter.c b/sys/vm/vm_meter.c
index f1573a3..f1d8c6f 100644
--- a/sys/vm/vm_meter.c
+++ b/sys/vm/vm_meter.c
@@ -153,14 +153,22 @@ vmtotal(SYSCTL_HANDLER_ARGS)
 				thread_unlock(td);
 			}
 		}
+		if ((p->p_flag & P_WEXIT) != 0) {
+			PROC_UNLOCK(p);
+			continue;
+		}
+		p->p_lock++; /* avoid _PHOLD, no need for faultin() */
 		PROC_UNLOCK(p);
 		/*
 		 * Note active objects.
 		 */
 		paging = 0;
 		vm = vmspace_acquire_ref(p);
-		if (vm == NULL)
+		if (vm == NULL) {
+			PRELE(p);
 			continue;
+		}
+		sx_sunlock(&allproc_lock);
 		map = &vm->vm_map;
 		vm_map_lock_read(map);
 		for (entry = map->header.next;
@@ -177,6 +185,8 @@ vmtotal(SYSCTL_HANDLER_ARGS)
 		vmspace_free(vm);
 		if (paging)
 			total.t_pw++;
+		sx_slock(&allproc_lock);
+		PRELE(p);
 	}
 	sx_sunlock(&allproc_lock);
 	/*
diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ffdcb61..29d4155 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1484,19 +1484,21 @@ vm_pageout_oom(int shortage)
 			PROC_UNLOCK(p);
 			continue;
 		}
-		_PHOLD(p);
+		p->p_lock++;
+		PROC_UNLOCK(p);
+		sx_sunlock(&allproc_lock);
 		if (!vm_map_trylock_read(&vm->vm_map)) {
-			_PRELE(p);
-			PROC_UNLOCK(p);
 			vmspace_free(vm);
+			sx_slock(&allproc_lock);
+			PRELE(p);
 			continue;
 		}
-		PROC_UNLOCK(p);
 		size = vmspace_swap_count(vm);
 		if (shortage == VM_OOM_MEM)
 			size += vm_pageout_oom_pagecount(vm);
 		vm_map_unlock_read(&vm->vm_map);
 		vmspace_free(vm);
+		sx_slock(&allproc_lock);
 
 		/*
 		 * If this process is bigger than the biggest one,
@@ -1794,9 +1796,13 @@ again:
 			if ((p->p_flag & P_INMEM) == 0)
 				limit = 0;	/* XXX */
 			vm = vmspace_acquire_ref(p);
+			p->p_lock++;
 			PROC_UNLOCK(p);
-			if (vm == NULL)
+			if (vm == NULL) {
+				PRELE(p);
 				continue;
+			}
+			sx_sunlock(&allproc_lock);
 
 			size = vmspace_resident_count(vm);
 			if (size >= limit) {
@@ -1841,6 +1847,8 @@ again:
 			}
 #endif
 			vmspace_free(vm);
+			sx_slock(&allproc_lock);
+			PRELE(p);
 		}
 		sx_sunlock(&allproc_lock);
 		if (tryagain != 0 && attempts <= 10)


