LOR between allproc <-> ufs
Konstantin Belousov
kostikbel at gmail.com
Thu May 26 15:58:51 UTC 2016
On Thu, May 26, 2016 at 04:13:43PM +0200, lists at yamagi.org wrote:
> Hello,
> having experienced several hard 'freezes' (the system is still running,
> but all file system activity just hangs) on a FreeBSD 10.3-RELEASE
> machine, I've built a kernel with full debugging. Besides the well-known
> bufwait <-> dirhash LOR, I've seen this one between allproc <-> ufs.
> Sadly the output is somewhat garbled; messages from LSI's upstream
> mps(4) driver interfered:
>
> May 26 15:24:26 grobi kernel: 1st 0xffffffff81885b28 allproc (allproc) @
> /usr/src/sys/vm/vm_meter.c:122
> May 26 15:24:26 grobi kernel: 2nd 0xfffff8022c60b5f0 ufs (ufs) @
> /usr/src/sys/kern/vfs_subr.c:2364
> May 26 15:24:26 grobi kernel: KDB: stack backtrace:
> [repeated "xpt_release_simq: requested 1 > present 0" messages from
> the mps(4) driver, interleaved with the next three frames, elided]
> May 26 15:24:26 grobi kernel: db_trace_self_wrapper() at
> db_trace_self_wrapper+0x2b/frame 0xfffffe104432b2d0
> May 26 15:24:26 grobi kernel: kdb_backtrace() at
> kdb_backtrace+0x39/frame 0xfffffe104432b380
> May 26 15:24:26 grobi kernel: witness_checkorder() at
> witness_checkorder+0xe24/frame 0xfffffe104432b410
> May 26 15:24:26 grobi kernel: __lockmgr_args() at
> __lockmgr_args+0x9ea/frame 0xfffffe104432b550
> May 26 15:24:26 grobi kernel: ffs_lock() at ffs_lock+0x84/frame
> 0xfffffe104432b5a0
> May 26 15:24:26 grobi kernel: VOP_LOCK1_APV() at
> VOP_LOCK1_APV+0xfc/frame 0xfffffe104432b5d0
> May 26 15:24:26 grobi kernel: _vn_lock() at _vn_lock+0xaa/frame
> 0xfffffe104432b640
> May 26 15:24:26 grobi kernel: vputx() at vputx+0x232/frame
> 0xfffffe104432b6a0
> May 26 15:24:26 grobi kernel: vm_object_deallocate() at
> vm_object_deallocate+0x5fb/frame 0xfffffe104432b700
> May 26 15:24:26 grobi kernel: vm_map_process_deferred() at
> vm_map_process_deferred+0x89/frame 0xfffffe104432b730
> May 26 15:24:26 grobi kernel: vm_map_remove() at
> vm_map_remove+0xc8/frame 0xfffffe104432b760
> May 26 15:24:26 grobi kernel: vmspace_free() at vmspace_free+0x51/frame
> 0xfffffe104432b780
> May 26 15:24:26 grobi kernel: vmtotal() at vmtotal+0x2a5/frame
> 0xfffffe104432b800
> May 26 15:24:26 grobi kernel: sysctl_root() at sysctl_root+0x1d2/frame
> 0xfffffe104432b850
> May 26 15:24:26 grobi kernel: userland_sysctl() at
> userland_sysctl+0x192/frame 0xfffffe104432b8f0
> May 26 15:24:26 grobi kernel: sys___sysctl() at sys___sysctl+0x74/frame
> 0xfffffe104432b9a0
> May 26 15:24:26 grobi kernel: amd64_syscall() at
> amd64_syscall+0x2c4/frame 0xfffffe104432bab0
> May 26 15:24:26 grobi kernel: Xfast_syscall() at
> Xfast_syscall+0xfb/frame 0xfffffe104432bab0
> May 26 15:24:26 grobi kernel: --- syscall (202, FreeBSD ELF64,
> sys___sysctl), rip = 0x802ef665a, rsp = 0x7fffffffe758, rbp =
> 0x7fffffffe790 ---
>
> Can anybody tell me if this is a known false positive? So far I
> was unable to reproduce it. If I manage to do so, I'll send a
> better stacktrace.
I do not think this is a false positive; the order there is indeed
allproc_lock->vnode lock. The reversed order can occur in, say, a call
to mountcheckdirs() from unmount, where we own the mp root vnode lock
and then iterate over all processes in the system.
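
For illustration, the two orders look like this (a sketch only; the
frame names for the sysctl path are taken from the backtrace above,
and the unmount path follows the mountcheckdirs() description):

    /* sysctl vm.vmtotal path, as in the backtrace */
    vmtotal()
        sx_slock(&allproc_lock)             /* 1st: allproc */
        vmspace_free(vm)                    /* drops the last reference */
            vm_map_remove()
                vm_map_process_deferred()
                    vm_object_deallocate()
                        vputx()
                            vn_lock(vp)     /* 2nd: ufs vnode lock */

    /* unmount path, which records the opposite order */
    dounmount()
        vn_lock(rootvp)                     /* 1st: ufs vnode lock */
        mountcheckdirs()
            sx_slock(&allproc_lock)         /* 2nd: allproc */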
I do not see an easy (or any) way to avoid this lock order on unmount.
IMO we should instead avoid freeing a reference on the vmspace while
owning the allproc_lock, and not unlock the user map due to the
deferred free of map entries.
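
In other words, each loop over allproc should pin the process, drop
the allproc_lock around the potentially vnode-locking vmspace_free(),
and relock afterwards. A minimal sketch of the pattern the patch
applies (taken from the vm_meter.c hunk below; the surrounding loop
and error details are elided):

    PROC_LOCK(p);
    if ((p->p_flag & P_WEXIT) != 0) {   /* skip exiting processes */
            PROC_UNLOCK(p);
            continue;
    }
    p->p_lock++;                /* hold the proc; avoid _PHOLD, no
                                   need for faultin() */
    PROC_UNLOCK(p);
    vm = vmspace_acquire_ref(p);
    if (vm == NULL) {
            PRELE(p);
            continue;
    }
    sx_sunlock(&allproc_lock);  /* do not own allproc across the free */
    /* ... inspect the vmspace ... */
    vmspace_free(vm);           /* may drop the last reference and
                                   lock vnodes; allproc is not held */
    sx_slock(&allproc_lock);    /* relock to continue the iteration */
    PRELE(p);                   /* drop the hold */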
A completely untested patch is below. I do not think that this LOR
can be reproduced at will. E.g., you need to record the order in
witness by unmounting some filesystem mounted on a directory on a UFS
mount. Then, by chance, it must be the last reference on the vmspace
that is dropped, or there must be a deferred vnode entry.
diff --git a/sys/vm/vm_glue.c b/sys/vm/vm_glue.c
index 83dc01a..c815c33 100644
--- a/sys/vm/vm_glue.c
+++ b/sys/vm/vm_glue.c
@@ -863,22 +863,32 @@ retry:
struct vmspace *vm;
int minslptime = 100000;
int slptime;
-
+
+ PROC_LOCK(p);
/*
* Watch out for a process in
* creation. It may have no
* address space or lock yet.
*/
- if (p->p_state == PRS_NEW)
+ if (p->p_state == PRS_NEW) {
+ PROC_UNLOCK(p);
continue;
+ }
/*
* An aio daemon switches its
* address space while running.
* Perform a quick check whether
* a process has P_SYSTEM.
+ * Filter out exiting processes.
*/
- if ((p->p_flag & P_SYSTEM) != 0)
+ if ((p->p_flag & (P_SYSTEM | P_WEXIT)) != 0) {
+ PROC_UNLOCK(p);
continue;
+ }
+ p->p_lock++;
+ PROC_UNLOCK(p);
+ sx_sunlock(&allproc_lock);
+
/*
* Do not swapout a process that
* is waiting for VM data
@@ -893,16 +903,15 @@ retry:
*/
vm = vmspace_acquire_ref(p);
if (vm == NULL)
- continue;
+ goto nextproc2;
if (!vm_map_trylock(&vm->vm_map))
goto nextproc1;
PROC_LOCK(p);
- if (p->p_lock != 0 ||
- (p->p_flag & (P_STOPPED_SINGLE|P_TRACED|P_SYSTEM|P_WEXIT)
- ) != 0) {
+ if (p->p_lock != 1 || (p->p_flag & (P_STOPPED_SINGLE |
+ P_TRACED | P_SYSTEM)) != 0)
goto nextproc;
- }
+
/*
* only aiod changes vmspace, however it will be
* skipped because of the if statement above checking
@@ -977,12 +986,13 @@ retry:
if ((action & VM_SWAP_NORMAL) ||
((action & VM_SWAP_IDLE) &&
(minslptime > swap_idle_threshold2))) {
+ _PRELE(p);
if (swapout(p) == 0)
didswap++;
PROC_UNLOCK(p);
+ sx_sunlock(&allproc_lock);
vm_map_unlock(&vm->vm_map);
vmspace_free(vm);
- sx_sunlock(&allproc_lock);
goto retry;
}
}
@@ -991,7 +1001,9 @@ nextproc:
vm_map_unlock(&vm->vm_map);
nextproc1:
vmspace_free(vm);
- continue;
+nextproc2:
+ sx_slock(&allproc_lock);
+ PRELE(p);
}
sx_sunlock(&allproc_lock);
/*
diff --git a/sys/vm/vm_meter.c b/sys/vm/vm_meter.c
index f1573a3..f1d8c6f 100644
--- a/sys/vm/vm_meter.c
+++ b/sys/vm/vm_meter.c
@@ -153,14 +153,20 @@ vmtotal(SYSCTL_HANDLER_ARGS)
thread_unlock(td);
}
}
+ if ((p->p_flag & P_WEXIT) != 0)
+ continue;
+ p->p_lock++; /* avoid _PHOLD, no need for faultin() */
PROC_UNLOCK(p);
/*
* Note active objects.
*/
paging = 0;
vm = vmspace_acquire_ref(p);
- if (vm == NULL)
+ if (vm == NULL) {
+ PRELE(p);
continue;
+ }
+ sx_sunlock(&allproc_lock);
map = &vm->vm_map;
vm_map_lock_read(map);
for (entry = map->header.next;
@@ -177,6 +183,8 @@ vmtotal(SYSCTL_HANDLER_ARGS)
vmspace_free(vm);
if (paging)
total.t_pw++;
+ sx_slock(&allproc_lock);
+ PRELE(p);
}
sx_sunlock(&allproc_lock);
/*
diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index ffdcb61..29d4155 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -1484,19 +1484,21 @@ vm_pageout_oom(int shortage)
PROC_UNLOCK(p);
continue;
}
- _PHOLD(p);
+ p->p_lock++;
+ PROC_UNLOCK(p);
+ sx_sunlock(&allproc_lock);
if (!vm_map_trylock_read(&vm->vm_map)) {
- _PRELE(p);
- PROC_UNLOCK(p);
vmspace_free(vm);
+ sx_slock(&allproc_lock);
+ PRELE(p);
continue;
}
- PROC_UNLOCK(p);
size = vmspace_swap_count(vm);
if (shortage == VM_OOM_MEM)
size += vm_pageout_oom_pagecount(vm);
vm_map_unlock_read(&vm->vm_map);
vmspace_free(vm);
+ sx_slock(&allproc_lock);
/*
* If this process is bigger than the biggest one,
@@ -1794,9 +1796,13 @@ again:
if ((p->p_flag & P_INMEM) == 0)
limit = 0; /* XXX */
vm = vmspace_acquire_ref(p);
+ p->p_lock++;
PROC_UNLOCK(p);
- if (vm == NULL)
+ if (vm == NULL) {
+ PRELE(p);
continue;
+ }
+ sx_sunlock(&allproc_lock);
size = vmspace_resident_count(vm);
if (size >= limit) {
@@ -1841,6 +1847,8 @@ again:
}
#endif
vmspace_free(vm);
+ sx_slock(&allproc_lock);
+ PRELE(p);
}
sx_sunlock(&allproc_lock);
if (tryagain != 0 && attempts <= 10)