FreeBSD 11.0-RELEASE - ZFS deadlock
Henri Hennebert
hlh at restart.be
Fri Nov 18 13:16:02 UTC 2016
On 11/18/2016 13:30, Andriy Gapon wrote:
> On 14/11/2016 14:00, Henri Hennebert wrote:
>> On 11/14/2016 12:45, Andriy Gapon wrote:
>>> Okay. Luckily for us, it seems that 'm' is available in frame 5. It also
>>> happens to be the first field of 'struct faultstate'. So, could you please go
>>> to that frame and print '*m' and '*(struct faultstate *)m'?
>>>
>> (kgdb) fr 4
>> #4 0xffffffff8089d1c1 in vm_page_busy_sleep (m=0xfffff800df68cd40, wmesg=<value optimized out>) at /usr/src/sys/vm/vm_page.c:753
>> 753             msleep(m, vm_page_lockptr(m), PVM | PDROP, wmesg, 0);
>> (kgdb) print *m
>> $1 = {plinks = {q = {tqe_next = 0xfffff800dc5d85b0, tqe_prev = 0xfffff800debf3bd0},
>>       s = {ss = {sle_next = 0xfffff800dc5d85b0}, pv = 0xfffff800debf3bd0},
>>       memguard = {p = 18446735281313646000, v = 18446735281353604048}},
>>     listq = {tqe_next = 0x0, tqe_prev = 0xfffff800dc5d85c0},
>>     object = 0xfffff800b62e9c60, pindex = 11, phys_addr = 3389358080,
>>     md = {pv_list = {tqh_first = 0x0, tqh_last = 0xfffff800df68cd78},
>>       pv_gen = 426, pat_mode = 6},
>>     wire_count = 0, busy_lock = 6, hold_count = 0, flags = 0,
>>     aflags = 2 '\002', oflags = 0 '\0', queue = 0 '\0', psind = 0 '\0',
>>     segind = 3 '\003', order = 13 '\r', pool = 0 '\0', act_count = 0 '\0',
>>     valid = 0 '\0', dirty = 0 '\0'}
>
> If I interpret this correctly the page is in the 'exclusive busy' state.
> Unfortunately, I can't tell much beyond that.
> But I am confident that this is the root cause of the lock-up.
>
>> (kgdb) print *(struct faultstate *)m
>> $2 = {m = 0xfffff800dc5d85b0, object = 0xfffff800debf3bd0, pindex = 0,
>>     first_m = 0xfffff800dc5d85c0, first_object = 0xfffff800b62e9c60,
>>     first_pindex = 11, map = 0xca058000, entry = 0x0,
>>     lookup_still_valid = -546779784, vp = 0x6000001aa}
>> (kgdb)
>
> I was wrong on this one as 'm' is actually a pointer, so the above is not
> correct. Maybe 'info reg' in frame 5 would give a clue about the value of 'fs'.
(kgdb) fr 5
#5 0xffffffff8089dd4d in vm_page_sleep_if_busy (m=0xfffff800df68cd40, msg=0xffffffff809c51bc "vmpfw") at /usr/src/sys/vm/vm_page.c:1086
1086 vm_page_busy_sleep(m, msg);
(kgdb) info reg
rax 0x0 0
rbx 0xfffff800b62e9c78 -8793036514184
rcx 0x0 0
rdx 0x0 0
rsi 0x0 0
rdi 0x0 0
rbp 0xfffffe0101836810 0xfffffe0101836810
rsp 0xfffffe01018367e0 0xfffffe01018367e0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0xfffff800b642aa00 -8793035200000
r13 0xfffff800df68cd40 -8792344834752
r14 0xfffff800b62e9c60 -8793036514208
r15 0xffffffff809c51bc -2137239108
rip 0xffffffff8089dd4d 0xffffffff8089dd4d <vm_page_sleep_if_busy+285>
eflags 0x0 0
cs 0x0 0
ss 0x0 0
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
I don't know what to do from here.
>
> I am not sure how to proceed from here.
> The only thing I can think of is a lock order reversal between the vnode lock
> and the page busying quasi-lock. But examining the code I can not spot it.
> Another possibility is a leak of a busy page, but that's hard to debug.
>
> How hard is it to reproduce the problem?
After 7 days all seems normal; only one copy of innd is running:
[root at avoriaz ~]# ps xa|grep inn
1193 - Is 0:01.40 /usr/local/news/bin/innd -r
13498 - IN 0:00.01 /usr/local/news/bin/innfeed
1194 v0- IW 0:00.00 /bin/sh /usr/local/news/bin/innwatch -i 60
I will try to stop and restart innd.
Everything continues to look good after the restart:
[root at avoriaz ~]# ps xa|grep inn
31673 - Ss 0:00.02 /usr/local/news/bin/innd
31694 - SN 0:00.01 /usr/local/news/bin/innfeed
31674 0 S 0:00.01 /bin/sh /usr/local/news/bin/innwatch -i 60
I think the only way to reproduce it is to wait for it to occur by itself...
One thing to note: the deadlock has occurred at least 5 times since 10.0-RELEASE, and
always involving the directory /usr/local/news/bin
>
> Maybe Konstantin would have some ideas or suggestions.
>
Henri
More information about the freebsd-stable mailing list