FreeBSD 11.0 RELEASE - ZFS deadlock

Andriy Gapon avg at FreeBSD.org
Fri Nov 18 12:31:24 UTC 2016


On 14/11/2016 14:00, Henri Hennebert wrote:
> On 11/14/2016 12:45, Andriy Gapon wrote:
>> Okay.  Luckily for us, it seems that 'm' is available in frame 5.  It also
>> happens to be the first field of 'struct faultstate'.  So, could you please go
>> to that frame and print '*m' and '*(struct faultstate *)m'?
>>
> (kgdb) fr 4
> #4  0xffffffff8089d1c1 in vm_page_busy_sleep (m=0xfffff800df68cd40, wmesg=<value
> optimized out>) at /usr/src/sys/vm/vm_page.c:753
> 753        msleep(m, vm_page_lockptr(m), PVM | PDROP, wmesg, 0);
> (kgdb) print *m
> $1 = {plinks = {q = {tqe_next = 0xfffff800dc5d85b0, tqe_prev =
> 0xfffff800debf3bd0}, s = {ss = {sle_next = 0xfffff800dc5d85b0},
>       pv = 0xfffff800debf3bd0}, memguard = {p = 18446735281313646000, v =
> 18446735281353604048}}, listq = {tqe_next = 0x0,
>     tqe_prev = 0xfffff800dc5d85c0}, object = 0xfffff800b62e9c60, pindex = 11,
> phys_addr = 3389358080, md = {pv_list = {
>       tqh_first = 0x0, tqh_last = 0xfffff800df68cd78}, pv_gen = 426, pat_mode =
> 6}, wire_count = 0, busy_lock = 6, hold_count = 0,
>   flags = 0, aflags = 2 '\002', oflags = 0 '\0', queue = 0 '\0', psind = 0 '\0',
> segind = 3 '\003', order = 13 '\r',
>   pool = 0 '\0', act_count = 0 '\0', valid = 0 '\0', dirty = 0 '\0'}

If I interpret this correctly, the page is in the 'exclusive busy' state
(busy_lock = 6).
Unfortunately, I can't tell much beyond that.
But I am confident that this is the root cause of the lock-up.
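
As a rough illustration of how the busy_lock value above can be read, here is
a minimal sketch in C.  The bit definitions are reproduced from memory of
sys/vm/vm_page.h in FreeBSD 11 and are an assumption to be checked against the
actual source tree; the helper function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * ASSUMPTION: bit layout of vm_page's busy_lock as recalled from
 * sys/vm/vm_page.h in FreeBSD 11; check it against your source tree.
 */
#define VPB_BIT_SHARED     0x01u
#define VPB_BIT_EXCLUSIVE  0x02u
#define VPB_BIT_WAITERS    0x04u
#define VPB_BIT_FLAGMASK   (VPB_BIT_SHARED | VPB_BIT_EXCLUSIVE | VPB_BIT_WAITERS)
#define VPB_SHARERS_SHIFT  3
#define VPB_SHARERS(x)     (((x) & ~VPB_BIT_FLAGMASK) >> VPB_SHARERS_SHIFT)

/* Hypothetical helpers, not kernel functions. */
bool busy_is_exclusive(unsigned int v) { return (v & VPB_BIT_EXCLUSIVE) != 0; }
bool busy_has_waiters(unsigned int v)  { return (v & VPB_BIT_WAITERS) != 0; }

unsigned int busy_sharers(unsigned int v)
{
    /* The sharer count is only meaningful when the shared bit is set. */
    return (v & VPB_BIT_SHARED) ? VPB_SHARERS(v) : 0;
}

/* Print a one-line decoding of a busy_lock value taken from kgdb. */
void describe_busy_lock(unsigned int v)
{
    printf("busy_lock=%u: exclusive=%d waiters=%d sharers=%u\n",
        v, busy_is_exclusive(v), busy_has_waiters(v), busy_sharers(v));
}
```

Under these assumptions, describe_busy_lock(6) reports the exclusive and
waiters bits set and no sharers, which is consistent with the 'exclusive busy'
reading and with some thread having gone to sleep waiting for the page.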

> (kgdb) print *(struct faultstate *)m
> $2 = {m = 0xfffff800dc5d85b0, object = 0xfffff800debf3bd0, pindex = 0, first_m =
> 0xfffff800dc5d85c0,
>   first_object = 0xfffff800b62e9c60, first_pindex = 11, map = 0xca058000, entry
> = 0x0, lookup_still_valid = -546779784,
>   vp = 0x6000001aa}
> (kgdb)

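I was wrong on this one: 'm' is itself a pointer to the page, not the address
of the 'm' field inside 'struct faultstate', so the cast above merely
reinterprets the page's memory and its output is not meaningful.  Maybe
'info reg' in frame 5 would give a clue about the value of 'fs'.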
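
The difference between the address of a first field and the value stored in it
can be sketched in plain C.  The two structures below are hypothetical,
cut-down stand-ins for the kernel's page and fault state structures, and the
function names are invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the kernel's structures are far larger. */
struct page {
    unsigned long pindex;
};

struct state {                  /* plays the role of 'struct faultstate' */
    struct page *m;             /* first field: a pointer to the page */
    unsigned long first_pindex;
};

/*
 * Valid recovery: C guarantees that a pointer to the first member of a
 * struct, suitably converted, points to the struct itself.  This needs the
 * ADDRESS of the field 'm', not the value stored in it.
 */
struct state *state_from_field_address(struct page **mp)
{
    return (struct state *)mp;
}

/*
 * The value stored in 'm' points at the page, a different object, so
 * casting that value to 'struct state *' would reinterpret page memory.
 */
bool field_value_points_elsewhere(const struct state *fs)
{
    return (const void *)fs->m != (const void *)fs;
}
```

In the kgdb session only the value of 'm' was available in that frame, which
corresponds to the second case: the cast has nothing to recover the enclosing
structure from.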

I am not sure how to proceed from here.
The only thing I can think of is a lock order reversal between the vnode lock
and the page busying quasi-lock.  But examining the code, I cannot spot it.
Another possibility is a leak of a busy page, but that's hard to debug.
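
To make the first hypothesis concrete: a lock order reversal would mean one
thread takes the vnode lock and then waits for the busy page, while another
busies the page and then waits for the vnode lock.  The toy checker below only
records acquisition orders and flags the reversed pair.  It is a hypothetical
illustration with invented names, not how the kernel tracks these locks.

```c
#include <stdbool.h>

enum lock_id { LOCK_VNODE, LOCK_PAGE_BUSY, LOCK_COUNT };

/* order_seen[a][b] is true once 'b' has been acquired while 'a' was held. */
static bool order_seen[LOCK_COUNT][LOCK_COUNT];

/*
 * Record that 'next' is being acquired while 'held' is held.
 * Returns true if the opposite order was recorded earlier, i.e. a
 * potential lock order reversal between the two.
 */
bool record_acquire(enum lock_id held, enum lock_id next)
{
    if (held == next)
        return false;           /* recursion is not an order reversal */
    order_seen[held][next] = true;
    return order_seen[next][held];
}
```

Recording vnode-then-page followed by page-then-vnode makes the second call
return true, which is the pattern that would have to exist somewhere in the
fault and I/O paths for this hypothesis to hold.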

How hard is it to reproduce the problem?

Maybe Konstantin would have some ideas or suggestions.

-- 
Andriy Gapon


More information about the freebsd-stable mailing list