Freebsd 11.0 RELEASE - ZFS deadlock

Henri Hennebert hlh at restart.be
Thu Nov 10 19:42:01 UTC 2016


On 11/10/2016 19:40, Andriy Gapon wrote:
> On 10/11/2016 19:55, Henri Hennebert wrote:
>>
>>
>> On 11/10/2016 18:33, Andriy Gapon wrote:
>>> On 10/11/2016 18:12, Henri Hennebert wrote:
>>>> On 11/10/2016 16:54, Andriy Gapon wrote:
>>>>> On 10/11/2016 17:20, Henri Hennebert wrote:
>>>>>> On 11/10/2016 15:00, Andriy Gapon wrote:
>>>>>>> Interesting.  I can not spot any suspicious thread that would hold the vnode
>>>>>>> lock.  Could you please run kgdb (just like that, no arguments), then execute
>>>>>>> 'bt' command and then select a frame when _vn_lock is called with 'fr N'
>>>>>>> command.  Then please 'print *vp' and share the result.
>>>>>>>
>>>>>> I Think I miss something in your request:
>>>>>
>>>>> Oh, sorry!  The very first step should be 'tid 101112' to switch to the correct
>>>>> context.
>>>>>
>>>>
>>>> (kgdb) fr 7
>>>> #7  0xffffffff8063c5b3 in _vn_lock (vp=<value optimized out>, flags=2121728,
>>>
>>> "value optimized out" - not good
>>>
>>>> file=<value optimized out>,
>>>>     line=<value optimized out>) at vnode_if.h:859
>>>> 859    vnode_if.h: No such file or directory.
>>>>     in vnode_if.h
>>>> (kgdb) print *vp
>>>
>>> I am not sure if this output is valid, because of the message above.
>>> Could you please try to navigate to nearby frames and see if vp itself has a
>>> valid value there.  If you can find such a frame please do *vp  there.
>>>
>>
>> Does this seems better?
>
> Yes!
>
>> (kgdb) fr 8
>> #8  0xffffffff8062a5f7 in vget (vp=0xfffff80049c2c000, flags=2121728,
>> td=0xfffff80009ba0500) at /usr/src/sys/kern/vfs_subr.c:2523
>> 2523        if ((error = vn_lock(vp, flags)) != 0) {
>> (kgdb) print *vp
>> $1 = {v_tag = 0xffffffff813be535 "zfs", v_op = 0xffffffff813d0f70, v_data =
>> 0xfffff80049c1f420, v_mount = 0xfffff800093aa660,
>>   v_nmntvnodes = {tqe_next = 0xfffff80049c2c938, tqe_prev = 0xfffff80049c2bb30},
>> v_un = {vu_mount = 0x0, vu_socket = 0x0,
>>     vu_cdev = 0x0, vu_fifoinfo = 0x0}, v_hashlist = {le_next = 0x0, le_prev =
>> 0x0}, v_cache_src = {lh_first = 0x0}, v_cache_dst = {
>>     tqh_first = 0xfffff800bfc8e3f0, tqh_last = 0xfffff800bfc8e410}, v_cache_dd =
>> 0x0, v_lock = {lock_object = {
>>       lo_name = 0xffffffff813be535 "zfs", lo_flags = 117112832, lo_data = 0,
>> lo_witness = 0x0}, lk_lock = 23, lk_exslpfail = 0,
>>     lk_timo = 51, lk_pri = 96}, v_interlock = {lock_object = {lo_name =
>> 0xffffffff8099e9e0 "vnode interlock", lo_flags = 16973824,
>>       lo_data = 0, lo_witness = 0x0}, mtx_lock = 4}, v_vnlock =
>> 0xfffff80049c2c068, v_actfreelist = {
>>     tqe_next = 0xfffff80049c2c938, tqe_prev = 0xfffff80049ae9bd0}, v_bufobj =
>> {bo_lock = {lock_object = {
>>         lo_name = 0xffffffff8099e9f0 "bufobj interlock", lo_flags = 86179840,
>> lo_data = 0, lo_witness = 0x0}, rw_lock = 1},
>>     bo_ops = 0xffffffff80c4bf70, bo_object = 0xfffff800b62e9c60, bo_synclist =
>> {le_next = 0x0, le_prev = 0x0},
>>     bo_private = 0xfffff80049c2c000, __bo_vnode = 0xfffff80049c2c000, bo_clean =
>> {bv_hd = {tqh_first = 0x0,
>>         tqh_last = 0xfffff80049c2c120}, bv_root = {pt_root = 0}, bv_cnt = 0},
>> bo_dirty = {bv_hd = {tqh_first = 0x0,
>>         tqh_last = 0xfffff80049c2c140}, bv_root = {pt_root = 0}, bv_cnt = 0},
>> bo_numoutput = 0, bo_flag = 0, bo_bsize = 131072},
>>   v_pollinfo = 0x0, v_label = 0x0, v_lockf = 0x0, v_rl = {rl_waiters =
>> {tqh_first = 0x0, tqh_last = 0xfffff80049c2c188},
>>     rl_currdep = 0x0}, v_cstart = 0, v_lasta = 0, v_lastw = 0, v_clen = 0,
>> v_holdcnt = 9, v_usecount = 6, v_iflag = 512,
>>   v_vflag = 32, v_writecount = 0, v_hash = 4833984, v_type = VREG}
>> (kgdb)
>
> flags=2121728 = 0x206000 = LK_SHARED | LK_VNHELD | LK_NODDLKTREAT
> lk_lock = 23 = 0x17 = LK_ONE_SHARER | LK_EXCLUSIVE_WAITERS | LK_SHARED_WAITERS |
> LK_SHARE
>
> So, here's what we have here: this thread tries to get a shared lock on the
> vnode, the vnode is already locked in shared mode, but there is an exclusive
> waiter (or, perhaps, multiple waiters).  So, this thread can not get the lock
> because of the exclusive waiter.  And I do not see an easy way to identify that
> waiter.
>
> In the procstat output that you provided earlier there was no other thread in
> vn_lock.  Hmm, I see this:
> procstat: sysctl: kern.proc.kstack: 14789: Device busy
> procstat: sysctl: kern.proc.kstack: 82034: Device busy
>
> Could you please check what those two processes are (if they are still running)?
> Perhaps try procstat for each of the pids several times.
>

This 2 processes are the 2 instances of the innd daemon (news server) 
which seems in accordance with the directory /usr/local/news/bin.

[root at avoriaz ~]# procstat 14789
   PID  PPID  PGID   SID  TSID THR LOGIN    WCHAN     EMUL          COMM
14789 29165 29165 29165     0   1 root     zfs       FreeBSD ELF64 innd
[root at avoriaz ~]# procstat 82034
   PID  PPID  PGID   SID  TSID THR LOGIN    WCHAN     EMUL          COMM
82034 29165 29165 29165     0   1 root     zfs       FreeBSD ELF64 innd
[root at avoriaz ~]# procstat -f 14789
procstat: kinfo_getfile(): Device busy
   PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
[root at avoriaz ~]# procstat -f 14789
procstat: kinfo_getfile(): Device busy
   PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
[root at avoriaz ~]# procstat -f 14789
procstat: kinfo_getfile(): Device busy
   PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
[root at avoriaz ~]# procstat -f 14789
procstat: kinfo_getfile(): Device busy
   PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
[root at avoriaz ~]# procstat -kk 14789
   PID    TID COMM             TDNAME           KSTACK
procstat: sysctl: kern.proc.kstack: 14789: Device busy
[root at avoriaz ~]# procstat -kk 14789
   PID    TID COMM             TDNAME           KSTACK
procstat: sysctl: kern.proc.kstack: 14789: Device busy
[root at avoriaz ~]# procstat -kk 14789
   PID    TID COMM             TDNAME           KSTACK
procstat: sysctl: kern.proc.kstack: 14789: Device busy
[root at avoriaz ~]# procstat -kk 14789
   PID    TID COMM             TDNAME           KSTACK
procstat: sysctl: kern.proc.kstack: 14789: Device busy
[root at avoriaz ~]#

Same result for 82034.

By the way the news server seems OK:

[root at avoriaz ~]# ctlinnd mode
Server running
Allowing remote connections
Parameters c 14 i 50 (0) l 5000000 o 1010 t 300 H 2 T 60 X 0 normal 
specified
Not reserved
Readers follow enabled
Perl filtering enabled
[root at avoriaz ~]#

I can access the news server with thunderbird.

I can restart the news server if you ask

Henri


More information about the freebsd-stable mailing list