Process enters unkillable state and somewhat wedges zfs

Konstantin Belousov kostikbel at gmail.com
Fri Oct 17 13:54:21 UTC 2014


On Thu, Oct 16, 2014 at 03:33:24PM -0700, Daniel Andersen wrote:
> A further update:  we almost went 3 weeks without a hang after moving /home back off zfs.
> Then someone used sudo in a nullfs <-> zfs directory and we got the same thing.  Interesting
> that it's almost always sudo.  Only other hangs have been when a program coredumps.
> 
> ( by nullfs <-> zfs directory, I mean we have a zfs directory /tank/foo and a nullfs mount /data/foo that people use to
> access )
Read
https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
for instructions how to obtain useful debugging information.

I did not found any mention of the version of the FreeBSD you use.
 
> 
> Dan
> 
> On 09/26/2014 02:47 PM, Daniel Andersen wrote:
> > Okay, we were finally able to collect a trace on this.  I took two, just in
> > case.  basically did a tr <pid>, continue, go back into ddb and did another.
> > 
> > Tracing pid 89483 tid 101168 td 0xfffff8048e301920
> > cpustop_handler() at cpustop_handler+0x28/frame 0xfffffe2f395e1ce0
> > ipi_nmi_handler() at ipi_nmi_handler+0x3f/frame 0xfffffe2f395e1d00
> > trap() at trap+0x42/frame 0xfffffe2f395e1f20
> > nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe2f395e1f20
> > --- trap 0x13, rip = 0xffffffff808aada0, rsp = 0xfffffe2f395e1fe0, rbp = 0xfffffe2ffca2bf60 ---
> > __lockmgr_args() at __lockmgr_args+0x30/frame 0xfffffe2ffca2bf60
> > vop_stdlock() at vop_stdlock+0x3c/frame 0xfffffe2ffca2bf80
> > VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9d/frame 0xfffffe2ffca2bfb0
> > _vn_lock() at _vn_lock+0x43/frame 0xfffffe2ffca2c010
> > zfs_lookup() at zfs_lookup+0x3a1/frame 0xfffffe2ffca2c0a0
> > zfs_freebsd_lookup() at zfs_freebsd_lookup+0x6d/frame 0xfffffe2ffca2c1e0
> > VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x92/frame 0xfffffe2ffca2c210
> > vfs_cache_lookup() at vfs_cache_lookup+0xcf/frame 0xfffffe2ffca2c260
> > VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x92/frame 0xfffffe2ffca2c290
> > null_lookup() at null_lookup+0x8b/frame 0xfffffe2ffca2c300
> > VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x92/frame 0xfffffe2ffca2c330
> > lookup() at lookup+0x58b/frame 0xfffffe2ffca2c3b0
> > namei() at namei+0x504/frame 0xfffffe2ffca2c480
> > vn_open_cred() at vn_open_cred+0x232/frame 0xfffffe2ffca2c5d0
> > vop_stdvptocnp() at vop_stdvptocnp+0x17d/frame 0xfffffe2ffca2c910
> > null_vptocnp() at null_vptocnp+0x2b/frame 0xfffffe2ffca2c970
> > VOP_VPTOCNP_APV() at VOP_VPTOCNP_APV+0x98/frame 0xfffffe2ffca2c9a0
> > vn_vptocnp_locked() at vn_vptocnp_locked+0x113/frame 0xfffffe2ffca2ca20
> > vn_fullpath1() at vn_fullpath1+0x1b2/frame 0xfffffe2ffca2ca80
> > kern___getcwd() at kern___getcwd+0x115/frame 0xfffffe2ffca2cae0
> > amd64_syscall() at amd64_syscall+0x357/frame 0xfffffe2ffca2cbf0
> > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe2ffca2cbf0
> > --- syscall (326, FreeBSD ELF64, sys___getcwd), rip = 0x2000cd8cda, rsp = 0x7fffffffd358, rbp = 0x7fffffffd440 ---
> > 
> > 
> > 
> > Tracing pid 89483 tid 101168 td 0xfffff8048e301920
> > cpustop_handler() at cpustop_handler+0x28/frame 0xfffffe2f395e1ce0
> > ipi_nmi_handler() at ipi_nmi_handler+0x3f/frame 0xfffffe2f395e1d00
> > trap() at trap+0x42/frame 0xfffffe2f395e1f20
> > nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe2f395e1f20
> > --- trap 0x13, rip = 0xffffffff80e6110e, rsp = 0xfffffe2f395e1fe0, rbp = 0xfffffe2ffca2bfb0 ---
> > VOP_LOCK1_APV() at VOP_LOCK1_APV+0xae/frame 0xfffffe2ffca2bfb0
> > _vn_lock() at _vn_lock+0x43/frame 0xfffffe2ffca2c010
> > zfs_lookup() at zfs_lookup+0x3a1/frame 0xfffffe2ffca2c0a0
> > zfs_freebsd_lookup() at zfs_freebsd_lookup+0x6d/frame 0xfffffe2ffca2c1e0
> > VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x92/frame 0xfffffe2ffca2c210
> > vfs_cache_lookup() at vfs_cache_lookup+0xcf/frame 0xfffffe2ffca2c260
> > VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x92/frame 0xfffffe2ffca2c290
> > null_lookup() at null_lookup+0x8b/frame 0xfffffe2ffca2c300
> > VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x92/frame 0xfffffe2ffca2c330
> > lookup() at lookup+0x58b/frame 0xfffffe2ffca2c3b0
> > namei() at namei+0x504/frame 0xfffffe2ffca2c480
> > vn_open_cred() at vn_open_cred+0x232/frame 0xfffffe2ffca2c5d0
> > vop_stdvptocnp() at vop_stdvptocnp+0x17d/frame 0xfffffe2ffca2c910
> > null_vptocnp() at null_vptocnp+0x2b/frame 0xfffffe2ffca2c970
> > VOP_VPTOCNP_APV() at VOP_VPTOCNP_APV+0x98/frame 0xfffffe2ffca2c9a0
> > vn_vptocnp_locked() at vn_vptocnp_locked+0x113/frame 0xfffffe2ffca2ca20
> > vn_fullpath1() at vn_fullpath1+0x1b2/frame 0xfffffe2ffca2ca80
> > kern___getcwd() at kern___getcwd+0x115/frame 0xfffffe2ffca2cae0
> > amd64_syscall() at amd64_syscall+0x357/frame 0xfffffe2ffca2cbf0
> > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe2ffca2cbf0
> > --- syscall (326, FreeBSD ELF64, sys___getcwd), rip = 0x2000cd8cda, rsp = 0x7fffffffd358, rbp = 0x7fffffffd440 ---
> > 
> > 
> > 
> > 
> > On 08/29/2014 11:24 AM, John Baldwin wrote:
> >> On Wednesday, August 27, 2014 02:24:47 PM Daniel Andersen wrote:
> >>> On 08/27/2014 01:39 PM, John Baldwin wrote:
> >>>> These are all blocked in "zfs" then.  (For future reference, the 'mwchan'
> >>>> field that you see as 'STATE' in top or via 'ps O mwchan' is more detailed
> >>>> than the 'D' state.)
> >>>>
> >>>> To diagnose this further, you would need to see which thread holds the
> >>>> ZFS vnode lock these threads need.  I have some gdb scripts you can use to
> >>>> do that at www.freebsd.org/~jhb/gdb/.  You would want to download 'gdb6*'
> >>>> files from there and then do this as root:
> >>>>
> >>>> # cd /path/to/gdb/files
> >>>> # kgdb
> >>>> (kgdb) source gdb6
> >>>> (kgdb) sleepchain 42335
> >>>>
> >>>> Where '42335' is the pid of some process stuck in "zfs".
> >>>
> >>> I will keep this in mind the next time the machine wedges.  Another data
> >>> point: the second procstat output I sent was the most recent.  All the
> >>> processes listed there were after the fact.  The process that started the
> >>> entire problem ( this time ) was sudo, and it only has this one entry in
> >>> procstat:
> >>>
> >>> 38003 102797 sudo             -                <running>
> >>>
> >>> Of note, this does not appear to be blocked on zfs in anyway.  'ps' showed
> >>> it in 'R' state instead of 'D' ( I will be sure to use mwchan in the
> >>> future. ) It appeared to be pegging an entire CPU core at 100% usage, as
> >>> well, and was only single threaded.
> >>
> >> Well, if it is spinning in some sort of loop in the kernel while holding a
> >> ZFS vnode lock that could be blocking all the other threads.  In that case,
> >> you don't need to do what I asked for above.  Instead, we need to find out
> >> what that thread is doing.  There are two ways of doing this.  One is to
> >> force a panic via 'sysctl debug.kdb.panic=1' and then use kgdb on the
> >> crashdump to determine what the running thread is doing.  Another option
> >> is to break into the DDB debugger on the console (note that you will need
> >> to build a custom kernel with DDB if you are on stable) and request a
> >> stack trace of the running process via 'tr <pid>'.  Ideally you can do this
> >> over a serial console so you can just cut and paste the output of the trace
> >> into a mail.  Over a video console you can either transcribe it by hand or
> >> take photos.
> >>
> > 
> > _______________________________________________
> > freebsd-fs at freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
> > 
> 
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"


More information about the freebsd-fs mailing list