NFS deadlock on 9.2-Beta1

Fri Aug 23 13:43:59 UTC 2013

J. David wrote:
> One deadlocked process cropped up overnight, but I managed to panic
> the box before getting too much debugging info. :(
> 
> The process was in state T instead of D, which I guess must be a side
> effect of some of the debugging code compiled in.
> 
> Here are the details I was able to capture:
> 
> db>  show proc 7692
> Process 7692 (httpd) at 0xfffffe0158793000:
>  state: NORMAL
>  uid: 25000  gids: 25000
>  parent: pid 1 at 0xfffffe00039c3950
>  ABI: FreeBSD ELF64
>  arguments: /nfsn/apps/tapache22/bin/httpd
>  threads: 3
> 100674                   D       newnfs   0xfffffe021cdd9848 httpd
> 100597                   D       pgrbwt   0xfffffe02fda788b8 httpd
> 100910                   s                                   httpd
> 
> db> show thread 100674
> Thread 100674 at 0xfffffe0108c79480:
>  proc (pid 7692): 0xfffffe0158793000
>  name: httpd
>  stack: 0xffffff834c80f000-0xffffff834c812fff
>  flags: 0x2a804  pflags: 0
>  state: INHIBITED: {SLEEPING}
>  wmesg: newnfs  wchan: 0xfffffe021cdd9848
>  priority: 96
>  container lock: sleepq chain (0xffffffff813c03c8)
> 
> db> tr 100674
> Tracing pid 7692 tid 100674 td 0xfffffe0108c79480
> sched_switch() at sched_switch+0x234/frame 0xffffff834c812360
> mi_switch() at mi_switch+0x15c/frame 0xffffff834c8123a0
> sleepq_switch() at sleepq_switch+0x17d/frame 0xffffff834c8123e0
> sleepq_wait() at sleepq_wait+0x43/frame 0xffffff834c812410
> sleeplk() at sleeplk+0x11a/frame 0xffffff834c812460
> __lockmgr_args() at __lockmgr_args+0x9a9/frame 0xffffff834c812580
> nfs_lock1() at nfs_lock1+0x87/frame 0xffffff834c8125b0
> VOP_LOCK1_APV() at VOP_LOCK1_APV+0xbe/frame 0xffffff834c8125e0
> _vn_lock() at _vn_lock+0x63/frame 0xffffff834c812640
> ncl_upgrade_vnlock() at ncl_upgrade_vnlock+0x5e/frame
> 0xffffff834c812670
> ncl_bioread() at ncl_bioread+0x195/frame 0xffffff834c8127e0
> VOP_READ_APV() at VOP_READ_APV+0xd1/frame 0xffffff834c812810
> vn_rdwr() at vn_rdwr+0x2bc/frame 0xffffff834c8128d0
> kern_sendfile() at kern_sendfile+0xa90/frame 0xffffff834c812ac0
> do_sendfile() at do_sendfile+0x92/frame 0xffffff834c812b20
> amd64_syscall() at amd64_syscall+0x259/frame 0xffffff834c812c30
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xffffff834c812c30
> --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c,
> rsp
> = 0x7ffffe9f43c8, rbp = 0x7ffffe9f4700 ---
> 
> db> show lockchain 100674
> thread 100674 (pid 7692, httpd) inhibited
> 
> db> show thread 100597
> Thread 100597 at 0xfffffe021c976000:
>  proc (pid 7692): 0xfffffe0158793000
>  name: httpd
>  stack: 0xffffff834c80a000-0xffffff834c80dfff
>  flags: 0x28804  pflags: 0
>  state: INHIBITED: {SLEEPING}
>  wmesg: pgrbwt  wchan: 0xfffffe02fda788b8
>  priority: 84
>  container lock: sleepq chain (0xffffffff813c0148)
> 
> db> tr 100597
> Tracing pid 7692 tid 100597 td 0xfffffe021c976000
> sched_switch() at sched_switch+0x234/frame 0xffffff834c80d750
> mi_switch() at mi_switch+0x15c/frame 0xffffff834c80d790
> sleepq_switch() at sleepq_switch+0x17d/frame 0xffffff834c80d7d0
> sleepq_wait() at sleepq_wait+0x43/frame 0xffffff834c80d800
> _sleep() at _sleep+0x30f/frame 0xffffff834c80d890
> vm_page_grab() at vm_page_grab+0x120/frame 0xffffff834c80d8d0
> kern_sendfile() at kern_sendfile+0x992/frame 0xffffff834c80dac0
> do_sendfile() at do_sendfile+0x92/frame 0xffffff834c80db20
> amd64_syscall() at amd64_syscall+0x259/frame 0xffffff834c80dc30
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xffffff834c80dc30
> --- syscall (393, FreeBSD ELF64, sys_sendfile), rip = 0x801b26f4c,
> rsp
> = 0x7ffffebf53c8, rbp = 0x7ffffebf5700 ---
> 
> db> show lockchain 100597
> thread 100597 (pid 7692, httpd) inhibited
> 
> The "inhibited" is not something I'm familiar with and didn't match
> the example output; I thought that maybe the T state was overpowering
> the locks, and that maybe I should continue the system and then -CONT
> the process.  However, a few seconds after I issued "c" at the DDB
> prompt, the system panicked in the console driver ("mtx_lock_spin:
> recursed on non-recursive mutex cnputs_mtx @
> /usr/src/sys/kern/kern_cons.c:500"), so I guess that's not a thing to
> do. :(
> 
> Sorry my stupidity and ignorance is dragging this out. :(  This is
> all
> well outside my comfort zone, but next time I'll get it for sure.
> 
No problem. Thanks for trying to capture this stuff.

Unfortunately, what you have above doesn't tell me anything more about
the problem.
The main question to me is "Why is the thread stuck in "pgrbwt" permanently?".

To figure this out, we need the info on all threads on the system. In particular,
the status (the output of "ps axHl" would be a start, before going into the
debugger) of the "nfsiod" threads might point to the cause, although it may
involve other threads as well.

If you are running a serial console, just start "script" and then type the
commands "ps axHl" followed by going into the debugger and doing the commands
here: (basically everything with "all"):
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
(I don't know why your console crashed. Hopefully you can get this done before
 the crash the next time.)
If you do get it all, please include kib at freebsd.org in the cc list as well as
me.

Thanks for trying to do this, rick

> Thanks!
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to
> "freebsd-stable-unsubscribe at freebsd.org"
>