ZFS panic on a RELENG_8 NFS server (Was: panic: spin lock held too long (RELENG_8 from today))

Fri Sep 9 20:10:46 UTC 2011

Hiroki Sato <hrs at freebsd.org> wrote
  in <20110907.094717.2272609566853905102.hrs at allbsd.org>:

hr>  During this investigation an disk has to be replaced and resilvering
hr>  it is now in progress.  A deadlock and a forced reboot after that
hr>  make recovering of the zfs datasets take a long time (for committing
hr>  logs, I think), so I will try to reproduce the deadlock and get a
hr>  core dump after it finished.

 I think I was able to reproduce the symptoms.  I am not sure whether
 they are exactly the same as what occurred on my box before, because
 the kernel was replaced with one built with some debugging options,
 but they are reproducible at least.

 There are two symptoms.  One is a panic.  The DDB output at the time
 of the panic is the following:

----
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address	= 0x100000040
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8065b926
stack pointer	        = 0x28:0xffffff8257b94d70
frame pointer	        = 0x28:0xffffff8257b94e10
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 992 (nfsd: service)
[thread pid 992 tid 100586 ]
Stopped at      witness_checkorder+0x246:       movl    0x40(%r13),%ebx

db> bt
Tracing pid 992 tid 100586 td 0xffffff00595d9000
witness_checkorder() at witness_checkorder+0x246
_sx_slock() at _sx_slock+0x35
dmu_bonus_hold() at dmu_bonus_hold+0x57
zfs_zget() at zfs_zget+0x237
zfs_dirent_lock() at zfs_dirent_lock+0x488
zfs_dirlook() at zfs_dirlook+0x69
zfs_lookup() at zfs_lookup+0x26b
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x81
vfs_cache_lookup() at vfs_cache_lookup+0xf0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x40
lookup() at lookup+0x384
nfsvno_namei() at nfsvno_namei+0x268
nfsrvd_lookup() at nfsrvd_lookup+0xd6
nfsrvd_dorpc() at nfsrvd_dorpc+0x745
nfssvc_program() at nfssvc_program+0x447
svc_run_internal() at svc_run_internal+0x51b
svc_thread_start() at svc_thread_start+0xb
fork_exit() at fork_exit+0x11d
fork_trampoline() at fork_trampoline+0xe
--- trap 0xc, rip = 0x8006a031c, rsp = 0x7fffffffe6c8, rbp = 0x6 ---
----

 The complete output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_1/pool-zfs-20110909-1.txt
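
 As a rough reading of the trap frame in the DDB output above, the
 faulting address can be related to the instruction at
 witness_checkorder+0x246.  The sketch below assumes that the page
 fault was raised by that "movl 0x40(%r13),%ebx" itself, so that the
 fault address equals %r13 plus the 0x40 displacement.  The helper
 name base_register() is made up for illustration, and the conclusion
 is only an interpretation, not something confirmed from the crash
 dump.

```python
# Values copied from the DDB output above.
FAULT_VADDR = 0x100000040   # "fault virtual address"
DISPLACEMENT = 0x40         # from "movl 0x40(%r13),%ebx"

# Start of the upper (kernel) half of the amd64 canonical address space.
KERNEL_HALF_START = 0xffff800000000000


def base_register(fault_vaddr: int, displacement: int) -> int:
    """Hypothetical helper: solve base + displacement == fault address."""
    return fault_vaddr - displacement


# Assumption: the faulting access is the movl shown at the "Stopped at" line.
r13 = base_register(FAULT_VADDR, DISPLACEMENT)

print(f"implied %r13        = {r13:#x}")
print(f"low 32 bits zero    = {r13 & 0xffffffff == 0}")
print(f"in kernel half      = {r13 >= KERNEL_HALF_START}")
```

 Under that assumption %r13 would be 0x100000000, which is not in the
 kernel half of the address space (compare the stack pointer,
 0xffffff8257b94d70).  That would be consistent with
 witness_checkorder() following a corrupted or stale pointer, but the
 crash dump is needed to say which one.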

 The other is a hang on ZFS access.  The kernel keeps running without
 a panic, but any access to a ZFS dataset makes the accessing program
 unresponsive.  The DDB output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_2/pool-zfs-20110909-2.txt

 The trigger for both was some access to a ZFS dataset from the NFS
 clients.  Because the access pattern was complex I could not narrow
 down the culprit, but it seems timing-dependent, and simply doing
 "rm -rf" locally on the server can sometimes trigger them.

 The crash dumps and the kernel can be found at the following URLs:

  panic:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_1/

  no panic but unresponsive:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_2/

  kernel:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_kernel/

-- Hiroki