More on odd ZFS not-quite-deadlock

Thu Jan 31 03:51:26 UTC 2013

I posted a few days ago about what I thought was a ZFS-related
almost-deadlock.  I have a bit more information now, but I'm still
puzzled.  Hopefully someone else has seen this before.

While things are in the hung state, a "zfs recv" is running.  It's
receiving an empty snapshot into one of the many datasets on this file
server.  "zfs recv" reports that receiving this particular empty
snapshot takes about half an hour.  When it finally completes,
everything starts working normally again.  The same "zfs recv" takes
only a few seconds 23 hours out of 24.  (This particular replication
job will stop running in a few hours, so this may be the last time I
can collect information about the issue for a while.)

The kstacks of the processes that may be involved look like this:

  PID    TID COMM             TDNAME           KSTACK                       
    0 100061 kernel           thread taskq     mi_switch+0x196 sleepq_wait+0x42 _sx_slock_hard+0x3bb _sx_slock+0x3d zfs_reclaim_complete+0x38 taskqueue_run_locked+0x85 taskqueue_thread_loop+0x46 fork_exit+0x11f fork_trampoline+0xe 
    7 100215 zfskern          arc_reclaim_thre mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c arc_reclaim_thread+0x29d fork_exit+0x11f fork_trampoline+0xe 
    7 100216 zfskern          l2arc_feed_threa mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c l2arc_feed_thread+0x1a8 fork_exit+0x11f fork_trampoline+0xe 
    7 100592 zfskern          txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 
    7 100593 zfskern          txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 
    7 100989 zfskern          txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 
    7 100990 zfskern          txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 
    7 101355 zfskern          txg_thread_enter mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x11f fork_trampoline+0xe 
    7 101356 zfskern          txg_thread_enter mi_switch+0x196 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_thread_wait+0x3c txg_sync_thread+0x269 fork_exit+0x11f fork_trampoline+0xe 
   13 100053 geom             g_event          mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_run_events+0x430 fork_exit+0x11f fork_trampoline+0xe 
   13 100054 geom             g_up             mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_io_schedule_up+0xd8 g_up_procbody+0x5c fork_exit+0x11f fork_trampoline+0xe 
   13 100055 geom             g_down           mi_switch+0x196 sleepq_wait+0x42 _sleep+0x3a8 g_io_schedule_down+0x20e g_down_procbody+0x5c fork_exit+0x11f fork_trampoline+0xe 
   22 100225 syncer           -                mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 rrw_enter+0xdb zfs_sync+0x63 sync_fsync+0x19d VOP_FSYNC_APV+0x4a sync_vnode+0x15e sched_sync+0x1c5 fork_exit+0x11f fork_trampoline+0xe 

93224 102554 zfs              -                mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 zio_wait+0x61 dbuf_read+0x5e5 dnode_next_offset_level+0x28d dnode_next_offset+0xb9 dmu_object_next+0x3e dsl_dataset_destroy+0x164 dmu_recv_end+0x184 zfs_ioc_recv+0x9f4 zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xf0 amd64_syscall+0x5ea Xfast_syscall+0xf7 

[This is the zfs recv process that is applying the replication package
with an empty snapshot.]

93320 102479 df               -                mi_switch+0x196 sleepq_wait+0x42 _cv_wait+0x121 rrw_enter+0xdb zfs_root+0x40 lookup+0xaa6 namei+0x535 kern_statfs+0xa4 sys_statfs+0x37 amd64_syscall+0x5ea Xfast_syscall+0xf7 
[7 more like this]

(I've omitted all of the threads that are clearly waiting for some
unrelated event, such as those sleeping in nanosleep() or select().)
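
For anyone who wants to do the same kind of trimming on their own
system, here is a minimal sketch in Python.  It is not necessarily how
the list above was produced.  It assumes input in the PID/TID/COMM/
TDNAME/KSTACK format shown above (as "procstat -kk -a" prints on
FreeBSD), and the set of kernel symbols standing in for the
nanosleep() and select() wait paths is an assumption that should be
adjusted to the kernel in use.  The script name used below is
hypothetical.

```python
import re
import sys

# Matches one kstack frame such as "sleepq_wait+0x42" and captures the symbol.
FRAME_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\+0x[0-9a-fA-F]+")

# ASSUMPTION: kernel symbols for the nanosleep() and select() wait paths.
# Extend or change this set to match the kernel you are looking at.
UNRELATED = {"sys_nanosleep", "kern_nanosleep", "sys_select", "kern_select"}


def frames(line):
    """Return the symbol names of all kstack frames found on one line."""
    return FRAME_RE.findall(line)


def interesting(lines, unrelated=UNRELATED):
    """Keep lines whose kstack contains no frame from the `unrelated` set.

    Lines without any frame (the column header, blank lines, threads
    with no recorded stack) are dropped as well.
    """
    kept = []
    for line in lines:
        syms = frames(line)
        if syms and not unrelated.intersection(syms):
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Read a kstack listing on stdin, write the surviving lines to stdout.
    sys.stdout.writelines(interesting(sys.stdin))
```

Typical use would be something like
"procstat -kk -a | python3 filter_kstacks.py".  Matching on whole
symbol names rather than substrings keeps a frame like zfs_sync from
being confused with an unrelated symbol that merely contains the same
text.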

-GAWollman