ZFS hanging
Dennis Glatting
dg at pki2.com
Fri Jul 13 05:34:11 UTC 2012
On Thu, 2012-07-12 at 15:15 +0200, Fabian Keil wrote:
> Dennis Glatting <freebsd at pki2.com> wrote:
>
> > I have a ZFS array of disks where the system simply stops as if forever
> > blocked by some IO mutex. This happens often and the following is the
> > output of top:
> >
> > last pid:  6075;  load averages:  0.00,  0.00,  0.00    up 0+16:54:41  13:04:10
> > 135 processes: 1 running, 134 sleeping
> > CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
> > Mem: 47M Active, 24M Inact, 18G Wired, 120M Buf, 44G Free
> > Swap: 32G Total, 32G Free
> >
> >   PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
> >  2410 root        1  33    0 11992K  2820K zio->i  7 331:25  0.00% bzip2
> >  2621 root        1  52    4 28640K  5544K tx->tx 24 245:33  0.00% john
> >  2624 root        1  48    4 28640K  5544K tx->tx  4 239:08  0.00% john
> >  2623 root        1  49    4 28640K  5544K tx->tx  7 238:44  0.00% john
>
> Does top continue to run or does it hang as well?
>
It continues to run.
> I believe the locks shown above shouldn't affect already
> running applications that don't cause disk traffic.
>
> > My question is what should I be looking at and how should I look at it?
> > There is nothing in the logs or on the console; rather, the system is
> > forever paused and entering commands results in no response (it's as
> > if everything is deadlocked).
>
> If the entered commands actually start, you can try sending
> SIGINFO with CTRL+T to request a status:
>
> fk at r500 ~ $zpool status
> load: 0.15 cmd: zpool 2698 [spa_namespace_lock] 543.23r 0.00u 0.12s 0% 2908k
>
> If you can run procstat, you can try getting kernel stack traces
> for some (or all) processes, which should give you a rough idea
> of how the lock is reached:
>
> fk at r500 ~ $procstat -kk $(pgrep zpool)
> PID TID COMM TDNAME KSTACK
> 2698 100922 zpool - mi_switch+0x196 sleepq_wait+0x42 _sx_xlock_hard+0x525 _sx_xlock+0x75 spa_all_configs+0x5e zfs_ioc_pool_configs+0x29 zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x5f9 Xfast_syscall+0xf7
> 2388 100431 zpool - mi_switch+0x196 sleepq_wait+0x42 _sx_xlock_hard+0x525 _sx_xlock+0x75 spa_open_common+0x7a spa_get_stats+0x5b zfs_ioc_pool_stats+0x2c zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x5f9 Xfast_syscall+0xf7
>
> If procstat hangs as well, you could try executing it in a loop
> before the problem occurs. If it stays in the cache, it may keep
> running after most of the other processes stop responding.
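> An untested sketch of such a loop (the interval is arbitrary, and
> note that redirecting the output to a file on the hung pool would
> block as well, so keep it on the terminal or a memory disk):
>
> while :; do
>         date                    # timestamp each snapshot
>         procstat -kk -a         # kernel stack traces for all processes
>         sleep 30
> done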
>
> DTrace can be useful for analyzing locking issues as well:
>
> fk at r500 ~ $sudo ~/scripts/flowtrace.d zfsdev_ioctl
> 0 101057 2012 Jul 12 14:59:42.155 00: Trace in progress. Waiting to enter zfsdev_ioctl. Hit CTRL-C to exit.
> 0 100979 2012 Jul 12 14:59:43.933 04: --> zfsdev_ioctl:entry
> 0 100979 2012 Jul 12 14:59:43.933 05: --> zfs_secpolicy_none:entry
> 0 100979 2012 Jul 12 14:59:43.933 05: <-- zfs_secpolicy_none:return
> 0 100979 2012 Jul 12 14:59:43.933 05: --> zfs_ioc_pool_configs:entry
> 0 100979 2012 Jul 12 14:59:43.933 06: --> spa_all_configs:entry
> 0 100979 2012 Jul 12 14:59:43.933 07: --> nvlist_alloc:entry
> 0 100979 2012 Jul 12 14:59:43.933 07: [...]
> 0 100979 2012 Jul 12 14:59:43.933 07: <-- nvlist_alloc:return
> 0 100979 2012 Jul 12 14:59:43.933 07: --> _sx_xlock:entry
> 0 100979 2012 Jul 12 14:59:43.933 08: --> _sx_xlock_hard:entry
> 0 100979 2012 Jul 12 14:59:43.933 09: --> sleepq_lock:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 09: <-- sleepq_lock:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> lockstat_nsecs:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> binuptime:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> hpet_get_timecount:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- hpet_get_timecount:return
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- binuptime:return
> 0 100979 2012 Jul 12 14:59:43.933 09: <-- lockstat_nsecs:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> sleepq_add:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> sleepq_lookup:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- sleepq_lookup:return
> 0 100979 2012 Jul 12 14:59:43.933 10: --> thread_lock_flags_:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- thread_lock_flags_:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> spinlock_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> critical_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- critical_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 09: <-- spinlock_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> sleepq_wait:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> thread_lock_flags_:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- thread_lock_flags_:return
> 0 100979 2012 Jul 12 14:59:43.933 10: --> sleepq_switch:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> sched_sleep:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- sched_sleep:return
> 0 100979 2012 Jul 12 14:59:43.933 11: --> thread_lock_set:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> spinlock_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: --> critical_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- critical_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- spinlock_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 10: --> mi_switch:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> rdtsc:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- rdtsc:return
> 0 100979 2012 Jul 12 14:59:43.933 11: --> sched_switch:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: --> sched_pctcpu_update:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- sched_pctcpu_update:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> thread_lock_block:entry
> 0 100979 2012 Jul 12 14:59:43.933 13: --> spinlock_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: --> critical_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: <-- critical_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 13: <-- spinlock_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- thread_lock_block:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> tdq_load_rem:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- tdq_load_rem:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> choosethread:entry
> 0 100979 2012 Jul 12 14:59:43.933 13: --> sched_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: --> tdq_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 15: --> runq_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 15: <-- runq_choose:return
> 0 100979 2012 Jul 12 14:59:43.933 15: --> runq_choose_from:entry
> 0 100979 2012 Jul 12 14:59:43.933 15: <-- runq_choose_from:return
> 0 100979 2012 Jul 12 14:59:43.933 14: --> runq_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: <-- runq_choose:return
> 0 100979 2012 Jul 12 14:59:43.933 13: <-- sched_choose:return
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- choosethread:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> sched_pctcpu_update:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- sched_pctcpu_update:return
>
> Of course, once you know where ZFS is hanging, you still have to
> figure out why ...
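> A possible (untested) starting point is to aggregate the kernel
> stacks of everything that ends up in the contested sx lock path, e.g.:
>
> dtrace -n 'fbt::_sx_xlock_hard:entry { @[stack()] = count(); }'
>
> and then look for the spa_namespace_lock users among the stacks.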
>
Thanks!
I've made several changes to the volumes and am rerunning my stuff now
(started an hour ago). Rather than a single volume of hodgepodge disks,
the pool now spans two RAIDz1 vdevs of similar disks, partly because I
was curious and partly because I thought it'd be fun. I also forced
ashift=12 and have replaced/upgraded some of the disks.
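(For what it's worth, forcing ashift=12 at pool creation usually means
the gnop trick -- a rough, untested sketch using the raidz1-0 members
below as an example:

  gnop create -S 4096 da2
  zpool create disk-1 raidz da2.nop da3 da4 da8
  zpool export disk-1
  gnop destroy da2.nop
  zpool import disk-1
  zdb -C disk-1 | grep ashift   # should now report ashift: 12

One 4k .nop provider per vdev is enough, since ashift is a per-vdev
property.)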
I have found, but not run, a bit of Seagate code that forces 4k sectors
on the ST32000542AS disks -- they report 512 byte sectors but are really
4k. I also ran Seatools against some of my disks.
Array now:

iirc# zpool status disk-1
  pool: disk-1
 state: ONLINE
  scan: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        disk-1           ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            da2          ONLINE       0     0     0
            da3          ONLINE       0     0     0
            da4          ONLINE       0     0     0
            da8          ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            da1          ONLINE       0     0     0
            da5          ONLINE       0     0     0
            da6          ONLINE       0     0     0
            da7          ONLINE       0     0     0
            da9          ONLINE       0     0     0
        logs
          gpt/zil-disk1  ONLINE       0     0     0
        cache
          ada1           ONLINE       0     0     0

errors: No known data errors
--
Dennis Glatting <dg at pki2.com>