ZFS hanging
Dennis Glatting
dg at pki2.com
Fri Jul 13 05:34:11 UTC 2012
On Thu, 2012-07-12 at 15:15 +0200, Fabian Keil wrote:
> Dennis Glatting <freebsd at pki2.com> wrote:
>
> > I have a ZFS array of disks where the system simply stops as if forever
> > blocked by some IO mutex. This happens often and the following is the
> > output of top:
> >
> > last pid:  6075;  load averages:  0.00,  0.00,  0.00    up 0+16:54:41  13:04:10
> > 135 processes: 1 running, 134 sleeping
> > CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
> > Mem: 47M Active, 24M Inact, 18G Wired, 120M Buf, 44G Free
> > Swap: 32G Total, 32G Free
> >
> >   PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
> >  2410 root        1  33    0 11992K  2820K zio->i  7 331:25  0.00% bzip2
> >  2621 root        1  52    4 28640K  5544K tx->tx 24 245:33  0.00% john
> >  2624 root        1  48    4 28640K  5544K tx->tx  4 239:08  0.00% john
> >  2623 root        1  49    4 28640K  5544K tx->tx  7 238:44  0.00% john
>
> Does top continue to run or does it hang as well?
>
It continues to run.
> I believe the locks shown above shouldn't affect already
> running applications that don't cause disk traffic.
>
> > My question is what should I be looking at and how should I look at it?
> > There is nothing in the logs or on the console; rather, the system is
> > forever paused and entering commands results in no response (it's as
> > if everything is deadlocked).
>
> If the entered commands actually start, you can try sending
> SIGINFO with CTRL+T to request a status:
>
> fk at r500 ~ $zpool status
> load: 0.15 cmd: zpool 2698 [spa_namespace_lock] 543.23r 0.00u 0.12s 0% 2908k
>
> If you can run procstat, you can try getting kernel stack traces
> for some (or all) processes, which should give you a rough idea
> of how the lock is reached:
>
> fk at r500 ~ $procstat -kk $(pgrep zpool)
> PID TID COMM TDNAME KSTACK
> 2698 100922 zpool - mi_switch+0x196 sleepq_wait+0x42 _sx_xlock_hard+0x525 _sx_xlock+0x75 spa_all_configs+0x5e zfs_ioc_pool_configs+0x29 zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x5f9 Xfast_syscall+0xf7
> 2388 100431 zpool - mi_switch+0x196 sleepq_wait+0x42 _sx_xlock_hard+0x525 _sx_xlock+0x75 spa_open_common+0x7a spa_get_stats+0x5b zfs_ioc_pool_stats+0x2c zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl+0x115 sys_ioctl+0xfd amd64_syscall+0x5f9 Xfast_syscall+0xf7
>
> If procstat hangs as well, you could try executing it in a loop
> before the problem occurs. If it stays in the cache, it may keep
> running after most of the other processes stop responding.
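> An untested sketch of such a loop (the interval is arbitrary, and
> note that redirecting the output to a file on the hung pool would
> block as well, so keep it on the terminal or a memory disk):
>
> while :; do
>         date                    # timestamp each snapshot
>         procstat -kk -a         # kernel stack traces for all processes
>         sleep 30
> done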
>
> DTrace can be useful for analyzing locking issues as well:
>
> fk at r500 ~ $sudo ~/scripts/flowtrace.d zfsdev_ioctl
> 0 101057 2012 Jul 12 14:59:42.155 00: Trace in progress. Waiting to enter zfsdev_ioctl. Hit CTRL-C to exit.
> 0 100979 2012 Jul 12 14:59:43.933 04: --> zfsdev_ioctl:entry
> 0 100979 2012 Jul 12 14:59:43.933 05: --> zfs_secpolicy_none:entry
> 0 100979 2012 Jul 12 14:59:43.933 05: <-- zfs_secpolicy_none:return
> 0 100979 2012 Jul 12 14:59:43.933 05: --> zfs_ioc_pool_configs:entry
> 0 100979 2012 Jul 12 14:59:43.933 06: --> spa_all_configs:entry
> 0 100979 2012 Jul 12 14:59:43.933 07: --> nvlist_alloc:entry
> 0 100979 2012 Jul 12 14:59:43.933 07: [...]
> 0 100979 2012 Jul 12 14:59:43.933 07: <-- nvlist_alloc:return
> 0 100979 2012 Jul 12 14:59:43.933 07: --> _sx_xlock:entry
> 0 100979 2012 Jul 12 14:59:43.933 08: --> _sx_xlock_hard:entry
> 0 100979 2012 Jul 12 14:59:43.933 09: --> sleepq_lock:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 09: <-- sleepq_lock:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> lockstat_nsecs:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> binuptime:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> hpet_get_timecount:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- hpet_get_timecount:return
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- binuptime:return
> 0 100979 2012 Jul 12 14:59:43.933 09: <-- lockstat_nsecs:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> sleepq_add:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> sleepq_lookup:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- sleepq_lookup:return
> 0 100979 2012 Jul 12 14:59:43.933 10: --> thread_lock_flags_:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- thread_lock_flags_:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> spinlock_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> critical_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- critical_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 09: <-- spinlock_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 09: --> sleepq_wait:entry
> 0 100979 2012 Jul 12 14:59:43.933 10: --> thread_lock_flags_:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 10: <-- thread_lock_flags_:return
> 0 100979 2012 Jul 12 14:59:43.933 10: --> sleepq_switch:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> sched_sleep:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- sched_sleep:return
> 0 100979 2012 Jul 12 14:59:43.933 11: --> thread_lock_set:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> spinlock_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: --> critical_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- critical_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- spinlock_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 10: --> mi_switch:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: --> rdtsc:entry
> 0 100979 2012 Jul 12 14:59:43.933 11: <-- rdtsc:return
> 0 100979 2012 Jul 12 14:59:43.933 11: --> sched_switch:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: --> sched_pctcpu_update:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- sched_pctcpu_update:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> spinlock_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: --> critical_enter:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- critical_enter:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> thread_lock_block:entry
> 0 100979 2012 Jul 12 14:59:43.933 13: --> spinlock_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: --> critical_exit:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: <-- critical_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 13: <-- spinlock_exit:return
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- thread_lock_block:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> tdq_load_rem:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- tdq_load_rem:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> choosethread:entry
> 0 100979 2012 Jul 12 14:59:43.933 13: --> sched_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: --> tdq_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 15: --> runq_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 15: <-- runq_choose:return
> 0 100979 2012 Jul 12 14:59:43.933 15: --> runq_choose_from:entry
> 0 100979 2012 Jul 12 14:59:43.933 15: <-- runq_choose_from:return
> 0 100979 2012 Jul 12 14:59:43.933 14: --> runq_choose:entry
> 0 100979 2012 Jul 12 14:59:43.933 14: <-- runq_choose:return
> 0 100979 2012 Jul 12 14:59:43.933 13: <-- sched_choose:return
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- choosethread:return
> 0 100979 2012 Jul 12 14:59:43.933 12: --> sched_pctcpu_update:entry
> 0 100979 2012 Jul 12 14:59:43.933 12: <-- sched_pctcpu_update:return
>
> Of course, once you know where ZFS is hanging, you still have to
> figure out why ...
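> A possible (untested) starting point is to aggregate the kernel
> stacks of everything that ends up in the contested sx lock path, e.g.:
>
> dtrace -n 'fbt::_sx_xlock_hard:entry { @[stack()] = count(); }'
>
> and then look for the spa_namespace_lock users among the stacks.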
>
Thanks!
I've made several changes to the volumes and am rerunning my stuff now
(started an hour ago). Rather than a single volume of hodgepodge disks,
the pool now spans two RAIDz1 vdevs of similar disks, partly because I
was curious and partly because I thought it'd be fun. I also forced
ashift=12 and have replaced/upgraded some of the disks.
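(For what it's worth, forcing ashift=12 at pool creation usually means
the gnop trick -- a rough, untested sketch using the raidz1-0 members
below as an example:

  gnop create -S 4096 da2
  zpool create disk-1 raidz da2.nop da3 da4 da8
  zpool export disk-1
  gnop destroy da2.nop
  zpool import disk-1
  zdb -C disk-1 | grep ashift   # should now report ashift: 12

One 4k .nop provider per vdev is enough, since ashift is a per-vdev
property.)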
I have found, but not run, a bit of Seagate code that forces 4k sectors
on the ST32000542AS disks -- they report 512 byte sectors but are really
4k. I also ran Seatools against some of my disks.
Array now:

iirc# zpool status disk-1
  pool: disk-1
 state: ONLINE
  scan: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        disk-1           ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            da2          ONLINE       0     0     0
            da3          ONLINE       0     0     0
            da4          ONLINE       0     0     0
            da8          ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            da1          ONLINE       0     0     0
            da5          ONLINE       0     0     0
            da6          ONLINE       0     0     0
            da7          ONLINE       0     0     0
            da9          ONLINE       0     0     0
        logs
          gpt/zil-disk1  ONLINE       0     0     0
        cache
          ada1           ONLINE       0     0     0

errors: No known data errors
--
Dennis Glatting <dg at pki2.com>