zfs hang on zfs access

Mon Aug 1 16:49:06 UTC 2011

Hi!

Lots of processes hang when accessing one specific ZFS dataset. All
other datasets are fine.
This looks similar to the pool hang reported by Steven Hartland last week,
but the machine in question does not have a high uptime:

(Pasted as a quote to avoid line breaks)
> last pid:   274;  load averages:  0.00,  0.48,  1.21     up 6+13:06:28  17:35:29
> 71 processes:  1 running, 70 sleeping
> CPU:  0.2% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.6% idle
> Mem: 26M Active, 18M Inact, 5018M Wired, 1512K Cache, 597M Buf, 617M Free
> Swap: 4096M Total, 27M Used, 4069M Free
> 
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>   274 root          1  44    0  9376K  2864K CPU0    1   0:00  0.29% top
> 99902 root          1  44    0 29152K  5440K h->has  1   0:00  0.00% smbd
> 99922 root          1  44    0 29152K  5056K h->has  1   0:00  0.00% smbd
> 99995 root          1  44    0 29152K  5072K zfs     1   0:00  0.00% smbd
> 99958 root          1  44    0 29152K  5056K zfsvfs  1   0:00  0.00% smbd
> 99976 root          1  45    0 29152K  4824K zfs     1   0:00  0.00% smbd
> 99946 root          1  44    0 29152K  4692K zfsvfs  1   0:00  0.00% smbd
> 99949 root          1  44    0 29152K  4700K zfsvfs  1   0:00  0.00% smbd
>   205 root          1  44    0 29152K  4800K lockf   1   0:00  0.00% smbd
>   219 root          1  44    0 29152K  4748K lockf   1   0:00  0.00% smbd
> 99932 root          1  44    0 29152K  4704K zfsvfs  1   0:00  0.00% smbd
> 99921 root          1  44    0 29152K  4584K zfs     1   0:00  0.00% smbd
> 99951 root          1  44    0 29152K  4688K h->has  1   0:00  0.00% smbd
>   229 root          1  44    0 29152K  4760K lockf   1   0:00  0.00% smbd
>   218 root          1  44    0 29152K  4720K lockf   1   0:00  0.00% smbd
> 99952 root          1  44    0 29152K  4688K h->has  1   0:00  0.00% smbd
> 99956 root          1  44    0 28816K  4212K zfs     1   0:00  0.00% smbd
> 99978 root          1  44    0 29152K  4740K zfs     1   0:00  0.00% smbd
>   199 backuppc      1  44    0 12608K  2820K h->has  1   0:00  0.00% perl5.8.9
> 99945 root          1  44    0 28816K  4212K zfs     1   0:00  0.00% smbd
>   105 root          1  44    0 29152K  4652K zfs     1   0:00  0.00% smbd
>   181 root          1  44    0 28816K  4160K zfs     1   0:00  0.00% smbd
>   178 root          1  44    0 28816K  4156K zfs     1   0:00  0.00% smbd
> 99950 root          1  44    0 28816K  4208K zfs     1   0:00  0.00% smbd
>   257 jo            1  52    0  8252K  1564K zfs     1   0:00  0.00% ls
> 99929 root          1  44    0 28816K  4204K zfs     1   0:00  0.00% smbd
>   108 root          1  45    0 29152K  4620K zfs     1   0:00  0.00% smbd
>   177 root          1  76    0 17976K  2308K tx->tx  1   0:00  0.00% zfs
>   136 root          1  76    0  2764K  1048K wait    0   0:00  0.00% lockf

I snipped the processes that look harmless; the ones above look
strange to me. All the smbd processes are the result of Windows laptops
trying to access the share that lives on the hanging ZFS dataset.

> #procstat -kk 257
>   PID    TID COMM             TDNAME           KSTACK
>   257 100335 ls               -                mi_switch+0x1c2 sleepq_switch+0xdc sleepq_wait+0x45 __lockmgr_args+0x8e2 vop_stdlock+0x51 VOP_LOCK1_APV+0x55 _vn_lock+0x48 cache_lookup+0x63f vfs_cache_lookup+0xad VOP_LOOKUP_APV+0x53 lookup+0x624 namei+0x597 vn_open_cred+0x340 vn_open+0x1c kern_openat+0x163 kern_open+0x19 open+0x18 syscallenter+0x2fe

> #procstat -kk 136
>   PID    TID COMM             TDNAME           KSTACK
>   136 100264 lockf            -                mi_switch+0x1c2 sleepq_switch+0xdc sleepq_catch_signals+0x57 sleepq_wait_sig+0xc _sleep+0x26e kern_wait+0xeda wait4+0x37 syscallenter+0x2fe syscall+0x41 Xfast_syscall+0xe2
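
To gather the kernel stacks of every process stuck in one of these wait
channels rather than just two, something like the following could be used.
This is only a sketch: the function name zfs_waiters is made up for
illustration, and the list of wait channels is simply copied from the STATE
column of the top output above, so it may well be incomplete.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): read lines of the form
# "PID WCHAN COMMAND" on stdin and print the PID of every process whose
# wait channel is one of the ZFS-related states seen in the top output.
zfs_waiters() {
    awk '$2 ~ /^(zfs|zfsvfs|tx->tx|h->has)$/ { print $1 }'
}

# Intended use on the affected host (assumes FreeBSD ps and procstat):
#   ps -axo pid,wchan,comm | zfs_waiters | xargs -n 1 procstat -kk
```

The filter is kept separate from the ps and procstat calls so that the list
of wait channels can be adjusted in one place.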

This is on:
> FreeBSD XXX 8.2-STABLE FreeBSD 8.2-STABLE #0 r224227: Wed Jul 20 16:55:23 BST 2011     root@XXX:/usr/obj/usr/src/sys/GENERIC  amd64

Any zfs list or similar command hangs as well. Last night's scrub
finished without any errors.

It is likely that the hang occurred during an hourly snapshot (there are
no more log entries about recent snapshots).

Ideas?

Johannes