ZFS hanging

Dennis Glatting freebsd at pki2.com
Mon Jul 9 20:13:17 UTC 2012


I have a ZFS pool on which the system simply stops, as if blocked
forever on some I/O mutex. This happens often; the following is the
output of top:

last pid:  6075;  load averages:  0.00,  0.00,  0.00    up 0+16:54:41  13:04:10
135 processes: 1 running, 134 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 47M Active, 24M Inact, 18G Wired, 120M Buf, 44G Free
Swap: 32G Total, 32G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 2410 root          1  33    0 11992K  2820K zio->i  7 331:25  0.00% bzip2
 2621 root          1  52    4 28640K  5544K tx->tx 24 245:33  0.00% john
 2624 root          1  48    4 28640K  5544K tx->tx  4 239:08  0.00% john
 2623 root          1  49    4 28640K  5544K tx->tx  7 238:44  0.00% john
 2640 root          1  42    4 28640K  5420K tx->tx 23 206:51  0.00% john
 2638 root          1  42    4 28640K  5420K tx->tx 28 206:34  0.00% john
 2639 root          1  42    4 28640K  5420K tx->tx  9 206:30  0.00% john
 2637 root          1  42    4 28640K  5420K tx->tx 18 206:24  0.00% john


This system is presently resilvering a disk, but these stops have
happened before.


iirc#  zpool status disk-1
  pool: disk-1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jul  8 13:07:46 2012
        104G scanned out of 12.4T at 1.73M/s, (scan is slow, no estimated time)
        10.3G resilvered, 0.82% done
config:

	NAME                        STATE     READ WRITE CKSUM
	disk-1                      DEGRADED     0     0     0
	  raidz2-0                  DEGRADED     0     0     0
	    da1                     ONLINE       0     0     0
	    da2                     ONLINE       0     0     0
	    da10                    ONLINE       0     0     0
	    da9                     ONLINE       0     0     0
	    da5                     ONLINE       0     0     0
	    da6                     ONLINE       0     0     0
	    da7                     ONLINE       0     0     0
	    replacing-7             DEGRADED     0     0     0
	      17938531774236227186  UNAVAIL      0     0     0  was /dev/da8
	      da3                   ONLINE       0     0     0  (resilvering)
	    da8                     ONLINE       0     0     0
	    da4                     ONLINE       0     0     0
	logs
	  ada2p1                    ONLINE       0     0     0
	cache
	  ada1                      ONLINE       0     0     0

errors: No known data errors
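
For a sense of scale, here is a rough sketch of what that scan rate would
imply if it held steady. It assumes the G/T/M suffixes in the zpool output
are binary units and that the rate stays constant; neither is guaranteed by
the output, and zpool itself declines to give an estimate. The function name
is only illustrative.

```python
# Rough resilver ETA from the zpool status figures above.
# Assumptions (not confirmed by the output): binary units, constant scan rate.

def resilver_eta_days(total_tib, scanned_gib, rate_mib_per_s):
    """Days left to scan the remainder of the pool at a constant rate."""
    total_mib = total_tib * 1024 * 1024
    scanned_mib = scanned_gib * 1024
    remaining_s = (total_mib - scanned_mib) / rate_mib_per_s
    return remaining_s / 86400  # seconds per day


if __name__ == "__main__":
    # 104G scanned out of 12.4T at 1.73M/s
    days = resilver_eta_days(12.4, 104, 1.73)
    print(f"about {days:.0f} days remaining at the current rate")
```

With the numbers above this works out to roughly 86 days, which is why the
resilver rate itself looks like a symptom rather than normal behaviour.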


This system has dissimilar disks, which I understand should not be a
problem, and the stopping also happened before I started the slow disk
upgrade process.

The disks are served by:

* An LSI 9211 flashed to IT, and
* An LSI 2008 controller on the motherboard also flashed to IT.

The 2008 BIOS and firmware are the most recent from LSI. The motherboard
is a Supermicro H8DG6-F.


My question is: what should I be looking at, and how should I look at it?
There is nothing in the logs or on the console; the system is simply
paused forever, and entering commands gets no response (it's as if
everything is deadlocked).







More information about the freebsd-fs mailing list