ZFS, compression, system load, pauses (livelocks?)

Ivan Voras ivoras at freebsd.org
Tue Dec 15 12:39:29 UTC 2009


The context of this post is file servers running FreeBSD 8 and ZFS with 
compressed file systems on low-end hardware, or actually high-end 
hardware on VMWare ESX 3.5 and 4, which kind of makes it low-end as far 
as storage is concerned. The servers are standby backup mirrors of 
production servers - thus many writes, few reads.

Running this setup I notice two things:

1) load averages get very high, even though the only load on these 
systems is file system activity:

last pid:  2270;  load averages: 19.02, 14.58,  9.07    up 0+09:47:03  11:29:04

2) long pauses, at what look like vfs.zfs.txg.timeout-second intervals, 
which seemingly block everything, or at least the entire userland. These 
pauses are sometimes so long that file transfers fail, which must be 
avoided.
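
(The pause interval can be correlated with the current transaction group 
timeout simply by reading the sysctl mentioned above:)

  sysctl vfs.zfs.txg.timeout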

I think these two are connected. Monitoring the system with "top" and 
"iostat" reveals that it is mostly idle between the pauses (data is 
being sent to the server over a gbit network at rates of 15+ MB/s). 
During the pauses there is heavy IO activity, which shows up both in top 
- the spa_zio_* kernel threads (ZFS taskqueues) hog the CPU - and in 
iostat, which immediately after the pause reports several tens of MB 
written to the drives.
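
(Roughly the kind of monitoring I mean - the exact flags are incidental, 
but top needs -S and -H to show the spa_zio_* kernel threads:)

  top -SH
  iostat -x -w 1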

Except for the pauses, this is expected - ZFS is compressing data before 
writing it out. The pauses themselves are the interesting part. 
Immediately after such a pause the system status is similar to this:

91 processes:  12 running, 63 sleeping, 16 waiting
CPU:  1.4% user,  0.0% nice, 96.3% system,  0.3% interrupt,  2.0% idle
Mem: 75M Active, 122M Inact, 419M Wired, 85M Buf, 125M Free

(this is the first "top" output after a pause).

Looking at the list of processes, it appears that a large number of 
kernel and userland processes are woken up at once. On the kernel side 
these regularly include all the g_* threads, but also unrelated threads 
like bufdaemon, softdepflush, etc., and from userland - top, syslog, 
cron, etc. It is as if ZFS livelocks everything else.

The effects of this can be lessened by reducing vfs.zfs.txg.timeout and 
vfs.zfs.vdev.max_pending, and by using the attached patch, which creates 
NCPU ZFS worker threads instead of hardcoding the count to 8. The patch 
will probably also help at the high-end end of the hardware spectrum, 
where 16-core users will finally be able to dedicate them all to ZFS :)
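
For illustration, the sysctl part of the tuning boils down to something 
like the following - the values here are only examples of "lower than 
the defaults", not a recommendation, and if the sysctls are read-only on 
a given system the same names can be set in /boot/loader.conf instead:

  sysctl vfs.zfs.txg.timeout=5
  sysctl vfs.zfs.vdev.max_pending=10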

With these measures I have reduced the pauses to a second or two every 
10 seconds, instead of up to tens of seconds every 30 seconds, which is 
good enough that transfers don't time out, but it could probably be 
better.

Any ideas on the "pauses" issue?


The taskq-thread patch is below. If nobody objects (pjd? I don't know 
how much harder it will make it for you to import future ZFS versions?) 
I will commit it soon.


--- /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	2009-03-29 01:31:42.000000000 +0100
+++ spa.c	2009-12-15 13:36:05.000000000 +0100
@@ -58,15 +58,16 @@
  #include <sys/callb.h>
  #include <sys/sunddi.h>
  #include <sys/spa_boot.h>
+#include <sys/smp.h>

  #include "zfs_prop.h"
  #include "zfs_comutil.h"

-int zio_taskq_threads[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
+static int zio_taskq_threads[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
  	/*	ISSUE	INTR					*/
  	{	1,	1	},	/* ZIO_TYPE_NULL	*/
-	{	1,	8	},	/* ZIO_TYPE_READ	*/
-	{	8,	1	},	/* ZIO_TYPE_WRITE	*/
+	{	1,	-1	},	/* ZIO_TYPE_READ	*/
+	{	-1,	1	},	/* ZIO_TYPE_WRITE	*/
  	{	1,	1	},	/* ZIO_TYPE_FREE	*/
  	{	1,	1	},	/* ZIO_TYPE_CLAIM	*/
  	{	1,	1	},	/* ZIO_TYPE_IOCTL	*/
@@ -498,7 +499,8 @@
  	for (int t = 0; t < ZIO_TYPES; t++) {
  		for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
  			spa->spa_zio_taskq[t][q] = taskq_create("spa_zio",
-			    zio_taskq_threads[t][q], maxclsyspri, 50,
+			    zio_taskq_threads[t][q] == -1 ? mp_ncpus : zio_taskq_threads[t][q],
+			    maxclsyspri, 50,
  			    INT_MAX, TASKQ_PREPOPULATE);
  		}
  	}
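

To try it out, the steps are roughly the following - the patch file name 
is just a placeholder, and the paths assume the usual /usr/src source 
tree (rebuild the whole kernel instead if ZFS is compiled in rather than 
loaded as a module):

  cd /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs
  patch < /path/to/spa-taskq.diff
  cd /usr/src/sys/modules/zfs
  make clean all install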


