ZFS, compression, system load, pauses (livelocks?)
Ivan Voras
ivoras at freebsd.org
Tue Dec 15 12:39:29 UTC 2009
The context of this post is file servers running FreeBSD 8 and ZFS with
compressed file systems on low-end hardware - or actually high-end
hardware on VMware ESX 3.5 and 4, which makes it effectively low-end as
far as storage is concerned. The servers are standby backup mirrors of
production servers - thus many writes, few reads.
Running this setup I notice two things:
1) load averages get very high, though the only usage these systems get
is file system usage:
last pid: 2270; load averages: 19.02, 14.58, 9.07
up 0+09:47:03 11:29:04
2) long pauses, at what look like vfs.zfs.txg.timeout-second intervals,
which seemingly block everything, or at least the entire userland. These
pauses are sometimes so long that file transfers fail, which must be
avoided.
I think these two are connected. Monitoring the system with "top" and
"iostat" reveals that between the pauses the system is mostly idle (data
is being sent to the server over a gbit network at rates of 15+ MB/s).
During the pauses there is heavy IO activity, which shows both in top -
the kernel threads spa_zio_* (ZFS taskqueues) hog the CPU - and in
iostat, which immediately after the pause reveals several tens of MB
written to the drives.
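To put rough numbers on the pauses, something as simple as the following can help: a hypothetical watcher (not part of the original setup) that prints a timestamped load-average sample once per second. The assumption is that during a pause even this loop stalls, so gaps between consecutive timestamps mark the pauses and their length.

```shell
# watch_load N: print N timestamped load-average samples, one per second.
# Gaps longer than ~1s between timestamps reveal the stalls described above.
watch_load() {
    n=${1:-5}
    i=0
    while [ "$i" -lt "$n" ]; do
        # uptime ends in "load average(s): x, y, z" on both FreeBSD and Linux
        printf '%s %s\n' "$(date +%T)" "$(uptime | sed 's/.*load average[s]*: *//')"
        i=$((i + 1))
        sleep 1
    done
}

watch_load 3
```

Running this while a transfer is in progress gives a crude but serviceable timeline of when userland freezes.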
Except for the pause itself, this is expected - ZFS is compressing data
before writing it out. The pauses are the interesting part. Immediately
after such a pause the system status looks like this:
91 processes: 12 running, 63 sleeping, 16 waiting
CPU: 1.4% user, 0.0% nice, 96.3% system, 0.3% interrupt, 2.0% idle
Mem: 75M Active, 122M Inact, 419M Wired, 85M Buf, 125M Free
(this is the first "top" output after a pause).
Looking at the list of processes, it looks like a large number of kernel
and userland processes are woken up at once. From the kernel side there
are regularly all the g_* threads, but also unrelated threads like
bufdaemon, softdepflush, etc., and from userland - top, syslog, cron,
etc. It is as if ZFS livelocks everything else.
The effects of this can be lessened by reducing vfs.zfs.txg.timeout and
vfs.zfs.vdev.max_pending, and by using the attached patch, which creates
NCPU ZFS worker threads instead of hardcoding the count to "8". The
patch will probably also help at the high-end hardware end of the
spectrum, where 16-core users will finally be able to dedicate all their
cores to ZFS :)
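For reference, the tunables mentioned above can be adjusted at runtime with sysctl(8); a minimal sketch follows. The specific values here are illustrative examples only, not the ones used on these servers, and the right numbers will depend on the workload and hardware.

```shell
# Illustrative values only - tune to taste:
# shorten the txg sync interval so each flush has less data to write
sysctl vfs.zfs.txg.timeout=5
# reduce the per-vdev queue depth so a flush hogs the disks less
sysctl vfs.zfs.vdev.max_pending=10

# To make the settings persistent, add to /etc/sysctl.conf:
#   vfs.zfs.txg.timeout=5
#   vfs.zfs.vdev.max_pending=10
```

The trade-off is more frequent but smaller transaction-group flushes, which is exactly what shortens the pauses.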
With these measures I have reduced the pauses to a second or two every
10 seconds, instead of up to tens of seconds every 30 seconds. That is
good enough that transfers don't time out, but it could probably be
better.
Any ideas on the "pauses" issue?
The taskq-thread patch is below. If nobody objects (pjd? I don't know
how much harder it will make importing future ZFS versions for you) I
will commit it soon.
--- /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	2009-03-29 01:31:42.000000000 +0100
+++ spa.c	2009-12-15 13:36:05.000000000 +0100
@@ -58,15 +58,16 @@
#include <sys/callb.h>
#include <sys/sunddi.h>
#include <sys/spa_boot.h>
+#include <sys/smp.h>
#include "zfs_prop.h"
#include "zfs_comutil.h"
-int zio_taskq_threads[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
+static int zio_taskq_threads[ZIO_TYPES][ZIO_TASKQ_TYPES] = {
/* ISSUE INTR */
{ 1, 1 }, /* ZIO_TYPE_NULL */
- { 1, 8 }, /* ZIO_TYPE_READ */
- { 8, 1 }, /* ZIO_TYPE_WRITE */
+ { 1, -1 }, /* ZIO_TYPE_READ */
+ { -1, 1 }, /* ZIO_TYPE_WRITE */
{ 1, 1 }, /* ZIO_TYPE_FREE */
{ 1, 1 }, /* ZIO_TYPE_CLAIM */
{ 1, 1 }, /* ZIO_TYPE_IOCTL */
@@ -498,7 +499,8 @@
for (int t = 0; t < ZIO_TYPES; t++) {
for (int q = 0; q < ZIO_TASKQ_TYPES; q++) {
spa->spa_zio_taskq[t][q] = taskq_create("spa_zio",
- zio_taskq_threads[t][q], maxclsyspri, 50,
+ zio_taskq_threads[t][q] == -1 ? mp_ncpus : zio_taskq_threads[t][q],
+ maxclsyspri, 50,
INT_MAX, TASKQ_PREPOPULATE);
}
}