ZFS scheduling

Dan Naumov dan.naumov at gmail.com
Sun Apr 25 22:04:01 UTC 2010


>Hi,
>
>I noticed that my system gets very slow when I'm doing some simple but
>intense ZFS operations. For example, when I move about 20 gigabytes of
>data from one dataset to another on the same pool, which is a RAIDZ of
>3 500 GB SATA disks, the operation itself runs fast, but meanwhile
>other things get really slow. E.g. opening an application takes 5 times
>as long as before. Simple operations like 'ls' also stall for some
>seconds, which they never did before. It already improved a lot when I
>switched from RAIDZ to a mirror with only 2 disks. Memory and CPU don't
>seem to be the issue; I have a quad-core CPU and 8 GB RAM.
>
>I can't get rid of the idea that this has something to do with
>scheduling. The system is otherwise absolutely stable and fast.
>Somehow, small I/O operations on ZFS seem to have a very hard time
>getting through while bigger ones are running. Maybe this has something
>to do with tuning?
>
>I know my system information is very incomplete, and there could be a
>lot of causes. But does anybody know if this could be an issue with ZFS
>itself?

Hello

As you mention yourself, your system information is indeed very
incomplete, which makes your problem rather hard to diagnose :)

Scheduling, in the traditional sense, is unlikely to be the cause of
your problems, but here are a few things you could look into:

The first is obviously the pool layout: heavy-duty writing to a pool
consisting of a single raidz vdev is slow (slower than writing to a
mirror, as you already discovered), period; such is the nature of
raidz. Additionally, your problem is magnified by the fact that you
have reads competing with writes, since you are (I assume) reading
from the same pool. One approach to alleviating the problem would be
to use a pool consisting of 2 or more raidz vdevs in a stripe, like
this:

pool
  raidz
    disc1
    disc2
    disc3
  raidz
    disc4
    disc5
    disc6
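
As a rough sketch, assuming six hypothetical disks ada1 through ada6
and a placeholder pool name "tank", such a layout can be created in
one command; zpool automatically stripes writes across the two raidz
vdevs:

  zpool create tank raidz ada1 ada2 ada3 raidz ada4 ada5 ada6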

The second potential cause of your issues is the system wrongly
guesstimating your optimal TXG commit size. ZFS works in such a
fashion that it commits data to disk in chunks; it tries to optimize
how big a chunk it writes at a time by evaluating your pool's I/O
bandwidth over time and the available RAM. The TXG commits happen at
an interval of 5-30 seconds. The worst-case scenario is that if the
system misguesses the optimal TXG size, then under heavy write load
it continues to defer the commit for up to the 30-second timeout, and
when it hits the cap, it frantically commits it ALL at once. This can,
and most likely will, completely starve your read I/O on the pool for
as long as the drives choke while committing the TXG.
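
If you want to check whether this is what's happening, you could watch
per-second pool statistics while a big copy is running (substitute
your own pool name for the placeholder "tank"):

  zpool iostat tank 1

Long stretches of near-idle write activity followed by a burst that
saturates the disks would be consistent with deferred TXG commits.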

If you are on 8.0-RELEASE, you could try playing with the
vfs.zfs.txg.timeout variable in /boot/loader.conf; generally sane
values are 5-30, with 30 being the default. You could also try
adjusting vfs.zfs.vdev.max_pending down from the default of 35 and
see if that helps. AFAIK, 8-STABLE and -HEAD have a
sysctl variable which directly allows you to manually set the
preferred TXG size, and I'm pretty sure I've seen some patches on the
mailing lists to add this functionality to 8.0.
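
For example, a starting point in /boot/loader.conf might look like the
following (the values are illustrative, not a recommendation; you will
want to experiment for your workload):

  # /boot/loader.conf
  vfs.zfs.txg.timeout="5"        # commit TXGs more often, in smaller chunks
  vfs.zfs.vdev.max_pending="10"  # fewer queued I/Os per disk, less read latency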

Hope this helps.


- Sincerely,
Dan Naumov

