kern/169480: [zfs] ZFS stalls on heavy I/O
Jeremy Chadwick
jdc at koitsu.org
Tue Jan 29 19:00:02 UTC 2013
The following reply was made to PR kern/169480; it has been noted by GNATS.
From: Jeremy Chadwick <jdc at koitsu.org>
To: Harry Coin <hgcoin at gmail.com>
Cc: bug-followup at FreeBSD.org, levent.serinol at mynet.com
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Tue, 29 Jan 2013 10:50:28 -0800
Re 1,2: that transfer speed (183MBytes/second) sounds much better/much
more accurate for what's going on. The speed-limiting factors were
certainly a small blocksize (512 bytes) used by dd, and using
/dev/random rather than /dev/zero. I realise you're probably expecting
to see something like 480MBytes/second (4 drives * 120MB/sec), but
that's probably not going to happen on that model of system and with
that CPU.
For example, on my Q9550 system described earlier, I can get about this:
$ dd if=/dev/zero of=testfile bs=64k
^C27148+0 records in
27147+0 records out
1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec)
While "gstat -I500ms" shows each disk going between 60MBytes/sec and
140MBytes/sec. "zpool iostat -v data 1" shows between 120-220MBytes/sec
at the pool level, and showing around 65-110MBytes/sec on a per-disk
level.
Anyway, point being, things are faster with a large bs and from a source
that doesn't churn interrupts. But don't necessarily "pull a Linux" and
start doing things like bs=1m -- as I said before, Linux dd is
different, because the I/O is cached (without --direct), while on
FreeBSD dd is always direct.
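If you want to see the per-syscall overhead in isolation, here's a rough sketch (my own illustration, not something from the PR): writing to /dev/null keeps the disks and the pool out of the picture entirely, so any throughput difference between the two runs is purely the cost of issuing 128x as many tiny writes.

```shell
# Illustration only: compare dd throughput at bs=512 vs bs=64k.
# Both runs move ~100 MBytes; neither touches a disk, so the gap
# you see is per-syscall overhead, not drive speed.
dd if=/dev/zero of=/dev/null bs=512 count=200000 2>&1 | tail -1
dd if=/dev/zero of=/dev/null bs=64k count=1600   2>&1 | tail -1
```

On a real pool the gap is even larger, because each tiny write also costs controller interrupts, as discussed below.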
Re 3: That sounds a bit on the slow side. I would expect those disks,
at least during writes, to do more. If **all** the drives show this
behaviour consistently in gstat, then you know the issue IS NOT with an
individual disk and instead lies elsewhere. That rules out one piece
of the puzzle, and that's good.
Re 5: Did you mean to type 14MBytes/second, not 14mbits/second? If so,
yes, I would agree that's slow. Scrubbing is not necessarily a good way
to "benchmark" disks, but I understand that, for "benchmarking" ZFS,
it's about the best you've got.
Regarding dd'ing and 512 bytes -- as I described to you in my previous
mail:
> This speed will be "bursty" and "sporadic" due to how the ZFS ARC
> works. The interval at which "things are flushed to disk" is based on
> the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
> default to 5 (5 seconds).
This is where your "4 secs or so" magic value comes from. Please do not
change this sysctl/value; keep it at 5.
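To put a rough number on that burstiness (back-of-envelope arithmetic using the ~183 MBytes/sec figure from earlier as an assumed sustained rate -- your actual dirty-data limit may kick in sooner):

```shell
# Sketch, assumed numbers: with vfs.zfs.txg.timeout=5, a sustained
# ~183 MBytes/sec write stream accumulates roughly this much dirty
# data per transaction group before it gets flushed to disk --
# which is why gstat shows write bursts every ~5 seconds.
rate_mb=183      # observed dd throughput, MBytes/sec
txg_timeout=5    # vfs.zfs.txg.timeout default on 9.1-RELEASE
echo "$((rate_mb * txg_timeout)) MBytes per txg flush"
```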
Finally, your vmstat -i output shows something of concern, UNLESS you
did this WHILE you had the dd (doesn't matter what block size) going,
and are using /dev/random or /dev/urandom (same thing on FreeBSD):
> irq20: hpet0 620136 328
> irq259: ahci1 849746 450
These interrupt rates are quite high. hpet0 refers to your event
timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer)
being HPET, and ahci1 refers to your Intel ICH7 AHCI controller.
Basically what's happening here is that you're generating a ton of
interrupts doing dd if=/dev/urandom bs=512. And it makes perfect sense
to me why: because /dev/urandom has to harvest entropy from interrupt
sources (please see random(4) man page), and you're generating a lot of
interrupts to your AHCI controller for each individual 512-byte write.
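For what it's worth, as I understand it the "rate" column vmstat -i prints is just the total count divided by seconds of uptime, so you can sanity-check the numbers yourself. Using the ahci1 figures quoted above:

```shell
# vmstat -i's rate column ~= total interrupts / seconds of uptime.
# Plugging in the ahci1 numbers quoted above:
total=849746   # total ahci1 interrupts since boot
rate=450       # per-second rate vmstat reported
# ...so the box had been up roughly this long at snapshot time:
echo "approx uptime: $((total / rate)) seconds"
```

If those totals were accumulated over only ~half an hour of uptime, that confirms the interrupt load really is as heavy as the rate column suggests.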
When you say "move a video from one dataset to another", please explain
what it is you're moving from and to. Specifically: what filesystems,
and output from "zfs list".
If you're moving a file from a ZFS filesystem to another ZFS filesystem
on the same pool, then please state that. That may help kernel folks
figure out where your issue lies.
At this stage, a kernel developer is going to need to step in and help
you figure out where the actual bottleneck is. This is going to be
difficult, and very likely not possible, while you're using nas4free,
because you will almost certainly be asked to rebuild world/kernel with
some new options, and possibly to include DTrace/CTF support (for
real-time debugging). The situation is tricky.
It would really help if you could remove nas4free from the picture and
instead run stock FreeBSD, because as I said, if the nas4free folks put
kernel tunings or adjusted values in place that stock FreeBSD doesn't
have, those could be harming you.
I can't be of more help here, I'm sorry to say. The good news is that
your disks sound fine. Kernel developers will need to take this up.
P.S. -- I would strongly recommend updating your nas4free forum post
with a link to this conversation in this PR. IMO, the nas4free people
need to step up and take responsibility (and that almost certainly means
talking/working with the FreeBSD folks).
--
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |