kern/169480: [zfs] ZFS stalls on heavy I/O
Jeremy Chadwick
jdc at koitsu.org
Tue Jan 29 19:00:02 UTC 2013
The following reply was made to PR kern/169480; it has been noted by GNATS.
From: Jeremy Chadwick <jdc at koitsu.org>
To: Harry Coin <hgcoin at gmail.com>
Cc: bug-followup at FreeBSD.org, levent.serinol at mynet.com
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Tue, 29 Jan 2013 10:50:28 -0800
Re 1,2: that transfer speed (183MBytes/second) sounds much better/much
more accurate for what's going on. The speed-limiting factors were
certainly a small blocksize (512 bytes) used by dd, and using
/dev/random rather than /dev/zero. I realise you're probably expecting
to see something like 480MBytes/second (4 drives * 120MB/sec), but
that's probably not going to happen on that model of system and with
that CPU.
For example, on my Q9550 system described earlier, I can get about this:
$ dd if=/dev/zero of=testfile bs=64k
^C27148+0 records in
27147+0 records out
1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec)
While "gstat -I500ms" shows each disk going between 60MBytes/sec and
140MBytes/sec. "zpool iostat -v data 1" shows between 120-220MBytes/sec
at the pool level, and showing around 65-110MBytes/sec on a per-disk
level.
Anyway, point being, things are faster with a large bs and from a source
that doesn't churn interrupts. But don't necessarily "pull a Linux" and
start doing things like bs=1m -- as I said before, Linux dd is
different, because the I/O is cached (without --direct), while on
FreeBSD dd is always direct.
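If you want to see the per-syscall overhead in isolation, here's a rough sketch (my own illustration, not something from the PR): writing to /dev/null keeps the disks and the pool out of the picture entirely, so any throughput difference between the two runs is purely the cost of issuing 128x as many tiny writes.

```shell
# Illustration only: compare dd throughput at bs=512 vs bs=64k.
# Both runs move ~100 MBytes; neither touches a disk, so the gap
# you see is per-syscall overhead, not drive speed.
dd if=/dev/zero of=/dev/null bs=512 count=200000 2>&1 | tail -1
dd if=/dev/zero of=/dev/null bs=64k count=1600   2>&1 | tail -1
```

On a real pool the gap is even larger, because each tiny write also costs controller interrupts, as discussed below.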
Re 3: That sounds a bit on the slow side. I would expect those disks,
at least during writes, to do more. If **all** the drives show this
behaviour consistently in gstat, then you know the issue IS NOT with an
individual disk and instead lies elsewhere. That rules out one piece
of the puzzle, and that's good.
Re 5: Did you mean to type 14MBytes/second, not 14mbits/second? If so,
yes, I would agree that's slow. Scrubbing is not necessarily a good way
to "benchmark" disks, but I understand that, for "benchmarking" ZFS,
it's about the best you've got.
Regarding dd'ing and 512 bytes -- as I described to you in my previous
mail:
> This speed will be "bursty" and "sporadic" due to how the ZFS ARC
> works. The interval at which "things are flushed to disk" is based on
> the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
> default to 5 (5 seconds).
This is where your "4 secs or so" magic value comes from. Please do not
change this sysctl/value; keep it at 5.
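To put a rough number on that burstiness (back-of-envelope arithmetic using the ~183 MBytes/sec figure from earlier as an assumed sustained rate -- your actual dirty-data limit may kick in sooner):

```shell
# Sketch, assumed numbers: with vfs.zfs.txg.timeout=5, a sustained
# ~183 MBytes/sec write stream accumulates roughly this much dirty
# data per transaction group before it gets flushed to disk --
# which is why gstat shows write bursts every ~5 seconds.
rate_mb=183      # observed dd throughput, MBytes/sec
txg_timeout=5    # vfs.zfs.txg.timeout default on 9.1-RELEASE
echo "$((rate_mb * txg_timeout)) MBytes per txg flush"
```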
Finally, your vmstat -i output shows something of concern, UNLESS you
did this WHILE you had the dd (doesn't matter what block size) going,
and are using /dev/random or /dev/urandom (same thing on FreeBSD):
> irq20: hpet0 620136 328
> irq259: ahci1 849746 450
These interrupt rates are quite high. hpet0 refers to your event
timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer)
being HPET, and ahci1 refers to your Intel ICH7 AHCI controller.
Basically what's happening here is that you're generating a ton of
interrupts doing dd if=/dev/urandom bs=512. And it makes perfect sense
to me why: because /dev/urandom has to harvest entropy from interrupt
sources (please see random(4) man page), and you're generating a lot of
interrupts to your AHCI controller for each individual 512-byte write.
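For what it's worth, as I understand it the "rate" column vmstat -i prints is just the total count divided by seconds of uptime, so you can sanity-check the numbers yourself. Using the ahci1 figures quoted above:

```shell
# vmstat -i's rate column ~= total interrupts / seconds of uptime.
# Plugging in the ahci1 numbers quoted above:
total=849746   # total ahci1 interrupts since boot
rate=450       # per-second rate vmstat reported
# ...so the box had been up roughly this long at snapshot time:
echo "approx uptime: $((total / rate)) seconds"
```

If those totals were accumulated over only ~half an hour of uptime, that confirms the interrupt load really is as heavy as the rate column suggests.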
When you say "move a video from one dataset to another", please explain
what it is you're moving from and to. Specifically: what filesystems,
and output from "zfs list".
If you're moving a file from a ZFS filesystem to another ZFS filesystem
on the same pool, then please state that. That may help kernel folks
figure out where your issue lies.
At this stage, a kernel developer is going to need to step in and help
you figure out where the actual bottleneck is. This is going to be
difficult, and very likely not possible, while you're using nas4free,
because you will almost certainly be asked to rebuild world/kernel with
some new options, and possibly to include DTrace/CTF support (for
real-time debugging). The situation is tricky.
It would really help if you could remove nas4free from the picture and
instead run stock FreeBSD, because as I said, if the nas4free folks put
kernel tunings or adjusted values in place that stock FreeBSD doesn't
have, those could be harming you.
I can't be of more help here, I'm sorry to say. The good news is that
your disks sound fine. Kernel developers will need to take this up.
P.S. -- I would strongly recommend updating your nas4free forum post
with a link to this conversation in this PR. IMO, the nas4free people
need to step up and take responsibility (and that almost certainly means
talking/working with the FreeBSD folks).
--
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |