8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance

Dan Naumov dan.naumov at gmail.com
Sun Jan 24 17:42:23 UTC 2010

On Sun, Jan 24, 2010 at 7:05 PM, Jason Edwards <sub.mesa at gmail.com> wrote:
> Hi Dan,
> I read on the FreeBSD mailing list that you had some performance issues
> with ZFS. Perhaps I can help you with that.
> You seem to be running a single mirror, which means you won't see any speed
> benefit for writes, and RAID1 implementations usually offer little to no
> acceleration for reads either; some just read from the master disk and
> don't touch the mirrored 'slave' disk except when writing. ZFS is a lot
> more modern, however, although I have not tested the performance of its
> mirror implementation.
> But, benchmarking I/O can be tricky:
> 1) You use bonnie, but bonnie's tests are performed without a 'cooldown'
> period between them, meaning that when test 2 starts, data from test 1
> is still being processed. For single disks and simple I/O this is not so
> bad, but for large write-back buffers and more complex I/O buffering it
> may be inappropriate. I patched bonnie some time ago, but if you
> just want a MB/s number you can use dd for that.
> 2) The tiny diskinfo benchmark is single-queue only, I assume, meaning that
> it would scale poorly or not at all on RAID arrays. Actual filesystems on
> RAID arrays use multiple queues: they do not read one sector at a
> time, but read 8 blocks (of 16KiB) "ahead". This is called read-ahead, and
> for traditional UFS filesystems it's controlled by the vfs.read_max
> sysctl variable. ZFS works differently, but you still need a "real"
> benchmark.
> 3) You need low-latency hardware; in particular, no PCI controller should be
> used. Only PCI-express based controllers or chipset-integrated Serial ATA
> controllers have proper performance. PCI can hurt performance very badly and
> causes high interrupt CPU usage. Generally you should avoid PCI. PCI-express
> is fine though; it's a completely different interface that is in many ways
> the opposite of what PCI was.
> 4) Testing actual realistic I/O performance (in IOps) is very difficult, but
> testing sequential performance should be a lot easier. You may try using dd
> for this.
> For example, you can use dd on raw devices:
> dd if=/dev/ad4 of=/dev/null bs=1M count=1000
> I will explain each parameter:
> if=/dev/ad4 is the input file, the "read source"
> of=/dev/null is the output file, the "write destination". /dev/null means it
> just goes no-where; so this is a read-only benchmark
> bs=1M is the block size: how much data to transfer per operation. The
> default is 512 bytes or the sector size, but that's very slow. A value
> between 64KiB and 1024KiB is appropriate. bs=1M selects 1MiB, i.e. 1024KiB.
> count=1000 means transfer 1000 blocks, and with bs=1M that means
> 1000 * 1MiB = 1000MiB.
> This example was raw reading sequentially from the start of the device
> /dev/ad4. If you want to test RAIDs, you need to work at the filesystem
> level. You can use dd for that too:
> dd if=/dev/zero of=/path/to/ZFS/mount/zerofile.000 bs=1M count=2000
> This command reads from /dev/zero (all zeroes) and writes to a file on the
> ZFS-mounted filesystem: it creates the file "zerofile.000" and writes
> 2000MiB of zeroes to it.
> So this command tests the write performance of the ZFS-mounted filesystem.
> To test read performance, you need to clear the caches first by unmounting
> the filesystem and mounting it again. This frees the memory that holds
> cached parts of the filesystem (reported in top as "Inact(ive)" instead
> of "Free").
> Please make sure you double-check a dd command before running it, and run
> it as a normal user instead of root. A wrong dd command may write to the
> wrong destination and do things you don't want. The only thing you really
> need to check is the write destination (of=....). That's where dd is going
> to write, so make sure it's the target you intended. A common mistake of
> mine was to write dd of=... if=... (starting with of instead of if) and
> thus do the opposite of what I meant to do. This can be disastrous if you
> work with live data, so be careful! ;-)
> I hope some of this was helpful. During the dd benchmark, you can of course
> open a second SSH terminal and start "gstat" to see the devices' current
> I/O stats.
> Kind regards,
> Jason
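
For reference, the write / remount / read sequence described in the quoted
message could be scripted roughly as below. This is only a sketch under
stated assumptions: the dataset name tank/bench, its mountpoint, and the
helper names check_dest, step and bench are made up for illustration, and
it assumes "zfs unmount" / "zfs mount" are used to remount the dataset. By
default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Sketch of the write / remount / read sequence from the quoted message.
# DATASET and MOUNTPOINT are hypothetical placeholders. Nothing is executed
# unless RUN=1 is set; the default is a dry run that only prints commands.

DATASET="${DATASET:-tank/bench}"         # hypothetical ZFS dataset name
MOUNTPOINT="${MOUNTPOINT:-/tank/bench}"  # hypothetical mountpoint of it
COUNT="${COUNT:-2000}"                   # number of 1 MiB blocks to write

# Refuse destinations under /dev so a typo cannot overwrite a raw device.
check_dest() {
    case "$1" in
        /dev/*) echo "refusing to write to device node: $1" >&2; return 1 ;;
        *)      return 0 ;;
    esac
}

# Print the command; execute it only when RUN=1.
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

bench() {
    file="$MOUNTPOINT/zerofile.000"
    check_dest "$file" || return 1
    step dd if=/dev/zero of="$file" bs=1M count="$COUNT"  # write test
    step zfs unmount "$DATASET"                           # drop cached data
    step zfs mount "$DATASET"
    step dd if="$file" of=/dev/null bs=1M                 # read test
}
```

Calling "bench" prints the four commands for review; "RUN=1 bench" executes
them, which needs enough privileges for the zfs commands and will fail at
the unmount step if the dataset is busy.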

Hi and thanks for your tips, I appreciate it :)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 36.206372 secs (29656156 bytes/sec)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 143.878615 secs (29851325 bytes/sec)

This works out to 1 GiB in 36.2 seconds (about 28.3 MiB/s) in the first
test and 4 GiB in 143.9 seconds (about 28.5 MiB/s) in the second, which is
reasonably consistent with the bonnie results. Sadly, it also seems to
confirm the very slow speed :(
The disks are attached to a 4-port SiI 3124 controller and, again, my
Windows benchmarks showing 65 MB/s+ were done on the exact same machine,
with the same disks attached to the same controller. The only difference
was that in Windows the disks weren't in a mirror configuration but were
tested individually. I do understand that a mirror setup offers
roughly the same write speed as an individual disk, while the read speed
usually varies from "equal to individual disk speed" to "nearly the
throughput of both disks combined" depending on the implementation,
but I see no obvious reason why my setup offers both
read and write speeds of roughly 1/3 to 1/2 of what the individual disks
are capable of. Dmesg shows:

atapci0: <SiI 3124 SATA300 controller> port 0x1000-0x100f mem
0x90108000-0x9010807f,0x90100000-0x90107fff irq 21 at device 0.0 on
ad8: 1907729MB <WDC WD20EADS-32R6B0 01.00A01> at ata4-master SATA300
ad10: 1907729MB <WDC WD20EADS-00R6B0 01.00A01> at ata5-master SATA300
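
The MiB/s figures for the two dd runs above can be rechecked from the
bytes/sec values that dd printed. A minimal sketch, assuming awk is
available; the function name to_mib_per_sec is made up for illustration:

```shell
#!/bin/sh
# Convert dd's "bytes/sec" figure to MiB/s (1 MiB = 1048576 bytes).
# to_mib_per_sec is a hypothetical helper name, not part of dd.
to_mib_per_sec() {
    awk -v bps="$1" 'BEGIN { printf "%.2f\n", bps / 1048576 }'
}

to_mib_per_sec 29656156   # first run (1 GiB written), prints 28.28
to_mib_per_sec 29851325   # second run (4 GiB written), prints 28.47
```

Both runs land within about 1% of each other, so the result does not seem
to depend on the amount of data written.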

I also recall testing an alternative configuration in the past,
where I booted off a UFS disk and had the ZFS mirror consist of the
2 disks directly. The bonnie numbers in that case were in line with my
expectations: I was seeing 65-70 MB/s. Note: again, exact same
hardware, exact same disks attached to the exact same controller. As far
as I know, Solaris/OpenSolaris has an issue where it has to
automatically disable the disk cache if ZFS is used on top of partitions
instead of raw disks, but I recall reading from multiple reputable
sources that this issue does not affect FreeBSD.

- Sincerely,
Dan Naumov
