8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance

Dan Naumov dan.naumov at gmail.com
Sun Jan 24 18:12:31 UTC 2010


On Sun, Jan 24, 2010 at 7:42 PM, Dan Naumov <dan.naumov at gmail.com> wrote:
> On Sun, Jan 24, 2010 at 7:05 PM, Jason Edwards <sub.mesa at gmail.com> wrote:
>> Hi Dan,
>>
>> I read on the FreeBSD mailing list that you had some performance issues
>> with ZFS. Perhaps I can help you with that.
>>
>> You seem to be running a single mirror, which means you won't see any speed
>> benefit on writes, and RAID1 implementations usually offer little to no
>> acceleration for reads either; some even read only from the 'master' disk
>> and don't touch the 'slave' mirrored disk except when writing. ZFS is
>> a lot more modern, however, although I have not tested the performance of
>> its mirror implementation.
>>
>> But, benchmarking I/O can be tricky:
>>
>> 1) You use bonnie, but bonnie's tests are performed without a 'cooldown'
>> period between them, meaning that when test 2 starts, data from test 1
>> is still being processed. For single disks and simple I/O this is not so
>> bad, but with large write-back buffers and more complex I/O buffering it
>> may skew results. I patched bonnie for this some time ago, but if you
>> just want an MB/s number you can use dd for that.
>>
>> 2) The diskinfo tiny benchmark is single-queue only, I assume, meaning that
>> it would not scale well, or at all, on RAID arrays. Actual filesystems on
>> RAID arrays use multiple queues; they do not read one sector at a
>> time, but read e.g. 8 blocks (of 16KiB) "ahead". This is called read-ahead,
>> and for traditional UFS filesystems it's controlled by the sysctl variable
>> vfs.read_max (see the sysctl sketch after this list). ZFS works differently,
>> but you still need a "real" benchmark.
>>
>> 3) You need low-latency hardware; in particular, no PCI controller should be
>> used. Only PCI Express based controllers or chipset-integrated Serial ATA
>> controllers deliver proper performance. PCI can hurt performance very badly
>> and causes high interrupt CPU usage, so you should generally avoid it.
>> PCI Express is fine though; it's a completely different interface that is
>> in many ways the opposite of what PCI was.
>>
>> 4) Testing actual realistic I/O performance (in IOps) is very difficult, but
>> testing sequential performance should be a lot easier. You may try using dd
>> for this.
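>>
>> As a side note to point 2: checking and, if you like, raising the UFS
>> read-ahead is just a sysctl. This is only a sketch; the value 32 is an
>> arbitrary example rather than a tuned recommendation, and it has no
>> effect on ZFS:
>>
>> # show the current read-ahead (in blocks)
>> sysctl vfs.read_max
>> # raise it on the running system; add to /etc/sysctl.conf to persist
>> sysctl vfs.read_max=32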
>>
>>
>> For example, you can use dd on raw devices:
>>
>> dd if=/dev/ad4 of=/dev/null bs=1M count=1000
>>
>> I will explain each parameter:
>>
>> if=/dev/ad4 is the input file, the "read source"
>>
>> of=/dev/null is the output file, the "write destination". /dev/null means
>> the data goes nowhere, so this is a read-only benchmark.
>>
>> bs=1M is the block size, i.e. how much data to transfer at a time. The
>> default is 512 bytes (the sector size), which is very slow. A value between
>> 64KiB and 1024KiB is appropriate; bs=1M selects 1MiB (1024KiB).
>>
>> count=1000 means transfer 1000 blocks, and with bs=1M that means
>> 1000 * 1MiB = 1000MiB.
>>
>>
>>
>> This example was raw reading sequentially from the start of the device
>> /dev/ad4. If you want to test RAIDs, you need to work at the filesystem
>> level. You can use dd for that too:
>>
>> dd if=/dev/zero of=/path/to/ZFS/mount/zerofile.000 bs=1M count=2000
>>
>> This command will read from /dev/zero (all zeroes) and write to a file on
>> the ZFS-mounted filesystem; it will create the file "zerofile.000" and
>> write 2000MiB of zeroes to it.
>> So this command tests the write performance of the ZFS-mounted filesystem.
>> To test read performance, you need to clear the caches first by unmounting
>> that filesystem and re-mounting it again. This frees the memory holding
>> cached parts of the filesystem (reported in top as "Inact(ive)" instead
>> of "Free").
>>
>> Please do double-check a dd command before running it, and run it as a
>> normal user instead of root. A wrong dd command may write to the wrong
>> destination and do things you don't want. The one thing you really need to
>> check is the write destination (of=...). That's where dd is going to write
>> to, so make sure it's the target you intended. A common mistake of my own
>> was to write dd of=... if=... (starting with of instead of if) and thus
>> doing the exact opposite of what I meant to do. This can be disastrous if
>> you work with live data, so be careful! ;-)
>>
>> Hope some of this was helpful. During the dd benchmark you can of course
>> open a second SSH terminal and start "gstat" to watch the devices' current
>> I/O stats.
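>>
>> For instance (just a sketch; the pattern assumes ata(4) device names like
>> your ad8/ad10):
>>
>> gstat -f 'ad[0-9]+$'
>>
>> limits the output to the whole disks, which makes the kBps and %busy
>> columns easier to follow.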
>>
>> Kind regards,
>> Jason
>
> Hi and thanks for your tips, I appreciate it :)
>
> [jago at atombsd ~]$ dd if=/dev/zero of=/home/jago/test1 bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes transferred in 36.206372 secs (29656156 bytes/sec)
>
> [jago at atombsd ~]$ dd if=/dev/zero of=/home/jago/test2 bs=1M count=4096
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes transferred in 143.878615 secs (29851325 bytes/sec)
>
> This works out to 1GB in 36.2 seconds (28.2 MB/s) in the first test and
> 4GB in 143.8 seconds (28.4 MB/s) in the second, which is roughly
> consistent with the bonnie results. Sadly it also seems to confirm the
> very slow speed :( The disks are attached to a 4-port Sil3124 controller
> and, again, my Windows benchmarks showing 65+ MB/s were done on the exact
> same machine, with the same disks attached to the same controller. The
> only difference was that in Windows the disks weren't in a mirror
> configuration but were tested individually. I do understand that a mirror
> setup offers roughly the same write speed as an individual disk, while
> the read speed usually varies from "equal to an individual disk" to
> "nearly the throughput of both disks combined" depending on the
> implementation, but I see no obvious reason why my setup offers both read
> and write speeds roughly 1/3 to 1/2 of what the individual disks are
> capable of. Dmesg shows:
>
> atapci0: <SiI 3124 SATA300 controller> port 0x1000-0x100f mem
> 0x90108000-0x9010807f,0x90100000-0x90107fff irq 21 at device 0.0 on
> pci4
> ad8: 1907729MB <WDC WD20EADS-32R6B0 01.00A01> at ata4-master SATA300
> ad10: 1907729MB <WDC WD20EADS-00R6B0 01.00A01> at ata5-master SATA300
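>
> Following your raw-device example, I suppose the obvious next step (just
> a sketch, and read-only with of=/dev/null, so it should be safe) is to
> benchmark each disk individually under FreeBSD and compare against the
> Windows numbers:
>
> dd if=/dev/ad8 of=/dev/null bs=1M count=1000
> dd if=/dev/ad10 of=/dev/null bs=1M count=1000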
>
> I do recall also testing an alternative configuration in the past, where
> I would boot off a UFS disk and have the ZFS mirror consist of the two
> disks used directly (whole disks). The bonnie numbers in that case were
> in line with my expectations; I was seeing 65-70 MB/s. Note: again, the
> exact same hardware and the exact same disks attached to the exact same
> controller. As far as I know, Solaris/OpenSolaris has an issue where the
> disk cache is automatically disabled if ZFS is used on top of partitions
> instead of raw disks, but to my knowledge (I recall reading this from
> multiple reputable sources) this issue does not affect FreeBSD.
>
> - Sincerely,
> Dan Naumov

To add some additional info: for good measure I decided to check whether
the disk write cache is enabled, and sure enough it is:

[jago at atombsd /var/log]$ sysctl hw.ata
hw.ata.setmax: 0
hw.ata.wc: 1
hw.ata.atapi_dma: 1
hw.ata.ata_dma_check_80pin: 1
hw.ata.ata_dma: 1

Also, if you want to see exactly how the system was built and installed,
here is the build script I used:
http://jago.pp.fi/zfsinst.sh

The reason I (and this script) use MBR partitioning instead of GPT is that
my motherboard cannot reliably boot off GPT, but that should not really be
relevant to the performance issues shown.

- Sincerely,
Dan Naumov

