ZFS performance of various vdevs (long post)
Bradley W. Dutton
brad at duttonbros.com
Mon Jun 7 22:59:16 UTC 2010
Hi,
I just upgraded a 5x500 raidz (no NCQ) array to an 8x2tb raidz2 (NCQ)
array. In the process I was expecting my new setup to absolutely tear
through data due to having faster and additional drives. While the new
setup is considerably faster than the old, some of the throughput
rates weren't as high as I was expecting. I was hoping I could get
some help to understand how ZFS is working or possibly identify some
bottlenecks. My goal is to have ZFS on FreeBSD be the best it can.
Below are benchmarks of the old 5 drive array (normal/raidz1/raidz2)
and raidz2 of the new 8 drive array. As I'm using the new array I
can't reformat it to test the other vdev types.
Sorry in advance if this format is hard to read. Let me know if I
omitted any key information. I did several runs of each of these
commands and the results were in range of each other enough that I
didn't think any numbers were out of line due to caching.
The PC I'm using to test:
FreeBSD backup 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #0: Mon May 24
18:45:38 PDT 2010 root at backup:/usr/obj/usr/src/sys/BACKUP amd64
AMD Athlon X2 5600
4gigs of RAM
5 SATA drives are Western Digital RE2 (7200rpm) using the on-board
controller (Nvidia nForce 570 SLI MCP, no NCQ):
WD5001ABYS (3 of these)
WD5000YS (2 of these)
Supermicro AOC-USAS-L8i PCI Express x8 controller (with NCQ):
8 Hitachi 2TB 7200rpm drives
Relevant /boot/loader.conf settings:
vm.kmem_size="3G"
vfs.zfs.arc_max="2100M"
vfs.zfs.arc_meta_limit="700M"
vfs.zfs.prefetch_disable="0"
My CPU metrics aren't anything official, just me monitoring top while
these commands run. I mostly kept track of CPU to see if any
processes were CPU bound. These are a percentage of total CPU time on
the box, so 50% would be one core maxed out.
Changing the dd blocksize didn't seem to affect anything, so I left it
at 1M. Also, if the machine had been running for a while with various
items cached in the ARC, the speeds could be much lower, by as much as
half: the first ZFS benchmark was half as fast as the numbers below on
a warm box (up for several days), so I rebooted to get max speed. The
faster numbers weren't due to the data being cached; gstat showed the
higher throughput happening at the disks. Instead of 30Mbytes/sec I
would see 60 or 70.
The RE2 drives do between 70-80Mbytes/sec sequential reading/writing:
#!/bin/sh
for disk in "ad4" "ad6" "ad10" "ad12" "ad14"
do
        dd if=/dev/${disk} of=/dev/null bs=1m count=4000 &
done
wait    # let all five dd's finish before the script exits
4194304000 bytes transferred in 49.603534 secs (84556556 bytes/sec)
4194304000 bytes transferred in 51.679365 secs (81160130 bytes/sec)
4194304000 bytes transferred in 52.642995 secs (79674494 bytes/sec)
4194304000 bytes transferred in 57.742892 secs (72637581 bytes/sec)
4194304000 bytes transferred in 58.189738 secs (72079789 bytes/sec)
CPU usage is low when doing these 5 reads, <10%
The Hitachi drives do 120-130Mbytes/sec sequential read/write:
#!/bin/sh
for disk in "da0" "da1" "da2" "da3" "da4" "da5" "da6" "da7"
do
        dd if=/dev/${disk} of=/dev/null bs=1m count=4000 &
done
wait    # let all eight dd's finish before the script exits
4194304000 bytes transferred in 31.980469 secs (131152048 bytes/sec)
4194304000 bytes transferred in 32.349440 secs (129656155 bytes/sec)
4194304000 bytes transferred in 32.776024 secs (127968664 bytes/sec)
4194304000 bytes transferred in 32.951440 secs (127287427 bytes/sec)
4194304000 bytes transferred in 33.048651 secs (126913017 bytes/sec)
4194304000 bytes transferred in 33.057686 secs (126878331 bytes/sec)
4194304000 bytes transferred in 33.374149 secs (125675234 bytes/sec)
4194304000 bytes transferred in 35.226584 secs (119066441 bytes/sec)
CPU usage is around 25-30%
Now on to the ZFS benchmarks:
#
# a regular ZFS pool for the 5 drive array
#
zpool create bench /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14
dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 39.687730 secs (317047913 bytes/sec)
30-35% CPU
All 5 drives are written to so we have:
317/5 = ~63Mbytes/sec
This is close to 70Mbytes/sec so I'm OK with these numbers. I'm not
sure how much overhead checksumming adds; could that account for the
remaining gap?
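The per-disk arithmetic above can be scripted; here's a small sketch
(the pool rate is the one dd reported above, divided using decimal
megabytes to match dd's own reporting):

```shell
#!/bin/sh
# Rough per-disk rate for the plain 5-disk pool: dd's bytes/sec divided
# by the number of disks carrying data (decimal MB, as dd reports).
pool_bps=317047913   # from the dd write above
data_disks=5
awk -v b="$pool_bps" -v d="$data_disks" \
    'BEGIN { printf "%.0f Mbytes/sec per disk\n", b / d / 1000000 }'
```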
dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 34.668165 secs (362952928 bytes/sec)
around 30% CPU
All 5 drives are read from so we have:
362/5 = ~72Mbytes/sec
This seems to be max speed considering the slowest drives in the pool
run at this speed.
#
# a ZFS raidz pool for the 5 drive array
#
zpool destroy bench
zpool create bench raidz /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14
dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 54.357053 secs (231486281 bytes/sec)
CPU varied widely, between 30 and 70%, kernel process using most, then dd
Only 4 of the 5 drives are writing actual data, correct? So we have:
231/4 = ~58Mbytes/sec (this seems to be similar to gstat)
We are getting a bit slower here from our reference 70Mbytes/sec and
compared to 63 in the regular vdev.
dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 45.825533 secs (274582993 bytes/sec)
around 40% CPU, kernel then dd using the most CPU
Again, only 4 of 5 carry data, so the throughput is:
274/4 = ~68Mbytes/sec (looks to be similar to gstat)
This is good and close to max speed.
#
# a ZFS raidz2 pool for the 5 drive array
#
zpool destroy bench
zpool create bench raidz2 /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14
dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 97.491160 secs (129067210 bytes/sec)
CPU varied a lot 15-50%, a burst or two to 75%
Only 3 of the 5 drives are writing actual data, correct? So we have:
129/3 = ~43Mbytes/sec (gstat was varying quite a bit here, as low as
5, as high as 60)
These speeds are now quite a bit lower than I would expect. Is the
parity calculation overhead causing the discrepancy here? Is the CPU
too slow?
dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 58.947959 secs (213457976 bytes/sec)
around 30% CPU
Only 3 of 5 drives hold data and I'm not sure how to calculate
throughput. I'm guessing the round-robin reads help boost these
numbers (read 3 data disks + 1 parity, so only 4 of 5 drives are in
use for any given read?). gstat shows rates around 40Mbytes/sec even
though I would expect closer to 60-70.
213/3 = ~71Mbytes/sec (although I don't think we can do the
calculation this way)
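One back-of-envelope sanity check (my assumption, not measured: a
sequential read only has to touch the 3 data columns, and is capped by
the slowest ~70Mbytes/sec drive from the raw tests above):

```shell
#!/bin/sh
# Hypothetical raidz2 read ceiling: data disks x slowest drive rate.
# 70 Mbytes/sec is taken from the raw RE2 reads earlier in this post.
slowest=70
data_disks=3
echo "$((data_disks * slowest)) Mbytes/sec ceiling"
```

That gives 210, which is at least in the neighborhood of the observed
213Mbytes/sec.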
#
# ZFS raidz2 pool on the 8 drive array
# this pool is about 15% used so the read/write tests aren't necessarily
# on the fastest part of the disks.
#
zpool create tank raidz2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 \
    /dev/da5 /dev/da6 /dev/da7
dd if=/dev/zero of=/tank/test.file bs=1m count=12000
12582912000 bytes transferred in 40.878876 secs (307809638 bytes/sec)
varying 40-70% CPU (a few bursts into the 90s), kernel then dd using
most of it
307/6 = ~51Mbytes/sec (gstat varied quite a bit, 20-80; it seems to
average in the 50s, as dd reported)
Per disk this isn't much faster than the old array, 51 compared to 43.
With a few bursts to 95% CPU it seems as though some of this could be
CPU bound.
dd if=/tank/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 32.911291 secs (382328118 bytes/sec)
around 55% CPU, mostly kernel then dd
Similar to raidz2 test above, I don't think we can calculate
throughput this way. In any case, this is actually slower per disk
than the old array.
382/6 = ~64Mbytes/sec (gstat seemed to be around 50 so I'm guessing
the round robin reading is creating more throughput)
#
# wrap up
#
So the normal vdev performs closest to raw drive speeds. Raidz1 is
slower and raidz2 even more so. This is observable both in the dd
tests and in gstat. Any ideas why the raidz numbers are slower? I've
tried to account for the fact that the raid vdevs have fewer data
disks. Would a faster CPU help here?
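To make the comparison concrete, here's a little script normalizing
each dd write result to a per-data-disk rate (the byte counts are dd's
numbers from above; the data-disk counts assume raidz gives up 1 disk
to parity and raidz2 gives up 2):

```shell
#!/bin/sh
# Per-data-disk write rates for each pool layout tested above
# (decimal MB, matching dd's reporting).
normalize() {   # args: label, dd bytes/sec, number of data disks
    awk -v l="$1" -v b="$2" -v d="$3" \
        'BEGIN { printf "%-16s %3.0f Mbytes/sec per data disk\n", l, b / d / 1000000 }'
}
normalize "5-disk plain"  317047913 5
normalize "5-disk raidz"  231486281 4
normalize "5-disk raidz2" 129067210 3
normalize "8-disk raidz2" 307809638 6
```

That prints 63, 58, 43 and 51: the per-disk rate drops as the parity
level goes up, which is the pattern I'm asking about.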
Unfortunately I've already migrated all of my data to the new array,
so I can't run the full set of tests on it. It would have been nice to
see whether a normal (non-raid) pool on these disks would have come
close to their max speed of 120-130Mbytes/sec (giving a total pool
throughput close to 1Gbyte/sec), as the smaller array did relative to
its max speed.
I noticed scrubbing the big array is CPU bound: the kernel process is
at 99% while it runs (total CPU is 50%, as the scrub doesn't use
multiple threads/processes). The disks run at around 45-50Mbytes/sec
in gstat. Scrubbing the smaller/slower array isn't CPU bound and the
disks run at close to max speed.
Thanks for time,
Brad