ZFS performance of various vdevs (long post)

Bradley W. Dutton brad at duttonbros.com
Mon Jun 7 22:59:16 UTC 2010


Hi,

I just upgraded a 5x500GB raidz (no NCQ) array to an 8x2TB raidz2 (NCQ)  
array. I was expecting the new setup to absolutely tear through data,  
given the faster drives and the extra spindles. While the new setup is  
considerably faster than the old one, some of the throughput rates  
weren't as high as I was expecting. I was hoping I could get some help  
understanding how ZFS is working, or possibly identify some bottlenecks.  
My goal is to have ZFS on FreeBSD be the best it can be.

Below are benchmarks of the old 5-drive array (normal/raidz1/raidz2)  
and a raidz2 on the new 8-drive array. Since the new array is already  
in use, I can't reformat it to test the other vdev types.

Sorry in advance if this format is hard to read. Let me know if I  
omitted any key information. I did several runs of each of these  
commands and the results were close enough to each other that I don't  
think caching skewed any of the numbers.
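
For anyone wanting to repeat these without rebooting between runs: I  
believe exporting and re-importing a pool evicts its cached data from  
the ARC, so something like this should give a cold-cache run:

zpool export bench
zpool import bench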


The PC I'm using to test:
FreeBSD backup 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #0: Mon May 24  
18:45:38 PDT 2010     root at backup:/usr/obj/usr/src/sys/BACKUP  amd64
AMD Athlon X2 5600
4 GB of RAM
5 SATA drives are Western Digital RE2 (7200rpm) using the on-board  
controller (Nvidia nForce 570 SLI MCP, no NCQ):
WD5001ABYS (3 of these)
WD5000YS (2 of these)

Supermicro AOC-USAS-L8i PCI Express x8 controller (with NCQ):
8 Hitachi 2TB 7200rpm drives


Relevant /boot/loader.conf settings:
vm.kmem_size="3G"
vfs.zfs.arc_max="2100M"
vfs.zfs.arc_meta_limit="700M"
vfs.zfs.prefetch_disable="0"
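
For reference, the ARC limits can be checked at runtime to confirm they  
are taking effect; these are the sysctl names I see on 8.x (adjust if  
yours differ):

sysctl vfs.zfs.arc_max
sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.c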

My CPU metrics aren't anything official, just me monitoring top while  
these commands are running. I mostly kept track of CPU to see if any  
processes were CPU bound. These are a percentage of total CPU time on  
the box, so 50% would be one core maxed out.

Changing the dd blocksize didn't seem to affect anything, so I left it  
at 1M. Also, if the machine had been running for a while and had  
various items cached in the ARC, the speeds could drop by as much as  
half. The first ZFS benchmark run was half as fast as the numbers below  
on a warm box (up for several days), so I rebooted to get maximum  
speed. The faster numbers weren't due to the data being cached: gstat  
showed correspondingly higher per-disk throughput, 60-70 Mbytes/sec  
instead of 30.
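
The per-disk rates I mention throughout came from watching gstat while  
the dd commands ran, something like this, with the filter regex just  
cutting the output down to the relevant disks:

gstat -f '^ad'
gstat -f '^da[0-7]$'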

The RE2 drives do 70-80 Mbytes/sec sequential read/write:

#!/bin/sh
# sequential read from each raw disk, all in parallel
for disk in "ad4" "ad6" "ad10" "ad12" "ad14"
do
   dd if=/dev/${disk} of=/dev/null bs=1m count=4000 &
done

4194304000 bytes transferred in 49.603534 secs (84556556 bytes/sec)
4194304000 bytes transferred in 51.679365 secs (81160130 bytes/sec)
4194304000 bytes transferred in 52.642995 secs (79674494 bytes/sec)
4194304000 bytes transferred in 57.742892 secs (72637581 bytes/sec)
4194304000 bytes transferred in 58.189738 secs (72079789 bytes/sec)

CPU usage is low when doing these 5 reads, <10%


The Hitachi drives do 120-130Mbytes/sec sequential read/write:

#!/bin/sh
# sequential read from each raw disk, all in parallel
for disk in "da0" "da1" "da2" "da3" "da4" "da5" "da6" "da7"
do
   dd if=/dev/${disk} of=/dev/null bs=1m count=4000 &
done
4194304000 bytes transferred in 31.980469 secs (131152048 bytes/sec)
4194304000 bytes transferred in 32.349440 secs (129656155 bytes/sec)
4194304000 bytes transferred in 32.776024 secs (127968664 bytes/sec)
4194304000 bytes transferred in 32.951440 secs (127287427 bytes/sec)
4194304000 bytes transferred in 33.048651 secs (126913017 bytes/sec)
4194304000 bytes transferred in 33.057686 secs (126878331 bytes/sec)
4194304000 bytes transferred in 33.374149 secs (125675234 bytes/sec)
4194304000 bytes transferred in 35.226584 secs (119066441 bytes/sec)

CPU usage is around 25-30%
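
To double-check that NCQ is really active on the Hitachi drives behind  
the AOC-USAS-L8i, I think camcontrol can show the tag depth in use  
(more than one outstanding command should mean queueing is working):

camcontrol tags da0 -v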


Now on to the ZFS benchmarks:

#
# a regular ZFS pool for the 5 drive array
#
zpool create bench /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14
dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 39.687730 secs (317047913 bytes/sec)
30-35% CPU

All 5 drives are written to so we have:
317/5 = ~63Mbytes/sec
This is close to the 70 Mbytes/sec reference, so I'm OK with these  
numbers. I'm not sure how much overhead the checksumming adds; maybe  
that accounts for the throughput gap here?
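
One way to get a rough handle on the checksum cost might be to rerun  
the write with checksums disabled (and it's probably worth confirming  
compression is off, since the test data is all zeros). Something along  
these lines, restoring the default afterwards:

zfs get compression,checksum bench
zfs set checksum=off bench
dd if=/dev/zero of=/bench/test2.file bs=1m count=12000
zfs inherit checksum bench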

dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 34.668165 secs (362952928 bytes/sec)
around 30% CPU
All 5 drives are read from so we have:
362/5 = ~72Mbytes/sec
This seems to be max speed considering the slowest drives in the pool  
run at this speed.


#
# a ZFS raidz pool for the 5 drive array
#
zpool destroy bench
zpool create bench raidz /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14
dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 54.357053 secs (231486281 bytes/sec)
CPU varied widely, between 30 and 70%, kernel process using most, then dd

Only 4 of the 5 drives are writing actual data, correct? So we have:
231/4 = ~58Mbytes/sec (this seems to be similar to gstat)
That's a bit slower than our 70 Mbytes/sec reference, and than the 63  
from the regular vdev.


dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 45.825533 secs (274582993 bytes/sec)
around 40% CPU, kernel then dd using the most CPU

Again only 4 of the 5 drives hold data, so per-disk throughput is:
274/4 = ~68Mbytes/sec (looks to be similar to gstat)
This is good and close to max speed.


#
# a ZFS raidz2 pool for the 5 drive array
#
zpool destroy bench
zpool create bench raidz2 /dev/ad4 /dev/ad6 /dev/ad10 /dev/ad12 /dev/ad14

dd if=/dev/zero of=/bench/test.file bs=1m count=12000
12582912000 bytes transferred in 97.491160 secs (129067210 bytes/sec)
CPU varied a lot 15-50%, a burst or two to 75%

Only 3 of the 5 drives are writing actual data, correct? So we have:
129/3 = ~43Mbytes/sec (gstat was varying quite a bit here, as low as  
5, as high as 60)
These speeds are now quite a bit lower than I would expect. Is the  
parity calculation overhead causing the discrepancy here? Is the CPU  
too slow?
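
To see whether it really is the parity math, it would probably help to  
watch the individual kernel threads while the write runs and see which  
one is eating the CPU, e.g.:

top -SH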


dd if=/bench/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 58.947959 secs (213457976 bytes/sec)
around 30% CPU

Only 3 of the 5 drives hold data and I'm not sure how to calculate  
throughput. I'm guessing the round-robin reads help boost these numbers  
(read 3 data disks + 1 parity, so only 4 of the 5 drives are in use for  
any given read?). gstat shows rates around 40 Mbytes/sec even though I  
would expect closer to 60-70.
213/3 = ~71Mbytes/sec (although I don't think the calculation really  
works this way)


#
# ZFS raidz2 pool on the 8 drive array
# this pool is about 15% used so the read/write tests aren't necessarily
# on the fastest part of the disks.
#
zpool create tank raidz2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 \
    /dev/da5 /dev/da6 /dev/da7

dd if=/dev/zero of=/tank/test.file bs=1m count=12000
12582912000 bytes transferred in 40.878876 secs (307809638 bytes/sec)
varying 40-70% CPU (a few bursts into the 90s), kernel then dd using  
most of it

307/6 = ~51Mbytes/sec (gstat varied quite a bit, 20-80, but it seems to  
average in the 50s, matching what dd reported)
Per disk this isn't much faster than the old array: 51 compared to 43.  
With a few bursts to 95% CPU it seems as though some of this could be  
CPU bound.


dd if=/tank/test.file of=/dev/null bs=1m
12582912000 bytes transferred in 32.911291 secs (382328118 bytes/sec)
around 55% CPU, mostly kernel then dd

As with the raidz2 read test above, I don't think we can calculate  
throughput this way. In any case, this is actually slower per disk  
than the old array.
382/6 = ~64Mbytes/sec (gstat seemed to be around 50, so I'm guessing  
the round-robin reading is creating the extra throughput)


#
# wrap up
#
So the normal vdev performs closest to raw drive speeds; raidz1 is  
slower and raidz2 even more so. This shows up both in the dd results  
and in gstat. Any ideas why the raid numbers are slower? I've tried to  
account for the fact that the raid vdevs have fewer data disks. Would  
a faster CPU help here?

Unfortunately I migrated all of my data to the new array, so I can't  
run the full set of tests on it. It would have been nice to see whether  
a normal (non-raid) pool on these disks would have come close to their  
max speed of 120-130 Mbytes/sec per disk (giving a total pool  
throughput close to 1 Gbyte/sec), the way the smaller array did  
relative to its max speed.
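
If I ever get a chance to move the data off again, the missing test  
would look like this (hypothetical, and obviously destructive to  
whatever is on the pool):

zpool destroy tank
zpool create tank /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 \
    /dev/da5 /dev/da6 /dev/da7
dd if=/dev/zero of=/tank/test.file bs=1m count=12000
dd if=/tank/test.file of=/dev/null bs=1m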

I noticed scrubbing the big array is CPU bound: the kernel process sits  
at 99% while it runs (total CPU is 50%, since the scrub doesn't use  
multiple threads/processes). The disks run at around 45-50 Mbytes/sec  
in gstat. Scrubbing the smaller/slower array isn't CPU bound and the  
disks run at close to max speed.
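
For the record, this is roughly how I watched the scrubs: kick one off,  
check progress with zpool status, and watch the per-disk rates in gstat:

zpool scrub tank
zpool status -v tank
gstat -f '^da[0-7]$'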

Thanks for your time,
Brad


