[ZFS] Raid 10 performance issues

Jeremy Chadwick jdc at koitsu.org
Mon Jun 10 11:12:57 UTC 2013


On Mon, Jun 10, 2013 at 10:46:15AM +0200, Pierre Lemazurier wrote:
> I'm including my /boot/loader.conf for more information:
> 
> zfs_load="YES"
> vm.kmem_size="22528M"
> vfs.zfs.arc_min="20480M"
> vfs.zfs.arc_max="20480M"
> vfs.zfs.prefetch_disable="0"
> vfs.zfs.txg.timeout="5"
> vfs.zfs.vdev.max_pending="10"
> vfs.zfs.vdev.min_pending="4"
> vfs.zfs.write_limit_override="0"
> vfs.zfs.no_write_throttle="0"

Please remove these variables:

vm.kmem_size="22528M"
vfs.zfs.arc_min="20480M"

You do not need to set vm.kmem_size any longer (that was addressed long
ago, during the mid-days of stable/8), and you should let the ARC shrink
if need be (my concern here is that limiting the lower end of the ARC
size may be triggering some other portion of FreeBSD's VM or ZFS to
behave oddly; no proof/evidence, just guesswork on my part).

At bare minimum, *definitely* remove the vm.kmem_size setting.

Next, please remove the following variables, as these serve no purpose
(they are the defaults in 9.1-RELEASE):

vfs.zfs.prefetch_disable="0"
vfs.zfs.txg.timeout="5"
vfs.zfs.vdev.max_pending="10"
vfs.zfs.vdev.min_pending="4"
vfs.zfs.write_limit_override="0"
vfs.zfs.no_write_throttle="0"

So, in short, all you should have in your loader.conf is:

zfs_load="YES"
vfs.zfs.arc_max="20480M"
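
If you want to sanity-check that the ones you're removing really are the
defaults on your system before pulling them out, the live values can be
read back with sysctl:

sysctl vfs.zfs.prefetch_disable vfs.zfs.txg.timeout \
    vfs.zfs.vdev.max_pending vfs.zfs.vdev.min_pending \
    vfs.zfs.write_limit_override vfs.zfs.no_write_throttle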

> On 07/06/2013 17:07, Pierre Lemazurier wrote:
> >Hi, I think I am suffering from write and read performance issues on my zpool.
> >
> >About my system and hardware :
> >
> >uname -a
> >FreeBSD bsdnas 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec 4
> >09:23:10 UTC 2012
> >root at farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
> >
> >sysinfo -a : http://www.privatepaste.com/b32f34c938

Going forward, I would recommend also providing "dmesg" output.  It is a
lot easier for most of us to read.

All I can work out is that your storage controller uses mps(4), but I
can't see any of the important details about it.  dmesg would give
those; this weird "sysinfo" output does not.

I would also like to request "pciconf -lvbc" output.
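
Something like this will capture both into files you can paste
(filenames are just examples):

dmesg > /tmp/dmesg.txt
pciconf -lvbc > /tmp/pciconf.txt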

> >- 24 (4gbx6) GB DDR3 ECC :
> >http://www.ec.kingston.com/ecom/configurator_new/partsinfo.asp?ktcpartno=KVR16R11D8/4HC
> >
> >- 14x this drive :
> >http://www.wdc.com/global/products/specs/?driveID=1086&language=1

Worth pointing out for readers:

These are 4096-byte sector 2TB WD Red drives.
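
Since these are 4K-sector drives and your tests reference .nop
providers, it may also be worth confirming the reported sector size and
the ashift the pool ended up with.  Something along these lines (the
pool name "tank" and the disk label are assumptions, substitute your
own):

diskinfo -v /dev/gpt/disk14 | grep sectorsize
zdb -C tank | grep ashift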

> >- server :
> >http://www.supermicro.com/products/system/1u/5017/sys-5017r-wrf.cfm?parts=show
> >
> >- CPU :
> >http://ark.intel.com/fr/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
> >
> >- chassis :
> >http://www.supermicro.com/products/chassis/4u/847/sc847e16-rjbod1.cfm
> >- HBA sas connector :
> >http://www.lsi.com/products/storagecomponents/Pages/LSISAS9200-8e.aspx
> >- Cable between chassis and server :
> >http://www.provantage.com/supermicro-cbl-0166l~7SUPA01R.htm
> >
> >I use this command to test write speed: dd if=/dev/zero of=test.dd
> >bs=2M count=10000
> >I use this command to test read speed: dd if=test.dd of=/dev/null bs=2M
> >count=10000
> >
> >Of course, there is no compression on the ZFS dataset.
> >
> >Test on one of these disks, formatted with UFS:
> >
> >Write:
> >some gstat readings: http://www.privatepaste.com/dd31fafaa6
> >speed around 140 MB/s and something like 1100 IOPS
> >dd result: 20971520000 bytes transferred in 146.722126 secs (142933589
> >bytes/sec)
> >
> >Read:
> >I think I was reading from RAM (20971520000 bytes transferred in
> >8.813298 secs (2379531480 bytes/sec)).
> >Then I ran the test on the whole drive (dd if=/dev/gpt/disk14.nop
> >of=/dev/null bs=2M count=10000)
> >some gstat readings: http://www.privatepaste.com/d022b7c480
> >speed around 140 MB/s again and near 1100+ IOPS
> >dd result: 20971520000 bytes transferred in 142.895212 secs (146761530
> >bytes/sec)

Looks about right for a single WD Red 2TB drive.  Important: THIS IS A
SINGLE DRIVE.

> >ZFS - I made my zpool this way: http://www.privatepaste.com/e74d9cc3b9

Looks good to me.  This is effectively RAID-10 as you said (a stripe of
mirrors).
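
For readers who can't see the paste: a stripe of mirrors just means
listing multiple mirror vdevs in a single "zpool create", along these
lines (pool and device names here are made up, not Pierre's actual
ones):

zpool create tank \
    mirror gpt/disk1 gpt/disk2 \
    mirror gpt/disk3 gpt/disk4 \
    mirror gpt/disk5 gpt/disk6
# ...and so on, for a total of seven 2-disk mirror vdevs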

> >zpool status : http://www.privatepaste.com/0276801ef6
> >zpool get all : http://www.privatepaste.com/74b37a2429
> >zfs get all : http://www.privatepaste.com/e56f4a33f8
> >zfs-stats -a : http://www.privatepaste.com/f017890aa1
> >zdb : http://www.privatepaste.com/7d723c5556
> >
> >With this setup I hoped to have nearly 7x more write speed and nearly
> >14x more read speed than the UFS device alone. To be realistic,
> >something like 850 MB/s for writes and 1700 MB/s for reads.

Your hopes may be shattered by the reality of how controllers behave and
operate (performance-wise) as well as many other things, including some
ZFS tunables.  We shall see.

> >ZFS - tests:
> >
> >Write:
> >gstat readings: http://www.privatepaste.com/7cefb9393a
> >zpool iostat -v 1 of the fastest try: http://www.privatepaste.com/8ade4defbe
> >dd result: 20971520000 bytes transferred in 54.326509 secs (386027381
> >bytes/sec)
> >
> >386 MB/s, less than half of what I expected.

One thing to be aware of: while the dd took 54 seconds, the I/O to the
pool probably continued for long after that.  Your average speed to each
disk at that time was (just estimating it here) ~55MBytes/second.

I would assume what you're seeing above is probably the speed between
/dev/zero and the ZFS ARC, with (of course) the controller and driver in
the way.
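
If you want to see how long the pool keeps writing after dd returns,
leave a "zpool iostat" running and watch until the write column falls
back to zero (pool name assumed to be "tank"):

dd if=/dev/zero of=test.dd bs=2M count=10000
zpool iostat tank 1    # writes continue after dd exits, until the
                       # remaining dirty data has been flushed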

We know that your disks can do about 110-140MBytes/second each, so the
performance hit has got to be in one of the following places:

1. ZFS itself,
2. Controller, controller driver (mps(4)), or controller firmware,
3. On-die MCH (memory controller), or
4. PCIe bus speed limitations or other whatnots.

The place to start is with #1, ZFS.

See the bottom of my mail for advice.

> >Read:
> >I exported and imported the pool to limit the ARC effect. I don't know
> >how to do better; I hope that is sufficient.

You could have checked using "top -b" (before and after export); look
for the "ARC" line.

I tend to just reboot the system, but export should result in a full
pending I/O flush (from ARC, etc.) to all the devices.  I would do this
and wait about 15 seconds + check with gstat before doing more
performance tests.
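
A rough sequence, again assuming the pool is named "tank":

top -b | grep ARC      # note the ARC size
zpool export tank
top -b | grep ARC      # compare the ARC size after the export
gstat                  # wait ~15 seconds for the disks to go idle
zpool import tank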

> >gstat readings: http://www.privatepaste.com/130ce43af1
> >zpool iostat -v 1: http://privatepaste.com/eb5f9d3432
> >dd result: 20971520000 bytes transferred in 30.347214 secs (691052563
> >bytes/sec)
> >690 MB/s, 2.5x less than I expected.
> >
> >
> >It appears not to be a hardware issue: when I do a dd test of each
> >whole disk at the same time with the command dd if=/dev/gpt/diskX
> >of=/dev/null bs=1M count=10000, I get these gstat readings:
> >http://privatepaste.com/df9f63fd4d
> >
> >Near 130 MB/s for each device, about what I expected.

You're thinking of hardware in too simple a fashion -- if only it were
that simple.

> >In your opinion, where does the problem come from?

Not enough information at this time to narrow down where the issue is.

Things to try:

1. Start with the initial loader.conf modifications I stated.  The
vm.kmem_size removal may help.

2. Possibly try vfs.zfs.no_write_throttle="1" in loader.conf, reboot,
and re-do this test.  What that tunable does:

https://blogs.oracle.com/roch/entry/the_new_zfs_write_throttle

You can also Google "vfs.zfs.no_write_throttle" and see that it's been
discussed quite a bit, including some folks saying performance
tremendously increases when they set this to 1.  
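
For the duration of the experiment that is one extra line in
loader.conf (remove it again if it makes no difference):

vfs.zfs.no_write_throttle="1"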

3. Given the massive size of your disk array and how much memory you
have, you may also want to consider adjusting some of these (possibly
increasing vfs.zfs.txg.timeout to make I/O flushing to your disks happen
*less* often; I haven't tinkered with the other two):

vfs.zfs.txg.timeout="5"
vfs.zfs.vdev.max_pending="10"
vfs.zfs.vdev.min_pending="4"

These also come to mind (these are the defaults):

vfs.zfs.write_limit_max="1069071872"
vfs.zfs.write_limit_min="33554432"

sysctl -d will give you descriptions of these.  I have never had to
tune any of these myself, but that's also because the pools I've built
have consisted of much smaller numbers of disks (3 or 4 at most).  I am
also used to ahci(4) and have avoided all other controllers for a
multitude of reasons (not saying that's the cause of your problem here,
just saying that's the stance I've chosen to take).

You might also try limiting your ARC maximum (vfs.zfs.arc_max) to
something smaller -- say, 8GBytes.  See if that has an effect.
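
Purely as a sketch -- the numbers below are illustrative, not values I
have tested -- such an experiment might look like this in loader.conf:

vfs.zfs.txg.timeout="10"    # flush transaction groups half as often as
                            # the default of 5 seconds
vfs.zfs.arc_max="8192M"     # smaller ARC, per the suggestion above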

4. "sysctl -a | grep zfs" is a very useful piece of information that you
should do along with "gstat" and "zpool iostat -v".  The counters and
information shown there are very, very helpful a lot of the time.  There
are particular ones that indicate certain performance-hindering
scenarios.
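
One way to make that data useful is to capture it before and after a
test run so the counters' deltas are visible:

sysctl -a | grep zfs > /tmp/zfs-sysctl-before.txt
dd if=/dev/zero of=test.dd bs=2M count=10000
sysctl -a | grep zfs > /tmp/zfs-sysctl-after.txt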

5. Your "UFS tests" only tested a single disk, while your ZFS tests
tested 14 disks in a RAID-10-like fashion.  You could try reproducing
the RAID-10 setup using gvinum(8) with UFS on top and see what sort of
performance you get there.
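
If gvinum's configuration syntax gets in the way, a gstripe(8) across
gmirror(8) devices gives the same striped-mirror shape for a UFS test.
A rough sketch with made-up device names (this destroys data on the
disks used, so only do it on scratch drives):

gmirror label m0 da1 da2
gmirror label m1 da3 da4
# ...one gmirror per pair of disks, then stripe across the mirrors:
gstripe label st0 mirror/m0 mirror/m1
newfs -U /dev/stripe/st0
mount /dev/stripe/st0 /mnt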

6. Try re-doing the tests with fewer drives involved -- say, 6 instead
of 14.  See if the per-drive throughput increases compared to the
14-drive setup.

In general, "profiling" ZFS like this is tricky and requires folks who
are very much in-the-know about how to go about it.  Others more
familiar with how to do this may need to step up to the plate, but no
support/response is guaranteed (if you need that, try Solaris).

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |


