current zfs tuning in RELENG_7 (AMD64) suggestions ?

Adam McDougall mcdouga9 at egr.msu.edu
Sat May 2 00:05:51 UTC 2009


On Fri, May 01, 2009 at 04:42:09PM -0400, Mike Tancsa wrote:

  I gave the AMD64 version of 7.2-RC2 a spin and it all installed as
  expected off the DVD.
  
  INTEL S3200SHV MB, Core2Duo, 4G of RAM
  
<snip>

  The writes are all within the normal variance of the tests except for
  b).  Is there anything else that should be tuned?  Not that I'm
  looking for any "magic bullets", but I just want to run this backup
  server as well as possible.
  
                 -------Sequential Output-------- ---Sequential Input-- --Random--
                 -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
             MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
  a        5000  98772 54.7 153111 31.5 100015 21.1 178730 85.3 368782 32.5 161.6  0.6
  b        5000 101271 57.9 154765 31.5  61325 13.9 176741 84.6 372477 32.8 149.3  0.6
  c        5000 102331 57.1 159559 29.5 105767 17.4 144410 63.8 299317 19.9 167.9  0.6
  d        5000 107308 58.6 175004 32.4 117926 18.8 143657 63.4 305126 20.0 167.6  0.6


<begin brain dump>
  
You might want to try running gstat -I 100000 during the test to see
how fast each drive drains the write cache from RAM and whether any
disks are slower than the others.  I've found that some cards or slots
cause drives to perform slower than other drives in the system, dragging
the performance of the raid down to the slowest drive(s).  Testing the
drives individually, outside the raid, might reveal something too, if
only to establish the maximum sequential speed of a single drive so you
know that 4x that speed is the best you can hope for in the raid tests.
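
For example, in a second terminal while the dd test runs (100000 is the
refresh interval in microseconds, as above):

  gstat -I 100000

and, for a single-drive baseline, a raw sequential read of one of the
ad1-ad4 devices listed below (read-only, so safe on a live disk; adjust
the device name to your system):

  dd if=/dev/ad1 of=/dev/null bs=1m count=4000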

ZFS tends to cache heavily at the start of each write, and you will
probably see it bounce between no IO and furious writes until the RAM
cache fills up and it has no choice but to write almost constantly.
This can affect the results between runs.  I would recommend a count=
large enough that each test run lasts at least 30-60 seconds.
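
Going by the ~200 MB/sec figures below, something roughly like this
(about 10 GB written) should keep the pool busy for 50 seconds or so;
adjust count= to taste:

  dd if=/dev/zero of=/tank1/test bs=2048k count=5000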

Additionally, try other zfs raid types such as mirror and stripe to see
whether raidz is acting as an unexpectedly large bottleneck; I've found
its serial write speed usually leaves something to be desired.  Even if
the other raid levels won't work realistically in the long run, it's
useful to raise the bar and find out what extra performance your IO
setup can push.  It could also be useful to compare with gstripe and
graid3 for further hardware performance evaluation.  On the other hand,
if you can read/write data faster than your network connection can
push, you're probably at a workable level.
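
As a rough sketch, assuming this is still a scratch box (zpool destroy
wipes the pool and everything on it; device names taken from the ad1-ad4
listing below):

  zpool destroy tank1
  zpool create tank1 ad1 ad2 ad3 ad4                   # plain stripe
  zpool destroy tank1
  zpool create tank1 mirror ad1 ad2 mirror ad3 ad4     # two 2-way mirrors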

Also, I believe zfs uses a record size of up to 128k (queueing multiple
writes if it can, depending on the disk subsystem), so I think the
computer has to do extra work if you give it bs=2048k, since zfs will
have to cut that into 16 pieces before spreading them across the drives.
You might try bs=512k or bs=128k, for example, to see if this has a
positive effect.  In a traditional raid5 setup, I've found I get by far
the best performance when my bs= matches the raid stripe size multiplied
by the number of drives, and this gets weird when you have an odd number
of drives because your optimum write size might be something like 768k,
which probably no application is going to produce :)  It also makes it
hard to optimize UFS for a larger stripe size when cluster sizes are
generally limited to 16k, as on Solaris.
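
You can check what the dataset is actually using, then rerun the dd
with smaller block sizes; keeping the total amount written about the
same between runs makes the comparison fairer (counts here are sized
for roughly 10 GB, assuming the tank1 pool from the commands below):

  zfs get recordsize tank1
  dd if=/dev/zero of=/tank1/test bs=512k count=20000
  dd if=/dev/zero of=/tank1/test bs=128k count=80000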

  Results tend to fluctuate a bit.
  
  offsitetmp# dd if=/dev/zero of=/tank1/test bs=2048k count=1000
  1000+0 records in
  1000+0 records out
  2097152000 bytes transferred in 10.016818 secs (209363092 bytes/sec)
  offsitetmp#
  offsitetmp# dd if=/dev/zero of=/tank1/test bs=2048k count=1000
  1000+0 records in
  1000+0 records out
  2097152000 bytes transferred in 10.733547 secs (195382943 bytes/sec)
  offsitetmp#
  
  Drives are raidz
  
  ad1: 1430799MB <Seagate ST31500341AS CC1H> at ata3-master SATA300
  ad2: 1430799MB <Seagate ST31500341AS CC1H> at ata4-master SATA300
  ad3: 1430799MB <Seagate ST31500341AS CC1H> at ata5-master SATA300
  ad4: 1430799MB <Seagate ST31500341AS CC1H> at ata6-master SATA300
  
  on ich9
  
           ---Mike
  
  
  
  --------------------------------------------------------------------
  Mike Tancsa,                                      tel +1 519 651 3400
  Sentex Communications,                            mike at sentex.net
  Providing Internet since 1994                    www.sentex.net
  Cambridge, Ontario Canada                         www.sentex.net/mike
  