Strange ZFS performance

Mikle Krutov nekoexmachina at gmail.com
Mon Apr 5 12:27:09 UTC 2010


On Mon, Apr 05, 2010 at 05:59:47AM -0500, Wes Morgan wrote:
> On Mon, 5 Apr 2010, Mikle Krutov wrote:
> 
> > On Sun, Apr 04, 2010 at 10:08:21PM -0500, Wes Morgan wrote:
> > > On Sun, 4 Apr 2010, Mikle wrote:
> > >
> > > > Hello, list! I've got a strange problem with a one-disk ZFS pool:
> > > > read/write performance for files on the filesystem (dd if=/dev/zero
> > > > of=/mountpoint/file bs=4M count=100) gives me only 2 MB/s, while reading
> > > > from the raw disk (dd if=/dev/disk of=/dev/zero bs=4M count=100) gives me
> > > > ~70MB/s. The pool is about 80% full; the PC with the pool has 2GB of RAM
> > > > (1.5 of which is free); I've done no tuning in loader.conf or sysctl.conf
> > > > for ZFS. In dmesg there are no error messages related to the disk
> > > > (dmesg|grep ^ad12); S.M.A.R.T. looks OK. Some time ago the disk was fine,
> > > > and nothing in software or hardware has changed since then. Any ideas
> > > > what could have happened to the disk?
> > >
> > > Has it ever been close to 100% full? How long has it been 80% full and
> > > what kind of files are on it, size wise?
> > No, it was never full. It has been at about 80% for roughly a week. Most of the files are video files of 200MB - 1.5GB each.
> 
> I'm wondering if your pool is fragmented. What does gstat or iostat -x
> output for the device look like when you're accessing the raw device
> versus going through the filesystem? A very interesting experiment (to me)
> would be to try these things:
> 
> 1) using dd to replicate the disc to another disc, block for block
> 2) zfs send to a newly created, empty pool (could take a while!)
> 
> Then, without rebooting, compare the performance of the "new" pools. For
> #1 you would need to export the pool first and detach the original device
> before importing the duplicate.
> 
> There might be a script out there somewhere to parse the output from zdb
> and turn it into a block map to identify fragmentation, but I'm not aware
> of one. If you did find that was the case, currently the only fix is to
> rebuild the pool.
iostat -x output while cp'ing from one pool to the other:
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad12      18.0   0.0  2302.6     0.0    4 370.0 199
The gstat line for the same copy is:
L(q)  ops/s   r/s  kBps  ms/r    w/s   kBps   ms/w   %busy Name
3     22     22   2814   69.0      0      0    0.0   71.7| gpt/pool2
For dd (performance is now poor there as well):
L(q)  ops/s  r/s  kBps   ms/r    w/s   kBps   ms/w   %busy Name
1     99     99  12658   14.2      0      0    0.0  140.4| gpt/pool2
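
(For reference, the numbers above are plain iostat/gstat output, gathered with something like the following; adjust the device name and filter for your setup:)

# per-device stats, sampled every second
iostat -x -w 1 ad12
# GEOM-level stats, filtered to the pool's provider
gstat -f 'gpt/pool2'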

Unfortunately, I have no free HDD of the same size, so the experiment will have to wait a while.
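
(For when a spare does turn up, I read the two suggested experiments as roughly the following; the spare showing up as ad14 and the names pool2/newpool are only placeholders:)

# 1) block-for-block clone, then swap pools without rebooting
zpool export pool2
dd if=/dev/ad12 of=/dev/ad14 bs=1m
# physically detach the original ad12, then import the clone
zpool import pool2

# 2) replicate into a freshly created, empty pool on the spare
zpool create newpool /dev/ad14
zfs snapshot pool2@copy
zfs send pool2@copy | zfs receive newpool/copy
# (zfs send -R pool2@copy would also carry child filesystems/snapshots, if this ZFS version supports it)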

Also, the ZFS FAQ from Sun tells me:
>Q: Are ZFS file systems shrinkable? How about fragmentation? Any need to defrag them?
>A: <...> The allocation algorithms are such that defragmentation is not an issue. 
Is that just marketing crap?


P.S. There was a mailing-list issue and we ended up with a second thread (linked below):

Also, I forgot to post the output of atacontrol cap ad12 to that thread, so here it is:
Protocol              SATA revision 2.x
device model          WDC WD10EADS-00M2B0
serial number         WD-WMAV50024981
firmware revision     01.00A01
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       1953525168 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value           Vendor
write cache                    yes  yes
read ahead                     yes  yes
Native Command Queuing (NCQ)   yes   -  31/0x1F
Tagged Command Queuing (TCQ)   no   no  31/0x1F
SMART                          yes  yes
microcode download             yes  yes
security                       yes  no
power management               yes  yes
advanced power management      no   no  0/0x00
automatic acoustic management  yes  no  254/0xFE    128/0x80
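
(One thing 'cap' does not show is the currently negotiated transfer mode; if the controller ever falls back to PIO or an old UDMA mode, raw throughput drops sharply. That can be checked with:)

# print the current ATA transfer mode for ad12
atacontrol mode ad12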

http://permalink.gmane.org/gmane.os.freebsd.devel.file-systems/8876
>On Mon, Apr 05, 2010 at 12:30:59AM -0700, Jeremy Chadwick wrote:
>> I'm not sure why this mail didn't make it to the mailing list (I do see
>> it CC'd).  The attachments are included inline.
>> 
>> SMART stats for the disk look fine, so the disk is unlikely to be
>> responsible for this issue.  OP, could you also please provide the
>> output of "atacontrol cap ad12"?
>> 
>> The arcstats entry that interested me the most was this (prior to the
>> reboot):
>> 
>> > kstat.zfs.misc.arcstats.memory_throttle_count: 39958287
>> 
>> The box probably needs tuning in /boot/loader.conf to relieve this
>> problem.
>> 
>> Below are values I've been using on our production systems for a month
>> or two now.  These are for machines with 8GB RAM installed.  The OP may
>> need to adjust the first two parameters (I tend to go with RAM/2 for
>> vm.kmem_size and then subtract a bit more for arc_max (in this case
>> 512MB less than kmem_size)).
>> 
>> # Increase vm.kmem_size to allow for ZFS ARC to utilise more memory.
>> vm.kmem_size="4096M"
>> vfs.zfs.arc_max="3584M"
>> 
>> # Disable ZFS prefetching
>> # http://southbrain.com/south/2008/04/the-nightmare-comes-slowly-zfs.html
>> # Increases overall speed of ZFS, but when disk flushing/writes occur,
>> # system is less responsive (due to extreme disk I/O).
>> # NOTE: 8.0-RC1 disables this by default on systems <= 4GB RAM anyway
>> # NOTE: System has 8GB of RAM, so prefetch would be enabled by default.
>> vfs.zfs.prefetch_disable="1"
>> 
>> # Decrease ZFS txg timeout value from 30 (default) to 5 seconds.  This
>> # should increase throughput and decrease the "bursty" stalls that
>> # happen during immense I/O with ZFS.
>> # http://lists.freebsd.org/pipermail/freebsd-fs/2009-December/007343.html
>> # http://lists.freebsd.org/pipermail/freebsd-fs/2009-December/007355.html
>> vfs.zfs.txg.timeout="5"
>I've tried that tuning; I now have:
>vm.kmem_size="1024M"
>vfs.zfs.arc_max="512M"
>vfs.zfs.txg.timeout="5"
>No change in performance. Also, reading directly from the HDD is now slow too (22-30MB/s),
>which suggests this could be a hardware problem (the SATA controller? but then the other
>disks would be in the same situation too; I also thought it could be the SATA cable, and
>changed it, but that brought no speed improvement either).
>Additional information for dd:
>dd if=/dev/zero of=./file bs=4M count=10
>41943040 bytes transferred in 0.039295 secs (1067389864 bytes/sec)
>
>dd if=/dev/zero of=./file bs=4M count=20
>83886080 bytes transferred in 0.076702 secs (1093663943 bytes/sec)
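
(A note on those last two figures: 40-80 MB written at ~1 GB/s is far more than this disk can do, so those writes are almost certainly landing in RAM, in the dirty data for the next txg, rather than on the platters. A write test needs to be well past RAM size to show the real disk rate, and it is worth watching whether the throttle counter keeps climbing after the tuning; something like:)

# write a ~4 GB scratch file (the box has 2 GB of RAM) so the txg buffer cannot hide it
dd if=/dev/zero of=./bigfile bs=4M count=1000
# check whether the ARC is still being throttled
sysctl kstat.zfs.misc.arcstats.memory_throttle_count
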
-- 
Wbr,
Krutov Mikle

