ZFS: Strange performance issues

Christopher Key cjk32 at cam.ac.uk
Thu Oct 22 13:57:46 UTC 2009


Christopher Key wrote:
> I'm running FreeBSD 7.2 amd64 on a system with 2GB RAM.  I've a zfs pool
> using raidz1 over five 2TB SATA drives connected via a port multiplier
> and a RR2314 card.
>
> I can write to a filesystem on this pool at approx 20MB/s:
>
> # dd if=/dev/urandom of=$FS/testdump bs=1m count=1k
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes transferred in 47.096440 secs (22798790 bytes/sec)
>
> and zpool iostat -v is consistent with this:
>
>                    capacity     operations    bandwidth
> pool             used  avail   read  write   read  write
> --------------  -----  -----  -----  -----  -----  -----
> films           6.37T  2.69T     11    440  53.2K  23.0M
>   raidz1        6.37T  2.69T     11    440  53.2K  23.0M
>     da0             -      -      1    214  88.2K  5.58M
>     da1             -      -      1    209  88.6K  5.58M
>     da2             -      -      1    211  76.0K  5.70M
>     da3             -      -      1    213  88.6K  5.77M
>     da4             -      -      1    213  88.6K  5.71M
> --------------  -----  -----  -----  -----  -----  -----
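>
> (As a quick sanity check on the parity overhead: five drives at roughly
> 5.6-5.8MB/s is about 28MB/s of raw device traffic, and with one column of
> parity per raidz1 stripe roughly 4/5 of that, i.e. ~23MB/s, is payload,
> which matches the dd figure above.)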
>
>
> However, the read behaviour is strange:
>
> # dd if=$FS/testdump of=/dev/null bs=1m count=1k
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes transferred in 40.392055 secs (26582996 bytes/sec)
>
> but here, zpool iostat -v is odd:
>
>                    capacity     operations    bandwidth
> pool             used  avail   read  write   read  write
> --------------  -----  -----  -----  -----  -----  -----
> films           6.37T  2.69T  1.52K      0   194M      0
>   raidz1        6.37T  2.69T  1.52K      0   194M      0
>     da0             -      -    420      0  48.2M      0
>     da1             -      -    274      0  24.0M      0
>     da2             -      -    270      0  24.4M      0
>     da3             -      -    418      0  48.4M      0
>     da4             -      -    418      0  48.4M      0
> --------------  -----  -----  -----  -----  -----  -----
>
> Notice that dd was reading at ~27MB/s, but ZFS is reading from the vdevs
> at ~200MB/s.  Also odd are the reduced read rates for da1 and da2.
>
>
> I'm struggling to understand what's happening to the extra data being
> read.  The most likely scenario seems to be that ZFS is inflating its
> read size, knowing that it won't delay the transfer significantly, and
> hoping to pull in some useful data to the cache.  However, it's either
> failing to cache this data correctly, or the file is highly
> non-contiguous and the extra data read doesn't contain anything useful
> to our read.
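>
> One way to test the prefetch idea (a sketch only; it assumes
> vfs.zfs.prefetch_disable, shown in the sysctl dump below, is writable at
> runtime on 7.2, otherwise it would have to go into /boot/loader.conf and
> be tested after a reboot) would be to repeat the read with file-level
> prefetch turned off, against a file that isn't already sitting in the
> ARC, and watch zpool iostat again:
>
> # sysctl vfs.zfs.prefetch_disable=1
> # dd if=$FS/testdump of=/dev/null bs=1m count=1k
> # zpool iostat -v films 5
> # sysctl vfs.zfs.prefetch_disable=0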
>
> I'm also somewhat surprised by the poor performance of the pool.  From
> memory, when it was first configured (on identical hardware and
> software), I could write at ~130MB/s and read at ~200MB/s.
>
> One conclusion is that the pool is suffering from something akin to
> fragmentation, perhaps with files always being allocated from very small
> blocks.  The vast majority of the data comprises large (~1GB) files
> that are written to one 'import' pool, moved to the main pool, and then
> never modified.  There are, however, a lot (~5000) of small (<1k) files
> that get rewritten half-hourly, and I'm wondering if that might be
> causing problems and confusing ZFS's block sizing algorithm.
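>
> If that's plausible, it might be worth checking whether the large files
> and the half-hourly rewrites share a dataset, and what recordsize is in
> effect on each (only the pool name 'films' is known from the iostat
> output above; treat the commands as illustrative):
>
> # zfs get recordsize films
> # zfs list -o name,recordsize,used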
>
> Can anyone shed any light on what might be going on, or on how to further
> diagnose this problem?  Do any of my tentative conclusions make sense?
>
>
>   
I've restarted the system, and the problems have disappeared.  I can
now write at ~50MB/s and read at ~200MB/s, including from files that
were written before the reboot, when both reading and writing were
slow.  This seems to rule out any problem with the on-disk layout of the
data.  Also, the pool read rate now matches the filesystem read rate (as
opposed to 200MB/s vs. 27MB/s before the reboot).

The only conclusion that I can come to now is that caching was simply
non-existent before the reboot.  Comparing the result of

# sysctl -a | grep zfs

before and after the reboot (see below), the significant differences seem
to be the twentyfold increase in "kstat.zfs.misc.arcstats.p",
"kstat.zfs.misc.arcstats.c" and "kstat.zfs.misc.arcstats.size".  I'm not
sure of the correct interpretation of these, but if they really
indicate a 20MB cache vs. a 400MB cache, I could well believe that
would explain the difference.
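
For what it's worth, my tentative reading (which may be off) is that these
counters are all in bytes: arcstats.c is the ARC's current target size,
arcstats.size is how much is actually cached, and arcstats.p is the target
for the recently-used portion.  On that reading, 416037376 bytes is roughly
397MB, against an arc_max of about 489MB.  The three can be pulled out
directly with:

# sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.p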

I'm not sure what might have caused the reduction in the cache size;
there aren't any significant memory pressures that I'm aware of, unless
there was excessive content in /tmp.
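
If /tmp is memory-backed here (md or tmpfs; that's an assumption, it may
well be an ordinary disk filesystem), its usage and the general memory
picture would be worth capturing the next time the slowdown appears, e.g.:

# df -h /tmp
# mount | grep /tmp
# sysctl vm.stats.vm.v_free_count vm.stats.vm.v_inactive_count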

Any thoughts?

Kind regards,

Christopher Key




Before:

vfs.zfs.arc_min: 21353344
vfs.zfs.arc_max: 512480256
vfs.zfs.mdcomp_disable: 0
vfs.zfs.prefetch_disable: 0
vfs.zfs.zio.taskq_threads: 0
vfs.zfs.recover: 0
vfs.zfs.vdev.cache.size: 10485760
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_disable: 0
vfs.zfs.debug: 0
kstat.zfs.misc.arcstats.hits: 14715277
kstat.zfs.misc.arcstats.misses: 1013356
kstat.zfs.misc.arcstats.demand_data_hits: 10180440
kstat.zfs.misc.arcstats.demand_data_misses: 119214
kstat.zfs.misc.arcstats.demand_metadata_hits: 4026012
kstat.zfs.misc.arcstats.demand_metadata_misses: 141654
kstat.zfs.misc.arcstats.prefetch_data_hits: 116044
kstat.zfs.misc.arcstats.prefetch_data_misses: 745305
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 392781
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 7183
kstat.zfs.misc.arcstats.mru_hits: 2456403
kstat.zfs.misc.arcstats.mru_ghost_hits: 12154
kstat.zfs.misc.arcstats.mfu_hits: 11832908
kstat.zfs.misc.arcstats.mfu_ghost_hits: 84764
kstat.zfs.misc.arcstats.deleted: 1853402
kstat.zfs.misc.arcstats.recycle_miss: 262586
kstat.zfs.misc.arcstats.mutex_miss: 1180
kstat.zfs.misc.arcstats.evict_skip: 20216195
kstat.zfs.misc.arcstats.hash_elements: 6645
kstat.zfs.misc.arcstats.hash_elements_max: 34534
kstat.zfs.misc.arcstats.hash_collisions: 588095
kstat.zfs.misc.arcstats.hash_chains: 715
kstat.zfs.misc.arcstats.hash_chain_max: 7
kstat.zfs.misc.arcstats.p: 412187380
kstat.zfs.misc.arcstats.c: 415807010
kstat.zfs.misc.arcstats.c_min: 21353344
kstat.zfs.misc.arcstats.c_max: 512480256
kstat.zfs.misc.arcstats.size: 416037376


After:

vfs.zfs.arc_min: 21353344
vfs.zfs.arc_max: 512480256
vfs.zfs.mdcomp_disable: 0
vfs.zfs.prefetch_disable: 0
vfs.zfs.zio.taskq_threads: 0
vfs.zfs.recover: 0
vfs.zfs.vdev.cache.size: 10485760
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zil_disable: 0
vfs.zfs.debug: 0
kstat.zfs.misc.arcstats.hits: 14715277
kstat.zfs.misc.arcstats.misses: 1013356
kstat.zfs.misc.arcstats.demand_data_hits: 10180440
kstat.zfs.misc.arcstats.demand_data_misses: 119214
kstat.zfs.misc.arcstats.demand_metadata_hits: 4026012
kstat.zfs.misc.arcstats.demand_metadata_misses: 141654
kstat.zfs.misc.arcstats.prefetch_data_hits: 116044
kstat.zfs.misc.arcstats.prefetch_data_misses: 745305
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 392781
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 7183
kstat.zfs.misc.arcstats.mru_hits: 2456403
kstat.zfs.misc.arcstats.mru_ghost_hits: 12154
kstat.zfs.misc.arcstats.mfu_hits: 11832908
kstat.zfs.misc.arcstats.mfu_ghost_hits: 84764
kstat.zfs.misc.arcstats.deleted: 1853402
kstat.zfs.misc.arcstats.recycle_miss: 262586
kstat.zfs.misc.arcstats.mutex_miss: 1180
kstat.zfs.misc.arcstats.evict_skip: 20216195
kstat.zfs.misc.arcstats.hash_elements: 6645
kstat.zfs.misc.arcstats.hash_elements_max: 34534
kstat.zfs.misc.arcstats.hash_collisions: 588095
kstat.zfs.misc.arcstats.hash_chains: 715
kstat.zfs.misc.arcstats.hash_chain_max: 7
kstat.zfs.misc.arcstats.p: 412187380
kstat.zfs.misc.arcstats.c: 415807010
kstat.zfs.misc.arcstats.c_min: 21353344
kstat.zfs.misc.arcstats.c_max: 512480256
kstat.zfs.misc.arcstats.size: 416037376





