ZFS read performance disparity between clone and parent

Matthew Ahrens mahrens at delphix.com
Sat Jun 13 03:50:45 UTC 2015


On Wed, May 13, 2015 at 11:54 AM, Nathan Weeks <weeks at iastate.edu> wrote:

> While troubleshooting performance disparities between development and
> production jails hosting PostgreSQL instances, I noticed (with the help of
> dtruss) that the 8k read() performance in the production jail was an order
> of magnitude worse than the read() performance in the development jail. As
> the ZFS file system hosting the production jail was cloned from a snapshot
> of the development jail, and had not been modified, this didn't make sense
> to me.
>
> Using the "dd" command with an 8k block size to emulate the PostgreSQL
> read() size, I observed a large performance difference between reading one
> of the large (1G) underlying postgres database files in the development
> jail's file system vs. the corresponding file in the cloned file system:
>
> # dd if=/jails/dev/usr/local/pgsql/data/base/16399/16436 of=/dev/null
> bs=8192
> 131072+0 records in
> 131072+0 records out
> 1073741824 bytes transferred in 4.198993 secs (255714128 bytes/sec)
> # dd if=/jails/prod/usr/local/pgsql/data/base/16399/16436 of=/dev/null
> bs=8192
> 131072+0 records in
> 131072+0 records out
> 1073741824 bytes transferred in 17.314135 secs (62015331 bytes/sec)
> # ls -l /jails/dev/usr/local/pgsql/data/base/16399/16436
> /jails/prod/usr/local/pgsql/data/base/16399/16436
> -rw------- 1 70 70 1073741824 Feb 5 16:41
> /jails/dev/usr/local/pgsql/data/base/16399/16436
> -rw------- 1 70 70 1073741824 Feb 5 16:41
> /jails/prod/usr/local/pgsql/data/base/16399/16436
>
> I repeated this exercise several times to verify the read performance
> difference. Interestingly, prefixing the "dd" command with "/usr/bin/time
> -l" revealed that in both cases, "block input operations" was 0, apparently
> indicating that both files were being read from cache. In neither case did
> "zpool iostat 1" show significant I/O being performed during the execution
> of the "dd" command.
>
> Has anyone else encountered a similar issue, or does anyone know of an
> explanation/solution/better workaround? I had previously assumed that there
> would be no performance difference between reading a file on a ZFS file
> system and the corresponding file on a cloned file system when none of the
> data blocks have changed (this is FreeBSD 9.3, so the "Single Copy ARC"
> feature should apply). Dedup isn't being used on any file system.
>

An unfortunate byproduct of the "single copy ARC" feature is that the first
dataset to read a block performs better than subsequent readers, which have
to do an extra bcopy() of the block.  You should be able to alleviate this
by evicting the cached buffers, either by unmounting the first filesystem or
by running "zinject -a".  We are working on a fix for this as part of the
"compressed ARC" feature, which will be coming soon.
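
For example, a rough sketch of the workaround (the dataset name
"tank/jails/dev" is just a placeholder for whatever backs /jails/dev, and
note that "zinject -a" drops the entire ARC, so the next reads of everything
will be cold):

# zfs unmount tank/jails/dev
# zfs mount tank/jails/dev

or, to flush the whole ARC:

# zinject -a

After evicting, whichever dataset re-reads the blocks first (e.g. the prod
clone) gets the fast path, and the other dataset then pays the extra
bcopy() instead.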

You can verify this by comparing flame graphs of CPU usage in both cases
(http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html).
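
A sketch of how that might look on FreeBSD with DTrace plus the
stackcollapse.pl/flamegraph.pl scripts from the FlameGraph repository linked
above (the 199 Hz sample rate, 30-second duration, and output file names are
arbitrary choices):

# dtrace -x stackframes=100 \
    -n 'profile-199 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }' \
    -o out.kern_stacks
# ./stackcollapse.pl out.kern_stacks > out.kern_folded
# ./flamegraph.pl out.kern_folded > dd_run.svg

Run that once while dd is reading the dev copy and once while it is reading
the cloned copy; if the extra bcopy() is the culprit, the slower run's graph
should show noticeably more CPU time in bcopy()-related frames under the
read path.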

--matt


>
> The output of zfs-stats follows; I can provide any additional info that
> might be of use in identifying the cause of this issue.
>
> ------------------------------------------------------------------------
> ZFS Subsystem Report                            Wed May 13 12:22:00 2015
> ------------------------------------------------------------------------
>
> System Information:
>
>         Kernel Version:                         903000 (osreldate)
>         Hardware Platform:                      amd64
>         Processor Architecture:                 amd64
>
>         ZFS Storage pool Version:               5000
>         ZFS Filesystem Version:                 5
>
> FreeBSD 9.3-RELEASE-p5 #0: Mon Nov 3 22:38:58 UTC 2014 root
> 12:22PM  up 166 days,  3:36, 7 users, load averages: 2.34, 2.31, 2.17
>
> ------------------------------------------------------------------------
>
> System Memory:
>
>         8.83%   21.95   GiB Active,     1.67%   4.14    GiB Inact
>         68.99%  171.40  GiB Wired,      0.40%   1.00    GiB Cache
>         20.10%  49.93   GiB Free,       0.01%   16.12   MiB Gap
>
>         Real Installed:                         256.00  GiB
>         Real Available:                 99.99%  255.97  GiB
>         Real Managed:                   97.06%  248.43  GiB
>
>         Logical Total:                          256.00  GiB
>         Logical Used:                   78.49%  200.92  GiB
>         Logical Free:                   21.51%  55.08   GiB
>
> Kernel Memory:                                  117.28  GiB
>         Data:                           99.98%  117.25  GiB
>         Text:                           0.02%   26.07   MiB
>
> Kernel Memory Map:                              241.10  GiB
>         Size:                           43.83%  105.67  GiB
>         Free:                           56.17%  135.43  GiB
>
> ------------------------------------------------------------------------
>
> ARC Summary: (HEALTHY)
>         Memory Throttle Count:                  0
>
> ARC Misc:
>         Deleted:                                143.56m
>         Recycle Misses:                         275.73m
>         Mutex Misses:                           1.50m
>         Evict Skips:                            20.24b
>
> ARC Size:                               99.77%  127.71  GiB
>         Target Size: (Adaptive)         100.00% 128.00  GiB
>         Min Size (Hard Limit):          12.50%  16.00   GiB
>         Max Size (High Water):          8:1     128.00  GiB
>
> ARC Size Breakdown:
>         Recently Used Cache Size:       68.86%  88.15   GiB
>         Frequently Used Cache Size:     31.14%  39.85   GiB
>
> ARC Hash Breakdown:
>         Elements Max:                           27.87m
>         Elements Current:               40.13%  11.18m
>         Collisions:                             1.95b
>         Chain Max:                              26
>         Chains:                                 2.44m
>
> ------------------------------------------------------------------------
>
> ARC Efficiency:                                 88.77b
>         Cache Hit Ratio:                99.52%  88.34b
>         Cache Miss Ratio:               0.48%   426.00m
>         Actual Hit Ratio:               98.86%  87.76b
>
>         Data Demand Efficiency:         99.99%  58.75b
>         Data Prefetch Efficiency:       98.47%  1.08b
>
>         CACHE HITS BY CACHE LIST:
>           Anonymously Used:             0.21%   187.51m
>           Most Recently Used:           1.93%   1.71b
>           Most Frequently Used:         97.41%  86.05b
>           Most Recently Used Ghost:     0.04%   39.14m
>           Most Frequently Used Ghost:   0.41%   358.78m
>
>         CACHE HITS BY DATA TYPE:
>           Demand Data:                  66.49%  58.74b
>           Prefetch Data:                1.21%   1.07b
>           Demand Metadata:              31.74%  28.04b
>           Prefetch Metadata:            0.56%   491.01m
>
>         CACHE MISSES BY DATA TYPE:
>           Demand Data:                  1.70%   7.26m
>           Prefetch Data:                3.89%   16.56m
>           Demand Metadata:              83.84%  357.15m
>           Prefetch Metadata:            10.57%  45.03m
>
> ------------------------------------------------------------------------
>
> L2ARC is disabled
>
> ------------------------------------------------------------------------
>
> File-Level Prefetch: (HEALTHY)
>
> DMU Efficiency:                                 187.26b
>         Hit Ratio:                      82.21%  153.94b
>         Miss Ratio:                     17.79%  33.32b
>
>         Colinear:                               33.32b
>           Hit Ratio:                    0.01%   3.35m
>           Miss Ratio:                   99.99%  33.32b
>
>         Stride:                                 150.63b
>           Hit Ratio:                    100.00% 150.63b
>           Miss Ratio:                   0.00%   453.04k
>
> DMU Misc:
>         Reclaim:                                33.32b
>           Successes:                    0.36%   118.64m
>           Failures:                     99.64%  33.20b
>
>         Streams:                                3.31b
>           +Resets:                      0.00%   20.36k
>           -Resets:                      100.00% 3.31b
>           Bogus:                                0
>
> ------------------------------------------------------------------------
>
> VDEV cache is disabled
>
> ------------------------------------------------------------------------
>
> ZFS Tunables (sysctl):
>         kern.maxusers                           16718
>         vm.kmem_size                            266754412544
>         vm.kmem_size_scale                      1
>         vm.kmem_size_min                        0
>         vm.kmem_size_max                        329853485875
>         vfs.zfs.l2c_only_size                   0
>         vfs.zfs.mfu_ghost_data_lsize            63695688192
>         vfs.zfs.mfu_ghost_metadata_lsize        8300248064
>         vfs.zfs.mfu_ghost_size                  71995936256
>         vfs.zfs.mfu_data_lsize                  34951425024
>         vfs.zfs.mfu_metadata_lsize              4976638976
>         vfs.zfs.mfu_size                        41843978240
>         vfs.zfs.mru_ghost_data_lsize            41844330496
>         vfs.zfs.mru_ghost_metadata_lsize        23598693888
>         vfs.zfs.mru_ghost_size                  65443024384
>         vfs.zfs.mru_data_lsize                  67918019072
>         vfs.zfs.mru_metadata_lsize              411918848
>         vfs.zfs.mru_size                        71823354880
>         vfs.zfs.anon_data_lsize                 0
>         vfs.zfs.anon_metadata_lsize             0
>         vfs.zfs.anon_size                       29893120
>         vfs.zfs.l2arc_norw                      1
>         vfs.zfs.l2arc_feed_again                1
>         vfs.zfs.l2arc_noprefetch                1
>         vfs.zfs.l2arc_feed_min_ms               200
>         vfs.zfs.l2arc_feed_secs                 1
>         vfs.zfs.l2arc_headroom                  2
>         vfs.zfs.l2arc_write_boost               8388608
>         vfs.zfs.l2arc_write_max                 8388608
>         vfs.zfs.arc_meta_limit                  34359738368
>         vfs.zfs.arc_meta_used                   34250008792
>         vfs.zfs.arc_min                         17179869184
>         vfs.zfs.arc_max                         137438953472
>         vfs.zfs.dedup.prefetch                  1
>         vfs.zfs.mdcomp_disable                  0
>         vfs.zfs.nopwrite_enabled                1
>         vfs.zfs.zfetch.array_rd_sz              1048576
>         vfs.zfs.zfetch.block_cap                256
>         vfs.zfs.zfetch.min_sec_reap             2
>         vfs.zfs.zfetch.max_streams              8
>         vfs.zfs.prefetch_disable                0
>         vfs.zfs.no_scrub_prefetch               0
>         vfs.zfs.no_scrub_io                     0
>         vfs.zfs.resilver_min_time_ms            3000
>         vfs.zfs.free_min_time_ms                1000
>         vfs.zfs.scan_min_time_ms                1000
>         vfs.zfs.scan_idle                       50
>         vfs.zfs.scrub_delay                     4
>         vfs.zfs.resilver_delay                  2
>         vfs.zfs.top_maxinflight                 32
>         vfs.zfs.write_to_degraded               0
>         vfs.zfs.mg_noalloc_threshold            0
>         vfs.zfs.condense_pct                    200
>         vfs.zfs.metaslab.weight_factor_enable   0
>         vfs.zfs.metaslab.preload_enabled        1
>         vfs.zfs.metaslab.preload_limit          3
>         vfs.zfs.metaslab.unload_delay           8
>         vfs.zfs.metaslab.load_pct               50
>         vfs.zfs.metaslab.min_alloc_size         10485760
>         vfs.zfs.metaslab.df_free_pct            4
>         vfs.zfs.metaslab.df_alloc_threshold     131072
>         vfs.zfs.metaslab.debug_unload           0
>         vfs.zfs.metaslab.debug_load             0
>         vfs.zfs.metaslab.gang_bang              131073
>         vfs.zfs.check_hostid                    1
>         vfs.zfs.spa_asize_inflation             24
>         vfs.zfs.deadman_enabled                 1
>         vfs.zfs.deadman_checktime_ms            5000
>         vfs.zfs.deadman_synctime_ms             1000000
>         vfs.zfs.recover                         0
>         vfs.zfs.txg.timeout                     5
>         vfs.zfs.min_auto_ashift                 9
>         vfs.zfs.max_auto_ashift                 13
>         vfs.zfs.vdev.cache.bshift               16
>         vfs.zfs.vdev.cache.size                 0
>         vfs.zfs.vdev.cache.max                  16384
>         vfs.zfs.vdev.trim_on_init               1
>         vfs.zfs.vdev.write_gap_limit            4096
>         vfs.zfs.vdev.read_gap_limit             32768
>         vfs.zfs.vdev.aggregation_limit          131072
>         vfs.zfs.vdev.scrub_max_active           2
>         vfs.zfs.vdev.scrub_min_active           1
>         vfs.zfs.vdev.async_write_max_active     10
>         vfs.zfs.vdev.async_write_min_active     1
>         vfs.zfs.vdev.async_read_max_active      3
>         vfs.zfs.vdev.async_read_min_active      1
>         vfs.zfs.vdev.sync_write_max_active      10
>         vfs.zfs.vdev.sync_write_min_active      10
>         vfs.zfs.vdev.sync_read_max_active       10
>         vfs.zfs.vdev.sync_read_min_active       10
>         vfs.zfs.vdev.max_active                 1000
>         vfs.zfs.vdev.bio_delete_disable         0
>         vfs.zfs.vdev.bio_flush_disable          0
>         vfs.zfs.vdev.trim_max_pending           64
>         vfs.zfs.vdev.trim_max_bytes             2147483648
>         vfs.zfs.cache_flush_disable             0
>         vfs.zfs.zil_replay_disable              0
>         vfs.zfs.sync_pass_rewrite               2
>         vfs.zfs.sync_pass_dont_compress         5
>         vfs.zfs.sync_pass_deferred_free         2
>         vfs.zfs.zio.use_uma                     0
>         vfs.zfs.snapshot_list_prefetch          0
>         vfs.zfs.version.ioctl                   3
>         vfs.zfs.version.zpl                     5
>         vfs.zfs.version.spa                     5000
>         vfs.zfs.version.acl                     1
>         vfs.zfs.debug                           0
>         vfs.zfs.super_owner                     0
>         vfs.zfs.trim.enabled                    1
>         vfs.zfs.trim.max_interval               1
>         vfs.zfs.trim.timeout                    30
>         vfs.zfs.trim.txg_delay                  32
>
> ------------------------------------------------------------------------
>
> --
> Nathan Weeks
> USDA-ARS Corn Insects and Crop Genetics Research Unit
> Crop Genome Informatics Laboratory
> Iowa State University
> http://weeks.public.iastate.edu/

