Improving ZFS performance for large directories
Kevin Day
toasty at dragondata.com
Tue Feb 19 20:11:03 UTC 2013
Sorry for the late follow-up; I've been doing some testing with an L2ARC device.
>> Doing it twice back-to-back makes a bit of difference but it's still slow either way.
>
> ZFS can be very conservative about caching data, and twice might not be enough.
> I suggest you try 8-10 times, or until the time stops reducing.
>
Timing an "ls" in large directories 20 times, the first run is the slowest and all subsequent listings take roughly the same time. There doesn't appear to be any further gain after 20 repetitions.
>> I think some of the issue is that nothing is being allowed to stay cached long.
>
> Well ZFS doesn't do any time-based eviction so if things aren't
> staying in the cache, it's because they are being evicted by things
> that ZFS considers more deserving.
>
> Looking at the zfs-stats you posted, it looks like your workload has
> very low locality of reference (the data hit rate is very low). If
> this is not what you expect then you need more RAM. OTOH, your
> vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit suggests that
> ZFS really wants to cache more metadata (by default ZFS has a 25%
> metadata, 75% data split in ARC to prevent metadata caching starving
> data caching). I would go even further than the 50:50 split suggested
> later and try 75:25 (i.e., triple the current vfs.zfs.arc_meta_limit).
>
> Note that if there is basically no locality of reference in your
> workload (as I suspect), you can even turn off data caching for
> specific filesystems with zfs set primarycache=metadata tank/foo
> (note that you still need to increase vfs.zfs.arc_meta_limit to
> allow ZFS to use the ARC to cache metadata).
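If I'm following the suggestion correctly, it boils down to something like the following ("tank/foo" is just the placeholder dataset name from the quote, and the limit value is simply ~75% of my current vfs.zfs.arc_max, not a tested recommendation):

    # cache only metadata in the ARC for a filesystem with no data locality
    zfs set primarycache=metadata tank/foo

    # /boot/loader.conf: raise the ARC metadata limit to ~75% of arc_max
    vfs.zfs.arc_meta_limit="24597239808"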
Now that I've got an L2ARC device (250GB), I've been doing some playing. With the defaults (primarycache and secondarycache set to all), I really didn't see much improvement. The SSD filled itself pretty quickly, but its hit rate was around 1%, even after 48 hours.
Thinking that making the primary cache metadata-only and the secondary cache "all" would improve things, I wiped the device (SATA secure erase to make sure) and tried again. This was much worse. I'm guessing that because some amount of real file data was being read frequently, the SSD was basically getting hammered with reads at 100% utilization, and things were far slower.
I wiped the SSD and tried again with primarycache=all, secondarycache=metadata, and things have improved. Even with vfs.zfs.l2arc_write_max boosted, it took quite a while before things stabilized. I'm guessing there isn't a huge amount of data, but locality is so poor and sweeping the entire filesystem takes so long that it takes a while before ZFS decides what's worth caching. After about 20 hours in this configuration, it makes a HUGE difference in directory speeds, though. Before adding the SSD, an "ls" in a directory with 65k files would take 10-30 seconds; it's now down to about 0.2 seconds.
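For reference, the configuration that's working for me now amounts to roughly this ("tank" is a placeholder for my pool name; the two write limits are the values I'm currently running, visible in the sysctl dump below, raised from the defaults of 8MB, I believe):

    zfs set primarycache=all tank
    zfs set secondarycache=metadata tank

    # /boot/loader.conf: let the L2ARC fill faster
    vfs.zfs.l2arc_write_max="26214400"      # 25MB per feed pass
    vfs.zfs.l2arc_write_boost="52428800"    # 50MB per pass until the ARC is warm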
So I'm guessing the theory was right: there was more metadata than would fit in the ARC, so it was constantly churning. I'm a bit surprised that continually doing an "ls" in a big directory didn't make it stick better, but these filesystems are HUGE, so there may be some inefficiencies happening here. There are roughly 29M files, growing at about 50k files/day. We recently upgraded, and are now at 96 3TB drives in the pool.
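As a back-of-the-envelope sanity check (using my own rough guess of ~512 bytes of cached metadata per file, nothing measured): 29M files x 512 bytes is roughly 15GB, which is right about where vfs.zfs.arc_meta_used sits pegged against vfs.zfs.arc_meta_limit (~16GB) in the stats below, so the "metadata doesn't fit in the ARC" theory at least passes the smell test.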
What I also find surprising is this:
L2 ARC Size: (Adaptive) 22.70 GiB
Header Size: 0.31% 71.49 MiB
L2 ARC Breakdown: 23.77m
Hit Ratio: 34.26% 8.14m
Miss Ratio: 65.74% 15.62m
Feeds: 63.28k
It's a 250GB drive, only ~22GB is being used, and there's still a ~66% miss rate. Is there any way to tell why more metadata isn't being pushed to the L2ARC? I see a pretty high count for "Passed Headroom" and "Tried Lock Failures", but I'm not sure if that's normal. I'm including the lengthy output of zfs-stats below in case anyone sees something that stands out as unusual.
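In case it helps with the diagnosis, these are the tunables that (as I understand it) control how much the L2ARC feed thread scans and writes on each pass, with my current values from the dump below; I don't know yet which of them, if any, is the limiting factor:

    vfs.zfs.l2arc_headroom      2           # multiples of write_max scanned past the ARC tail each pass
    vfs.zfs.l2arc_feed_secs     1           # seconds between feed passes
    vfs.zfs.l2arc_feed_min_ms   200         # minimum interval between passes
    vfs.zfs.l2arc_noprefetch    1           # prefetched buffers are never fed to the L2ARC
    vfs.zfs.l2arc_write_max     26214400    # max bytes written per pass (already raised)
    vfs.zfs.l2arc_write_boost   52428800    # larger allowance until the ARC is warm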
------------------------------------------------------------------------
ZFS Subsystem Report Tue Feb 19 20:08:19 2013
------------------------------------------------------------------------
System Information:
Kernel Version: 901000 (osreldate)
Hardware Platform: amd64
Processor Architecture: amd64
ZFS Storage pool Version: 28
ZFS Filesystem Version: 5
FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root
8:08PM up 20:40, 3 users, load averages: 0.47, 0.50, 0.52
------------------------------------------------------------------------
System Memory:
8.41% 5.22 GiB Active, 10.18% 6.32 GiB Inact
77.39% 48.05 GiB Wired, 1.52% 966.99 MiB Cache
2.50% 1.55 GiB Free, 0.00% 888.00 KiB Gap
Real Installed: 64.00 GiB
Real Available: 99.97% 63.98 GiB
Real Managed: 97.04% 62.08 GiB
Logical Total: 64.00 GiB
Logical Used: 86.22% 55.18 GiB
Logical Free: 13.78% 8.82 GiB
Kernel Memory: 23.18 GiB
Data: 99.91% 23.16 GiB
Text: 0.09% 21.27 MiB
Kernel Memory Map: 52.10 GiB
Size: 35.21% 18.35 GiB
Free: 64.79% 33.75 GiB
------------------------------------------------------------------------
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 10.24m
Recycle Misses: 3.48m
Mutex Misses: 24.85k
Evict Skips: 12.79m
ARC Size: 92.50% 28.25 GiB
Target Size: (Adaptive) 92.50% 28.25 GiB
Min Size (Hard Limit): 25.00% 7.64 GiB
Max Size (High Water): 4:1 30.54 GiB
ARC Size Breakdown:
Recently Used Cache Size: 62.35% 17.62 GiB
Frequently Used Cache Size: 37.65% 10.64 GiB
ARC Hash Breakdown:
Elements Max: 1.99m
Elements Current: 99.16% 1.98m
Collisions: 8.97m
Chain Max: 14
Chains: 586.97k
------------------------------------------------------------------------
ARC Efficiency: 1.15b
Cache Hit Ratio: 97.66% 1.12b
Cache Miss Ratio: 2.34% 26.80m
Actual Hit Ratio: 72.75% 833.30m
Data Demand Efficiency: 98.39% 33.94m
Data Prefetch Efficiency: 8.11% 7.60m
CACHE HITS BY CACHE LIST:
Anonymously Used: 23.88% 267.15m
Most Recently Used: 4.70% 52.60m
Most Frequently Used: 69.79% 780.70m
Most Recently Used Ghost: 0.64% 7.13m
Most Frequently Used Ghost: 0.98% 10.99m
CACHE HITS BY DATA TYPE:
Demand Data: 2.99% 33.40m
Prefetch Data: 0.06% 616.42k
Demand Metadata: 71.38% 798.44m
Prefetch Metadata: 25.58% 286.13m
CACHE MISSES BY DATA TYPE:
Demand Data: 2.04% 546.67k
Prefetch Data: 26.07% 6.99m
Demand Metadata: 37.96% 10.18m
Prefetch Metadata: 33.93% 9.09m
------------------------------------------------------------------------
L2 ARC Summary: (HEALTHY)
Passed Headroom: 3.62m
Tried Lock Failures: 3.17m
IO In Progress: 21.18k
Low Memory Aborts: 20
Free on Write: 7.07k
Writes While Full: 134
R/W Clashes: 1.63k
Bad Checksums: 0
IO Errors: 0
SPA Mismatch: 0
L2 ARC Size: (Adaptive) 22.70 GiB
Header Size: 0.31% 71.02 MiB
L2 ARC Breakdown: 23.78m
Hit Ratio: 34.25% 8.15m
Miss Ratio: 65.75% 15.64m
Feeds: 63.47k
L2 ARC Buffer:
Bytes Scanned: 65.51 TiB
Buffer Iterations: 63.47k
List Iterations: 4.06m
NULL List Iterations: 64.89k
L2 ARC Writes:
Writes Sent: 100.00% 29.89k
------------------------------------------------------------------------
File-Level Prefetch: (HEALTHY)
DMU Efficiency: 1.24b
Hit Ratio: 64.29% 798.62m
Miss Ratio: 35.71% 443.54m
Colinear: 443.54m
Hit Ratio: 0.00% 20.45k
Miss Ratio: 100.00% 443.52m
Stride: 772.29m
Hit Ratio: 99.99% 772.21m
Miss Ratio: 0.01% 81.30k
DMU Misc:
Reclaim: 443.52m
Successes: 0.05% 220.47k
Failures: 99.95% 443.30m
Streams: 26.42m
+Resets: 0.05% 12.73k
-Resets: 99.95% 26.41m
Bogus: 0
------------------------------------------------------------------------
VDEV cache is disabled
------------------------------------------------------------------------
ZFS Tunables (sysctl):
kern.maxusers 384
vm.kmem_size 66662760448
vm.kmem_size_scale 1
vm.kmem_size_min 0
vm.kmem_size_max 329853485875
vfs.zfs.l2c_only_size 5242113536
vfs.zfs.mfu_ghost_data_lsize 178520064
vfs.zfs.mfu_ghost_metadata_lsize 6486959104
vfs.zfs.mfu_ghost_size 6665479168
vfs.zfs.mfu_data_lsize 11863127552
vfs.zfs.mfu_metadata_lsize 123386368
vfs.zfs.mfu_size 12432947200
vfs.zfs.mru_ghost_data_lsize 14095171584
vfs.zfs.mru_ghost_metadata_lsize 8351076864
vfs.zfs.mru_ghost_size 22446248448
vfs.zfs.mru_data_lsize 2076449280
vfs.zfs.mru_metadata_lsize 4655490560
vfs.zfs.mru_size 7074721792
vfs.zfs.anon_data_lsize 0
vfs.zfs.anon_metadata_lsize 0
vfs.zfs.anon_size 1605632
vfs.zfs.l2arc_norw 1
vfs.zfs.l2arc_feed_again 1
vfs.zfs.l2arc_noprefetch 1
vfs.zfs.l2arc_feed_min_ms 200
vfs.zfs.l2arc_feed_secs 1
vfs.zfs.l2arc_headroom 2
vfs.zfs.l2arc_write_boost 52428800
vfs.zfs.l2arc_write_max 26214400
vfs.zfs.arc_meta_limit 16398159872
vfs.zfs.arc_meta_used 16398120264
vfs.zfs.arc_min 8199079936
vfs.zfs.arc_max 32796319744
vfs.zfs.dedup.prefetch 1
vfs.zfs.mdcomp_disable 0
vfs.zfs.write_limit_override 0
vfs.zfs.write_limit_inflated 206088929280
vfs.zfs.write_limit_max 8587038720
vfs.zfs.write_limit_min 33554432
vfs.zfs.write_limit_shift 3
vfs.zfs.no_write_throttle 0
vfs.zfs.zfetch.array_rd_sz 1048576
vfs.zfs.zfetch.block_cap 256
vfs.zfs.zfetch.min_sec_reap 2
vfs.zfs.zfetch.max_streams 8
vfs.zfs.prefetch_disable 0
vfs.zfs.mg_alloc_failures 12
vfs.zfs.check_hostid 1
vfs.zfs.recover 0
vfs.zfs.txg.synctime_ms 1000
vfs.zfs.txg.timeout 5
vfs.zfs.vdev.cache.bshift 16
vfs.zfs.vdev.cache.size 0
vfs.zfs.vdev.cache.max 16384
vfs.zfs.vdev.write_gap_limit 4096
vfs.zfs.vdev.read_gap_limit 32768
vfs.zfs.vdev.aggregation_limit 131072
vfs.zfs.vdev.ramp_rate 2
vfs.zfs.vdev.time_shift 6
vfs.zfs.vdev.min_pending 4
vfs.zfs.vdev.max_pending 128
vfs.zfs.vdev.bio_flush_disable 0
vfs.zfs.cache_flush_disable 0
vfs.zfs.zil_replay_disable 0
vfs.zfs.zio.use_uma 0
vfs.zfs.snapshot_list_prefetch 0
vfs.zfs.version.zpl 5
vfs.zfs.version.spa 28
vfs.zfs.version.acl 1
vfs.zfs.debug 0
vfs.zfs.super_owner 0