Improving ZFS performance for large directories

Kevin Day toasty at dragondata.com
Tue Feb 19 20:11:03 UTC 2013


Sorry for the late follow-up; I've been doing some testing with an L2ARC device.


>> Doing it twice back-to-back makes a bit of difference but it's still slow either way.
> 
> ZFS can be very conservative about caching data and twice might not be enough.
> I suggest you try 8-10 times, or until the time stops reducing.
> 

Timing an "ls" in large directories 20 times, the first run is the slowest, and all subsequent listings take roughly the same time. There doesn't appear to be any further gain after 20 repetitions.
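
(For reference, the timing loop is roughly the sketch below; the directory path is just a placeholder.)

	# time 20 back-to-back listings of one large directory; timings go to stderr
	for i in $(jot 20); do
		/usr/bin/time ls /pool/some/bigdir > /dev/null
	done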


>> I think some of the issue is that nothing is being allowed to stay cached long.
> 
> Well ZFS doesn't do any time-based eviction so if things aren't
> staying in the cache, it's because they are being evicted by things
> that ZFS considers more deserving.
> 
> Looking at the zfs-stats you posted, it looks like your workload has
> very low locality of reference (the data hit rate is very low).  If
> this is not what you expect then you need more RAM.  OTOH, your
> vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit suggests that
> ZFS really wants to cache more metadata (by default ZFS has a 25%
> metadata, 75% data split in ARC to prevent metadata caching starving
> data caching).  I would go even further than the 50:50 split suggested
> later and try 75:25 (ie, triple the current vfs.zfs.arc_meta_limit).
> 
> Note that if there is basically no locality of reference in your
> workload (as I suspect), you can even turn off data caching for
> specific filesystems with zfs set primarycache=metadata tank/foo
> (note that you still need to increase vfs.zfs.arc_meta_limit to
> allow ZFS to use the ARC to cache metadata).

Now that I've got an L2ARC device (250GB), I've been doing some playing. With the defaults (primarycache and secondarycache both set to "all"), I really didn't see much improvement. The SSD filled itself pretty quickly, but its hit rate was around 1%, even after 48 hours.

Thinking that making the primary cache metadata-only and leaving the secondary cache at "all" would improve things, I wiped the device (SATA secure erase, to make sure) and tried again. This was much worse. I'm guessing that because there was some amount of real file data being read frequently, the SSD was basically getting hammered with reads at 100% utilization, and things were far slower.
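
(In zfs-property terms, that experiment looked like this, with "tank/foo" standing in for the real dataset name:)

	# metadata-only ARC, everything eligible for L2ARC -- the combination that performed worse
	zfs set primarycache=metadata tank/foo
	zfs set secondarycache=all tank/foo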

I wiped the SSD and tried again with primarycache=all, secondarycache=metadata, and things have improved. Even with vfs.zfs.l2arc_write_max boosted, it took quite a while before things stabilized. I'm guessing there isn't a huge amount of data, but the locality is so poor and sweeping the entire filesystem takes so long that it's going to take a while before ZFS decides what's worth caching. After about 20 hours in this configuration, it's made a HUGE difference in directory speeds, though. Before adding the SSD, an "ls" in a directory with 65k files would take 10-30 seconds; it's now down to about 0.2 seconds.
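
(For completeness, the setup that's working now looks roughly like this, again with "tank/foo" as a placeholder and the write-rate values simply copied from the sysctl output further down:)

	# ARC caches everything, L2ARC holds metadata only
	zfs set primarycache=all tank/foo
	zfs set secondarycache=metadata tank/foo
	# let the L2ARC fill faster than the 8 MB/s default
	sysctl vfs.zfs.l2arc_write_max=26214400
	sysctl vfs.zfs.l2arc_write_boost=52428800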

So I'm guessing the theory was right: there was more metadata than would fit in the ARC, so it was constantly churning. I'm a bit surprised that repeatedly doing an ls in a big directory didn't make it stick better, but these filesystems are HUGE, so there may be some inefficiencies happening here. There are roughly 29M files, growing at about 50k files/day. We recently upgraded and are now at 96 3TB drives in the pool.
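
(Back-of-envelope, assuming roughly 512 bytes of on-disk dnode per file: 29M files is already around 14 GiB of dnodes before counting directory and indirect blocks, which is right up against the ~15.3 GiB vfs.zfs.arc_meta_limit shown below, so it's plausible the metadata genuinely doesn't fit.)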

What I also find surprising is this:

L2 ARC Size: (Adaptive)				22.70	GiB
	Header Size:			0.31%	71.49	MiB

L2 ARC Breakdown:				23.77m
	Hit Ratio:			34.26%	8.14m
	Miss Ratio:			65.74%	15.62m
	Feeds:					63.28k

It's a 250G drive, and only 22G is being used, yet there's still a ~66% miss rate. Is there any way to tell why more metadata isn't being pushed to the L2ARC? I see a pretty high count for "Passed Headroom" and "Tried Lock Failures", but I'm not sure if that's normal. I'm including the lengthy output of zfs-stats below in case anyone sees something that stands out as unusual.
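
(As far as I know, zfs-stats is just summarizing the kstat sysctls, so the raw L2ARC counters can also be pulled directly if the percentages hide anything:)

	# raw L2ARC counters behind the zfs-stats summary
	sysctl kstat.zfs.misc.arcstats | grep l2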

------------------------------------------------------------------------
ZFS Subsystem Report				Tue Feb 19 20:08:19 2013
------------------------------------------------------------------------

System Information:

	Kernel Version:				901000 (osreldate)
	Hardware Platform:			amd64
	Processor Architecture:			amd64

	ZFS Storage pool Version:		28
	ZFS Filesystem Version:			5

FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root
 8:08PM  up 20:40, 3 users, load averages: 0.47, 0.50, 0.52

------------------------------------------------------------------------

System Memory:

	8.41%	5.22	GiB Active,	10.18%	6.32	GiB Inact
	77.39%	48.05	GiB Wired,	1.52%	966.99	MiB Cache
	2.50%	1.55	GiB Free,	0.00%	888.00	KiB Gap

	Real Installed:				64.00	GiB
	Real Available:			99.97%	63.98	GiB
	Real Managed:			97.04%	62.08	GiB

	Logical Total:				64.00	GiB
	Logical Used:			86.22%	55.18	GiB
	Logical Free:			13.78%	8.82	GiB

Kernel Memory:					23.18	GiB
	Data:				99.91%	23.16	GiB
	Text:				0.09%	21.27	MiB

Kernel Memory Map:				52.10	GiB
	Size:				35.21%	18.35	GiB
	Free:				64.79%	33.75	GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				10.24m
	Recycle Misses:				3.48m
	Mutex Misses:				24.85k
	Evict Skips:				12.79m

ARC Size:				92.50%	28.25	GiB
	Target Size: (Adaptive)		92.50%	28.25	GiB
	Min Size (Hard Limit):		25.00%	7.64	GiB
	Max Size (High Water):		4:1	30.54	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	62.35%	17.62	GiB
	Frequently Used Cache Size:	37.65%	10.64	GiB

ARC Hash Breakdown:
	Elements Max:				1.99m
	Elements Current:		99.16%	1.98m
	Collisions:				8.97m
	Chain Max:				14
	Chains:					586.97k

------------------------------------------------------------------------

ARC Efficiency:					1.15b
	Cache Hit Ratio:		97.66%	1.12b
	Cache Miss Ratio:		2.34%	26.80m
	Actual Hit Ratio:		72.75%	833.30m

	Data Demand Efficiency:		98.39%	33.94m
	Data Prefetch Efficiency:	8.11%	7.60m

	CACHE HITS BY CACHE LIST:
	  Anonymously Used:		23.88%	267.15m
	  Most Recently Used:		4.70%	52.60m
	  Most Frequently Used:		69.79%	780.70m
	  Most Recently Used Ghost:	0.64%	7.13m
	  Most Frequently Used Ghost:	0.98%	10.99m

	CACHE HITS BY DATA TYPE:
	  Demand Data:			2.99%	33.40m
	  Prefetch Data:		0.06%	616.42k
	  Demand Metadata:		71.38%	798.44m
	  Prefetch Metadata:		25.58%	286.13m

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			2.04%	546.67k
	  Prefetch Data:		26.07%	6.99m
	  Demand Metadata:		37.96%	10.18m
	  Prefetch Metadata:		33.93%	9.09m

------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
	Passed Headroom:			3.62m
	Tried Lock Failures:			3.17m
	IO In Progress:				21.18k
	Low Memory Aborts:			20
	Free on Write:				7.07k
	Writes While Full:			134
	R/W Clashes:				1.63k
	Bad Checksums:				0
	IO Errors:				0
	SPA Mismatch:				0

L2 ARC Size: (Adaptive)				22.70	GiB
	Header Size:			0.31%	71.02	MiB

L2 ARC Breakdown:				23.78m
	Hit Ratio:			34.25%	8.15m
	Miss Ratio:			65.75%	15.64m
	Feeds:					63.47k

L2 ARC Buffer:
	Bytes Scanned:				65.51	TiB
	Buffer Iterations:			63.47k
	List Iterations:			4.06m
	NULL List Iterations:			64.89k

L2 ARC Writes:
	Writes Sent:			100.00%	29.89k

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:					1.24b
	Hit Ratio:			64.29%	798.62m
	Miss Ratio:			35.71%	443.54m

	Colinear:				443.54m
	  Hit Ratio:			0.00%	20.45k
	  Miss Ratio:			100.00%	443.52m

	Stride:					772.29m
	  Hit Ratio:			99.99%	772.21m
	  Miss Ratio:			0.01%	81.30k

DMU Misc:
	Reclaim:				443.52m
	  Successes:			0.05%	220.47k
	  Failures:			99.95%	443.30m

	Streams:				26.42m
	  +Resets:			0.05%	12.73k
	  -Resets:			99.95%	26.41m
	  Bogus:				0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
	kern.maxusers                           384
	vm.kmem_size                            66662760448
	vm.kmem_size_scale                      1
	vm.kmem_size_min                        0
	vm.kmem_size_max                        329853485875
	vfs.zfs.l2c_only_size                   5242113536
	vfs.zfs.mfu_ghost_data_lsize            178520064
	vfs.zfs.mfu_ghost_metadata_lsize        6486959104
	vfs.zfs.mfu_ghost_size                  6665479168
	vfs.zfs.mfu_data_lsize                  11863127552
	vfs.zfs.mfu_metadata_lsize              123386368
	vfs.zfs.mfu_size                        12432947200
	vfs.zfs.mru_ghost_data_lsize            14095171584
	vfs.zfs.mru_ghost_metadata_lsize        8351076864
	vfs.zfs.mru_ghost_size                  22446248448
	vfs.zfs.mru_data_lsize                  2076449280
	vfs.zfs.mru_metadata_lsize              4655490560
	vfs.zfs.mru_size                        7074721792
	vfs.zfs.anon_data_lsize                 0
	vfs.zfs.anon_metadata_lsize             0
	vfs.zfs.anon_size                       1605632
	vfs.zfs.l2arc_norw                      1
	vfs.zfs.l2arc_feed_again                1
	vfs.zfs.l2arc_noprefetch                1
	vfs.zfs.l2arc_feed_min_ms               200
	vfs.zfs.l2arc_feed_secs                 1
	vfs.zfs.l2arc_headroom                  2
	vfs.zfs.l2arc_write_boost               52428800
	vfs.zfs.l2arc_write_max                 26214400
	vfs.zfs.arc_meta_limit                  16398159872
	vfs.zfs.arc_meta_used                   16398120264
	vfs.zfs.arc_min                         8199079936
	vfs.zfs.arc_max                         32796319744
	vfs.zfs.dedup.prefetch                  1
	vfs.zfs.mdcomp_disable                  0
	vfs.zfs.write_limit_override            0
	vfs.zfs.write_limit_inflated            206088929280
	vfs.zfs.write_limit_max                 8587038720
	vfs.zfs.write_limit_min                 33554432
	vfs.zfs.write_limit_shift               3
	vfs.zfs.no_write_throttle               0
	vfs.zfs.zfetch.array_rd_sz              1048576
	vfs.zfs.zfetch.block_cap                256
	vfs.zfs.zfetch.min_sec_reap             2
	vfs.zfs.zfetch.max_streams              8
	vfs.zfs.prefetch_disable                0
	vfs.zfs.mg_alloc_failures               12
	vfs.zfs.check_hostid                    1
	vfs.zfs.recover                         0
	vfs.zfs.txg.synctime_ms                 1000
	vfs.zfs.txg.timeout                     5
	vfs.zfs.vdev.cache.bshift               16
	vfs.zfs.vdev.cache.size                 0
	vfs.zfs.vdev.cache.max                  16384
	vfs.zfs.vdev.write_gap_limit            4096
	vfs.zfs.vdev.read_gap_limit             32768
	vfs.zfs.vdev.aggregation_limit          131072
	vfs.zfs.vdev.ramp_rate                  2
	vfs.zfs.vdev.time_shift                 6
	vfs.zfs.vdev.min_pending                4
	vfs.zfs.vdev.max_pending                128
	vfs.zfs.vdev.bio_flush_disable          0
	vfs.zfs.cache_flush_disable             0
	vfs.zfs.zil_replay_disable              0
	vfs.zfs.zio.use_uma                     0
	vfs.zfs.snapshot_list_prefetch          0
	vfs.zfs.version.zpl                     5
	vfs.zfs.version.spa                     28
	vfs.zfs.version.acl                     1
	vfs.zfs.debug                           0
	vfs.zfs.super_owner                     0


