ZFS Optimizing for large directories and MANY files
mike tancsa
mike at sentex.net
Tue Jun 25 18:19:07 UTC 2019
I have been trying once again to understand the various ZFS sysctl settings and how they relate to optimizing a file server that has very few big files but MANY small ones (RELENG_12). Some directories get upwards of 30,000+ files, and the odd time, when some outside user process breaks, 100,000+ files. Obviously, throwing a LOT of RAM at the problem helps. But are there any more tunings I can do? So far,
I have adjusted
vfs.zfs.arc_meta_strategy=1
vfs.zfs.arc_meta_limit to 65% of ARC memory, up from the default 25%

On the dataset in question, I have set primarycache=metadata.
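For reference, a sketch of how those knobs might be applied on a RELENG_12 box. The dataset name "tank/data" and the byte value for the limit are placeholders; the limit shown is roughly 65% of a 30 GiB ARC, and both sysctls should be persisted in /etc/sysctl.conf if they help:

```shell
# Metadata eviction strategy, as adjusted in the text
# (see the sysctl description on your release for the exact semantics):
sysctl vfs.zfs.arc_meta_strategy=1
# arc_meta_limit takes bytes; ~65% of a 30 GiB ARC (placeholder value):
sysctl vfs.zfs.arc_meta_limit=20937965568

# Cache only metadata in the ARC for the dataset in question
# ("tank/data" is a placeholder dataset name):
zfs set primarycache=metadata tank/data
```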
Is there anything else I can do to bias toward a file system with MANY files? Unfortunately, I can't easily stop the end users from dumping many files into their single directories. I think the hit happens when they log in, do a dir, see what files they need to download, download them, and log out. As long as that is cached, it's not so bad.
Doing some simple tests on an imported copy of the dataset (on slower spinning-rust drives), something as simple as

# time find . -type f -mtime -2d

takes 40 minutes after a cold boot.
Watching ZFS disk I/O, it is super slow in terms of bandwidth, but gstat shows the disks close to being pegged. I guess the heads are thrashing about inefficiently?
1{ryzenbsd12}# zpool iostat tmpdisk 1
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tmpdisk 301G 1.46T 335 1 899K 33.0K
tmpdisk 301G 1.46T 402 0 1.02M 0
tmpdisk 301G 1.46T 265 0 559K 0
tmpdisk 301G 1.46T 331 0 715K 0
tmpdisk 301G 1.46T 276 0 650K 0
tmpdisk 301G 1.46T 293 0 718K 0
tmpdisk 301G 1.46T 432 0 1.11M 0
tmpdisk 301G 1.46T 435 0 1.03M 0
tmpdisk 301G 1.46T 412 0 1.01M 0
tmpdisk 301G 1.46T 315 0 717K 0
tmpdisk 301G 1.46T 417 0 1.04M 0
tmpdisk 301G 1.46T 457 0 1.13M 0
tmpdisk 301G 1.46T 448 0 1.05M 0
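The per-disk saturation described above can be watched alongside zpool iostat with a command sketch like the following; near-100 %busy combined with low KiB/s throughput is what a seek-bound (thrashing) disk looks like, as opposed to a bandwidth-bound one:

```shell
# Sample GEOM statistics once per second;
# -p restricts the output to physical providers (the disks themselves).
gstat -p -I 1s
```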
top shows the ARC steadily growing:
/|ARC: 5119M Total, 2128M MFU, 2361M MRU, 1608K Anon, 73M Header, 560M Other
606M Compressed, 3902M Uncompressed, 6.43:1 Ratio

The stats show:
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 238
Recycle Misses: 0
Mutex Misses: 0
Evict Skips: 1.04k
ARC Size: 17.28% 5.20 GiB
Target Size: (Adaptive) 100.00% 30.07 GiB
Min Size (Hard Limit): 12.50% 3.76 GiB
Max Size (High Water): 8:1 30.07 GiB
ARC Size Breakdown:
Recently Used Cache Size: 50.00% 15.03 GiB
Frequently Used Cache Size: 50.00% 15.03 GiB
ARC Hash Breakdown:
Elements Max: 247.31k
Elements Current: 100.00% 247.31k
Collisions: 7.20k
Chain Max: 3
Chains: 6.99k
------------------------------------------------------------------------
ARC Efficiency: 2.53m
Cache Hit Ratio: 88.31% 2.23m
Cache Miss Ratio: 11.69% 295.48k
Actual Hit Ratio: 88.24% 2.23m
Data Demand Efficiency: 87.76% 20.01k
CACHE HITS BY CACHE LIST:
Anonymously Used: 0.08% 1.69k
Most Recently Used: 21.64% 483.05k
Most Frequently Used: 78.28% 1.75m
Most Recently Used Ghost: 0.00% 0
Most Frequently Used Ghost: 0.00% 0
CACHE HITS BY DATA TYPE:
Demand Data: 0.79% 17.56k
Prefetch Data: 0.00% 0
Demand Metadata: 99.14% 2.21m
Prefetch Metadata: 0.08% 1.69k
CACHE MISSES BY DATA TYPE:
Demand Data: 0.83% 2.45k
Prefetch Data: 0.00% 0
Demand Metadata: 18.79% 55.52k
Prefetch Metadata: 80.38% 237.51k
Once a single trip through the file system via find is done, top shows:

ARC: 10G Total, 7161M MFU, 467M MRU, 1600K Anon, 191M Header, 2842M Other
     1647M Compressed, 11G Uncompressed, 7.12:1 Ratio
On the second iteration, find only takes:

0{ryzenbsd12}# time find . -type f -mtime -2d
./list.txt
./l
1.992u 69.557s 1:11.54 100.0% 35+177k 169144+0io 0pf+0w
0{ryzenbsd12}#
and the stats look appropriately better too:
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 238
Recycle Misses: 0
Mutex Misses: 0
Evict Skips: 1.04k
ARC Size: 34.11% 10.26 GiB
Target Size: (Adaptive) 100.00% 30.07 GiB
Min Size (Hard Limit): 12.50% 3.76 GiB
Max Size (High Water): 8:1 30.07 GiB
ARC Size Breakdown:
Recently Used Cache Size: 50.00% 15.03 GiB
Frequently Used Cache Size: 50.00% 15.03 GiB
ARC Hash Breakdown:
Elements Max: 688.43k
Elements Current: 100.00% 688.43k
Collisions: 53.65k
Chain Max: 4
Chains: 50.50k
------------------------------------------------------------------------
ARC Efficiency: 56.03m
Cache Hit Ratio: 98.07% 54.94m
Cache Miss Ratio: 1.93% 1.08m
Actual Hit Ratio: 97.64% 54.71m
Data Demand Efficiency: 86.21% 21.97k
CACHE HITS BY CACHE LIST:
Anonymously Used: 0.43% 237.54k
Most Recently Used: 12.19% 6.70m
Most Frequently Used: 87.37% 48.01m
Most Recently Used Ghost: 0.00% 0
Most Frequently Used Ghost: 0.00% 0
CACHE HITS BY DATA TYPE:
Demand Data: 0.03% 18.94k
Prefetch Data: 0.00% 0
Demand Metadata: 95.72% 52.59m
Prefetch Metadata: 4.24% 2.33m
CACHE MISSES BY DATA TYPE:
Demand Data: 0.28% 3.03k
Prefetch Data: 0.00% 0
Demand Metadata: 50.84% 550.75k
Prefetch Metadata: 48.88% 529.54k
------------------------------------------------------------------------
Anything else to adjust? I was going to use RAID 1+0 on SSDs for the dataset. Should I bother with an NVMe drive for L2ARC caching? On my test box I can roughly approximate how much RAM I need for metadata (11G, it seems); is there a better programmatic way to find that value?
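On the programmatic question, the ARC exports its current metadata footprint directly via kstats, which may be a cleaner measure than eyeballing top; and if an NVMe L2ARC is added, it can be restricted to metadata for this workload. A command sketch, where "tank", "tank/data", and "nvd0" are placeholder names and the exact kstat names should be checked on your release:

```shell
# Current ARC metadata usage and limit, in bytes:
sysctl kstat.zfs.misc.arcstats.arc_meta_used
sysctl kstat.zfs.misc.arcstats.arc_meta_limit

# If adding an NVMe cache device, bias the L2ARC toward metadata too:
zpool add tank cache nvd0
zfs set secondarycache=metadata tank/data
```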
        ---Mike
More information about the freebsd-questions
mailing list