Improving ZFS performance for large directories

Kevin Day toasty at dragondata.com
Wed Jan 30 00:06:06 UTC 2013


On Jan 29, 2013, at 5:42 PM, Matthew Ahrens <mahrens at delphix.com> wrote:

> On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day <toasty at dragondata.com> wrote:
> I'm prepared to try an L2arc cache device (with secondarycache=metadata),
> 
> You might first see how long it takes when everything is cached.  E.g. by doing this in the same directory several times.  This will give you a lower bound on the time it will take (or put another way, an upper bound on the improvement available from a cache device).
>  

Doing it twice back-to-back makes a bit of difference but it's still slow either way.

After not touching this directory for about 30 minutes:

# time ls -l >/dev/null
0.773u 2.665s 0:18.21 18.8%	35+2749k 3012+0io 0pf+0w

Immediately again:

# time ls -l > /dev/null
0.665u 1.077s 0:08.60 20.1%	35+2719k 556+0io 0pf+0w

18.2 vs. 8.6 seconds is an improvement, but even 8.6 seconds is longer than I was expecting.

> 
> For a specific filesystem, nothing comes to mind, but I'm sure you could cobble something together with zdb.  There are several tools to determine the amount of metadata in a ZFS storage pool:
> 
>  - "zdb -bbb <pool>"
>      but this is unreliable on pools that are in use

I tried this and it consumed >16GB of memory after about 5 minutes, so I had to kill it. I'll try it again during our next maintenance window, when it can be the only thing running.

>  - "zpool scrub <pool>; <wait for scrub to complete>; echo '::walk spa|::zfs_blkstats' | mdb -k"
>     the scrub is slow, but this can be mitigated by setting the global variable zfs_no_scrub_io to 1.  If you don't have mdb or equivalent debugging tools on freebsd, you can manually look at <spa_t>->spa_dsl_pool->dp_blkstats.
> 
> In either case, the "LSIZE" is the size that's required for caching (in memory or on a l2arc cache device).  At a minimum you will need 512 bytes for each file, to cache the dnode_phys_t.

Okay, thanks a bunch. I'll try this at the next chance I get, too.
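Incidentally, the 512-bytes-per-dnode_phys_t figure above makes it easy to ballpark the minimum cache footprint for a tree. A quick sketch — the file count here is a made-up example; on a real system it would come from something like `find /pool/dir | wc -l`:

```shell
#!/bin/sh
# Ballpark the minimum cache needed for dnodes alone, at 512 bytes per
# file (one dnode_phys_t each, per the thread above).
# FILES is a hypothetical count for illustration only.
FILES=1000000
DNODE_BYTES=$((FILES * 512))
echo "${FILES} files -> $((DNODE_BYTES / 1024 / 1024)) MiB minimum for dnodes"
```

This is only a lower bound — indirect blocks, directory ZAP blocks, and other metadata (the "LSIZE" figure from zdb) come on top of it.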


I think part of the issue is that nothing is being allowed to stay cached for long. We have several parallel rsyncs running at once, basically scanning every directory as fast as they can, combined with a bunch of rsync, HTTP and FTP clients. I'm guessing that with all that activity, things are getting shoved out of the cache pretty quickly.
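One way to sanity-check that theory would be to watch the ARC metadata counters while the rsyncs run. On FreeBSD these are exposed via `sysctl kstat.zfs.misc.arcstats`; the values below are hard-coded samples so the sketch is self-contained:

```shell
#!/bin/sh
# Sketch: compare ARC metadata usage against its limit. On a live FreeBSD
# box these would be read with something like:
#   sysctl -n kstat.zfs.misc.arcstats.arc_meta_used
#   sysctl -n vfs.zfs.arc_meta_limit
# The numbers below are made-up sample values for illustration.
arc_meta_used=3221225472    # 3 GiB (sample value)
arc_meta_limit=4294967296   # 4 GiB (sample value)
pct=$((arc_meta_used * 100 / arc_meta_limit))
echo "ARC metadata at ${pct}% of limit"
```

If that ratio sits pinned near 100% while the scans run, dnodes are likely being evicted as fast as they are read back in, which would explain why even the warm-cache `ls -l` is slow.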





More information about the freebsd-fs mailing list