ZFS and large directories - caveat report

Artem Belevich art at freebsd.org
Thu Jul 21 20:07:21 UTC 2011


On Thu, Jul 21, 2011 at 12:29 PM, Ivan Voras <ivoras at freebsd.org> wrote:
> On 21 July 2011 20:15, Artem Belevich <art at freebsd.org> wrote:
>> On Thu, Jul 21, 2011 at 9:38 AM, Ivan Voras <ivoras at freebsd.org> wrote:
>>> On 21 July 2011 17:50, Freddie Cash <fjwcash at gmail.com> wrote:
>>>> On Thu, Jul 21, 2011 at 8:45 AM, Ivan Voras <ivoras at freebsd.org> wrote:
>>>>>
>>>>> Is there an equivalent of the UFS dirhash memory setting for ZFS? (i.e.
>>>>> the size of the metadata cache)
>>>>
>>>> vfs.zfs.arc_meta_limit
>>>>
>>>> This sets the amount of ARC that can be used for metadata.  The default is
>>>> 1/8th of ARC, I believe.  This setting lets you use "primarycache=all"
>>>> (store metadata and file data in ARC) but then tune how much is used for
>>>> each.
>>>>
>>>> Not sure if that will help in your case or not, but it's a sysctl you can
>>>> play with.
>>>
>>> I don't think that it works, or at least it is not as efficient as dirhash:
>>>
>>> www:~> sysctl -a | grep meta
>>> kern.metadelay: 28
>>> vfs.zfs.mfu_ghost_metadata_lsize: 129082368
>>> vfs.zfs.mfu_metadata_lsize: 116224
>>> vfs.zfs.mru_ghost_metadata_lsize: 113958912
>>> vfs.zfs.mru_metadata_lsize: 16384
>>> vfs.zfs.anon_metadata_lsize: 0
>>> vfs.zfs.arc_meta_limit: 322412800
>>> vfs.zfs.arc_meta_used: 506907792
>>> kstat.zfs.misc.arcstats.demand_metadata_hits: 4471705
>>> kstat.zfs.misc.arcstats.demand_metadata_misses: 2110328
>>> kstat.zfs.misc.arcstats.prefetch_metadata_hits: 27
>>> kstat.zfs.misc.arcstats.prefetch_metadata_misses: 51
>>>
>>> arc_meta_used is nearly 500 MB, which should be enough even in this
>>> case. With filenames of 32 characters, all the filenames alone for
>>> 130,000 files in a directory take about 4 MB - I doubt ZFS
>>> introduces so much extra metadata that it doesn't fit in 500 MB.
>>
>> For what it's worth, 500K files in one directory seems to work
>> reasonably well on my box running a few-weeks-old 8-stable (quad core,
>> 8GB RAM, ~6GB ARC), with a ZFSv28 pool on a 2-drive mirror + 50GB L2ARC.
>>
>> $ time perl -e 'use Fcntl; for $f  (1..500000)
>> {sysopen(FH,"f$f",O_CREAT); close(FH);}'
>> perl -e  >| /dev/null  2.26s user 39.17s system 96% cpu 43.156 total
>>
>> $ time find . |wc -l
>>  500001
>> find .  0.16s user 0.33s system 99% cpu 0.494 total
>>
>> $ time find . -ls |wc -l
>>  500001
>> find . -ls  1.93s user 12.13s system 96% cpu 14.643 total
>>
>> $ time find . |xargs -n 100 rm
>> find .  0.22s user 0.28s system 0% cpu 2:45.12 total
>> xargs -n 100 rm  1.25s user 58.51s system 36% cpu 2:45.61 total
>>
>> Deleting files resulted in a constant stream of writes to the hard
>> drives. I guess file deletion may end up being a synchronous write
>> committed to the ZIL right away. If that's indeed the case, a small
>> slog on an SSD could probably speed up file deletion a bit.
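
If that's really what's going on, attaching a dedicated log device
would be an easy thing to try; a minimal sketch, with pool and device
names as placeholders (untested here):

$ zpool add tank log gpt/slog0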
>
> That's a very interesting find.
>
> Or maybe the issue is fragmentation: could you modify the script
> slightly to create files in about 50 directories in parallel (i.e.
> create in dir1, create in dir2, create in dir3... create in dir50,
> then again create in dir1, create in dir2...)?

Scattering across 50 directories works about as fast:

$ time perl -e 'use Fcntl; $dir = 0; for $f  (1..500000)
{sysopen(FH,"$dir/f$f",O_CREAT); close(FH); $dir=($dir+1) % 50}'
>|/dev/null
perl -e  >| /dev/null  2.77s user 38.31s system 85% cpu 47.829 total

$ time find . |wc -l
  500051
find .  0.16s user 0.36s system 29% cpu 1.787 total

$ time find . -ls |wc -l
  500051
find . -ls  1.75s user 11.33s system 92% cpu 14.196 total

$ time find . -name f\* | xargs -n 100 rm
find . -name f\*  0.17s user 0.35s system 0% cpu 3:23.44 total
xargs -n 100 rm  1.35s user 52.82s system 26% cpu 3:23.75 total
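
To check whether those removes really do turn into synchronous writes,
the easiest thing is probably to watch the pool while the rm runs; a
rough sketch, with the pool name as a placeholder:

# per-vdev I/O in 1-second samples while the deletes run
$ zpool iostat -v tank 1
# or the GEOM-level view of the same disks
$ gstat

Comparing the same rm run with and without a slog attached should then
show whether the ZIL theory holds.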

>
> Could you, for the sake of curiosity, upgrade this system to the latest
> 8-stable and retry it?

I'm currently running 8.2-STABLE r223055. The log does not show
anything particularly interesting committed to the ZFS code since then.
There was an LBOLT overflow fix, but it should not be relevant in this
case. I do plan to upgrade the box, though it's not going to happen
for another week or so. If the issue is still relevant then, I'll be
happy to re-run the test.
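
In case it's useful, this is the sort of check I mean; the repository
URL and path are from memory, so double-check them:

$ svn log -r 223055:HEAD \
    svn://svn.freebsd.org/base/stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs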

--Artem

