ZFS directory with a large number of files

Tue Aug 2 09:30:23 UTC 2011

On Tue, Aug 2, 2011 at 9:39 AM, seanrees at gmail.com <seanrees at gmail.com> wrote:
> On my FreeBSD 8.2-S machine (built circa 12th June), I created a
> directory and populated it over the course of 3 weeks with about 2
> million individual files. As you might imagine, a 'ls' of this
> directory took quite some time.

What takes the time in a listing like that usually isn't ZFS but
the sorting done by ls(1). Running ls(1) with -f (output is not
sorted) usually speeds things up enormously.
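
A minimal sketch of what I mean, run against a scratch directory rather
than your real data. The function name count_unsorted and the img-N.jpg
file names are made up for illustration; only ls -f, wc -l and mktemp -d
are real tools here.

```shell
# Count the entries of a directory without letting ls(1) sort them.
# -f disables sorting; it also implies -a, so "." and ".." are counted too.
count_unsorted() {
    ls -f "$1" | wc -l
}

# Hypothetical scratch directory, so no real data is touched.
demo_dir=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
    : > "$demo_dir/img-$i.jpg"
    i=$((i + 1))
done

count_unsorted "$demo_dir"
rm -rf "$demo_dir"
```

Prefixing the ls call with time(1), once with -f and once without, should
show how much of the wall-clock time goes to sorting in a given directory.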

> The files were conveniently named with a timestamp in the filename
> (still images from a security camera, once per second) so I've since
> moved them all to timestamped directories (yyyy/MM/dd/hh/mm). What I
> found though was the original directory the images were in is still
> very slow to ls -- and it only has 1 file in it, another directory.

That is strange... and shouldn't happen. According to the ZFS
Performance Wiki [1], operations on ZFS file systems are supposed to
be pretty efficient:

  Concurrent, constant time directory operations

  Large directories need constant time operations (lookup, create,
  delete, etc). Hot directories need concurrent operations. ZFS uses
  extensible hashing to solve this. Block based, amortized growth
  cost, short chains for constant time ops, per-block locking for high
  concurrency. A caveat is that readdir returns entries in hash order.

  Directories are implemented via the ZFS Attribute Processor (ZAP) in
  ZFS. ZAP can be used to store arbitrary name-value pairs. ZAP uses two
  algorithms, optimized for large lists (large directories) and
  small lists (attribute lists).

  The ZAP implementation is in zap.c and zap_leaf.c. Each directory is
  maintained as a table of pointers to constant sized buckets holding
  a variable number of entries. Each directory record is 16k in
  size. When this block gets full, a new block of the next
  power-of-two size is allocated.

  A directory starts off as a microzap and is then upgraded to a fat zap
  (via mzap_upgrade) if the length of a name exceeds MZAP_NAME_LEN
  (MZAP_ENT_LEN - 8 - 4 - 2, i.e. 50) or if the size of the microzap
  exceeds MZAP_MAX_BLKSZ (128k).

[1]: http://www.solarisinternals.com/wiki/index.php/ZFS_Performance
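
The upgrade rule quoted above can be sketched as a small shell function.
This is only an illustration of the wiki's wording, not the actual ZFS
code: needs_fat_zap is a made-up name, and MZAP_ENT_LEN=64 is my
assumption (it is the value that makes MZAP_NAME_LEN come out as the 50
mentioned in the quote).

```shell
# Hypothetical sketch of the microzap -> fat zap rule as worded in the wiki.
MZAP_ENT_LEN=64                               # assumption: bytes per microzap entry
MZAP_NAME_LEN=$((MZAP_ENT_LEN - 8 - 4 - 2))   # evaluates to 50
MZAP_MAX_BLKSZ=$((128 * 1024))                # 128k, as quoted

# $1 = length of the longest entry name, $2 = microzap size in bytes.
# Succeeds (exit 0) when either limit from the quote is exceeded.
needs_fat_zap() {
    [ "$1" -gt "$MZAP_NAME_LEN" ] || [ "$2" -gt "$MZAP_MAX_BLKSZ" ]
}

if needs_fat_zap 20 $((2000000 * MZAP_ENT_LEN)); then
    echo "fat zap"
else
    echo "microzap"
fi
```

Under these assumptions, a directory holding about 2 million entries is
far past the 128k limit, so it would have been upgraded to a fat zap long
before the files were moved out.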

I don't know what's going on there, but someone with ZFS internals
expertise may want to have a closer look.

> To clarify:
> % ls second
> [lots of time and many many files enumerated]
> % # rename files using rename script
> % ls second
> [wait ages]
> 2011 dead
> % mkdir second2 && mv second/2011 second2
> % ls second2
> [fast!]
> 2011
> % ls second
> [still very slow]
> dead
> % time ls second
> dead/
> gls -F --color  0.00s user 1.56s system 0% cpu 3:09.61 total
>
> (timings are similar for /bin/ls)
>
> This data is stored on a striped ZFS pool (version 15, though the
> kernel reports version 28 is available but zpool upgrade seems to
> disagree), 2T in size. I've run zpool scrub with no effect. ZFS is
> busily driving the disks away; my iostat monitoring has all three
> drives in the zpool running at 40-60% busy for the duration of the ls
> (it was quiet before).
>
> I've attached truss to the ls process. It spends a lot of time here:
> fstatfs(0x5,0x7fffffffe0d0,0x800ad5548,0x7fffffffdfd8,0x0,0x0) = 0 (0x0)

That's a very good hint indeed!

> I'm thinking there's some old ZFS metadata that it's looking into, but
> I'm not sure how to best dig into this to understand what's going on
> under the hood.
>
> Can anyone perhaps point me the right direction on this?
>
> Thanks,
>
> Sean

Regards,
-cpghost.

-- 
Cordula's Web. http://www.cordula.ws/