Bad performance when accessing a lot of small files

Alfred Perlstein alfred at freebsd.org
Fri Dec 21 12:17:53 PST 2007


* Alexandre Biancalana <biancalana at gmail.com> [071219 11:35] wrote:
> Hi List,
> 
>   I have a backup server running FreeBSD 7-BETA3. The CPU is an
> Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz, with 3GB RAM and 10x 500GB
> SATA disks on an Areca ARC-1231-ML. The filesystem used to back up
> my other servers locally is built on top of the ARC-1231: a 4TB
> (32k stripe) ZFS filesystem with gzip compression.
> 
> This machine receives backups from ~30 servers (of all kinds and
> sizes: databases, fileservers, image servers, webservers, etc.) every
> night, writes the previous day's backups to LTO-3 tapes, and keeps
> some older days on disk.
> 
> The behavior that I'm observing, and that I want your help with, is
> that when the system accesses a directory with many small files
> (directories with ~1 million files of ~30kb each), performance is
> very poor.

There is a lot of very good tuning advice in this thread; however,
one thing to note is that keeping ~1 million files in a single
directory is a bad idea on just about any filesystem.

One trick a lot of people use is hashing the directory layout itself:
some computation on the filename is used to break this huge directory
into multiple smaller ones.

If you can figure out a hashing algorithm, that may help you.
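As a rough illustration (not anything from the original thread), a
one-level hash in Python might look like this; the bucket count and
base directory are placeholders:

    import hashlib
    import os

    BASE_DIR = "/backup/files"   # placeholder base directory

    def hashed_path(filename):
        # Use the first two hex digits of the filename's hash to pick one of
        # 256 subdirectories, e.g. "report.txt" -> "/backup/files/a7/report.txt".
        bucket = hashlib.md5(filename.encode()).hexdigest()[:2]
        return os.path.join(BASE_DIR, bucket, filename)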

For instance, if you tell sendmail to use "/var/spool/mq*"
for its mail spool and you happen to have 256 directories
under "/var/spool/" named "mq000" through "mq255", it will
randomly pick a directory to dump each file in.

This makes the performance a lot better.
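A minimal sketch of that idea (the spool path and directory names
follow the mq* pattern above; the random selection is only an
approximation of what sendmail does, not its actual code):

    import os
    import random

    SPOOL = "/var/spool"

    # Create the 256 queue directories mq000 .. mq255.
    for i in range(256):
        os.makedirs(os.path.join(SPOOL, "mq%03d" % i), exist_ok=True)

    def pick_queue_dir():
        # Pick one queue directory at random, spreading files evenly over them.
        return os.path.join(SPOOL, "mq%03d" % random.randrange(256))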

For one million files you can probably do a two-level hash;
you just have to figure out a good hashing algorithm.
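For example, a two-level version of the sketch above could use the
first two byte pairs of the hash for the two directory levels; with
256 x 256 = 65,536 leaf directories, one million files works out to
roughly 15 files per directory (again, the paths are placeholders):

    import hashlib
    import os

    BASE_DIR = "/backup/files"   # placeholder base directory

    def two_level_path(filename):
        # First hex pair picks the first level, second pair the second level,
        # e.g. "img.jpg" -> "/backup/files/3f/a1/img.jpg".
        digest = hashlib.md5(filename.encode()).hexdigest()
        return os.path.join(BASE_DIR, digest[:2], digest[2:4], filename)

    def store(filename, data):
        # Create the bucket directories on demand and write the file there.
        path = two_level_path(filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)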

If you can describe the data, I may be able to help
you come up with a hashing algorithm for it.

-Alfred

