Millions of small files: best filesystem / best options

Devin Teske devin.teske at fisglobal.com
Mon May 28 16:25:25 UTC 2012


On May 28, 2012, at 6:21 AM, Alessio Focardi wrote:

> Hi,
> 
> I'm pretty new to BSD, but I do have some knowledge in Linux. 
> 
> I'm looking for some advice to efficiently pack millions of small files (200 bytes or less) over a freebsd fs.
> 

This is something we've been doing (on FreeBSD) for almost 15 years now (starting with FreeBSD 2.1.5; now 8.1, and soon 8.3). We started with UFS1 and have been evaluating ZFS (we don't think SU+J is ready for production at this scale yet). We haven't used UFS2 yet but have no doubt that it's just as strong as UFS1.


> Those files will be stored in an hierarchical directory structure to limit the number of files for any directory and so (I hope!) speed up file lookups/deletion.
> 

FreeBSD handles this wonderfully thanks to all the people that have put in time and effort over the years.
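
As a rough sketch of the hierarchical layout you describe (this is not lifted from our production setup): hash each file name and use two bytes of the hash as directory levels, so that no single directory holds more than a small share of the files. The function name shard_path and the 256 x 256 fan-out are assumptions made for illustration; cksum(1) is used only because it is a standard utility.

```sh
#!/bin/sh
# shard_path NAME: print a two-level relative path "xx/yy/NAME".
# Hypothetical helper; the 256 x 256 fan-out is an arbitrary choice.
shard_path() {
    name=$1
    # cksum prints "<crc> <length>"; keep only the CRC.
    crc=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    # The low byte and the next byte of the CRC become the directory levels.
    printf '%02x/%02x/%s\n' $(( crc % 256 )) $(( (crc / 256) % 256 )) "$name"
}

# Example: show where a file would live (run mkdir -p on the dirname to create it).
shard_path "img_000123.dat"
```

With 65536 leaf directories, ten million files work out to roughly 150 files per directory, assuming the hash spreads names evenly.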

Ten years ago (circa FreeBSD 4.0-RELEASE), people at the company I now work for commonly:
- fiddled with the dirhash sysctl(8) MIB
- modified fsck(8) to make it more efficient
- modified tar(1) to handle high numbers of hard-links without falling over
- modified du(1) in a similar fashion to tar above
- more; all in the name of doing what you're describing (but on steroids)

but all those patches eventually made their way back into FreeBSD and we generally haven't had to worry about even tens-of-millions of JPEG-sized (~200KB) files on a RAID formatted in UFS (1 or 2) since, say, FreeBSD-6 (but someone in FS will be able to give a more accurate release when things really started to stabilize). Either way, 6, 7, 8, and 9 all had very stable filesystems w/respect to millions-of-small-files.



> I have to say that I'm looking at fbsd for my project because both UFS2 and ZFS have some flavour of "block suballocation" "tail packing" "variable record size", at least documentation says so.
> 
> My hope is to waste as less space as possible, even sacrificing some speed: can't use a full block for a single file: I will end up wasting 99% of the space!
> 

I wasn't aware that FreeBSD was unique in this respect, but yes, FreeBSD has a block size and a fragment size. While formatting a UFS filesystem you can specify these sizes with the "-b SIZE" and "-f SIZE" arguments to newfs(8), for example:

	newfs -b 16384 -f 2048 /dev/da0s1a

This will format a RAID (/dev/da0s1a) with a 16K block size but a 2K fragment size. A file holding only a few bytes will then occupy a single 2K fragment rather than a full 16K block (a truly empty file created with touch(1) needs only an inode). This is the "block suballocation" you speak of. The above parameters are exactly what we use when formatting our RAIDs for storing millions of JPEG-sized (~200KB) files.
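
To put numbers on the space question for 200-byte files, here is a back-of-the-envelope sketch. It assumes a simplified model in which a small file's data is rounded up to whole fragments and it ignores inodes and directory entries; the helper names alloc_bytes and waste_pct are made up for this example.

```sh
#!/bin/sh
# alloc_bytes SIZE FRAG: data space a small file occupies, rounded up
# to whole fragments (simplified model, no inode or directory overhead).
alloc_bytes() {
    size=$1
    frag=$2
    echo $(( (size + frag - 1) / frag * frag ))
}

# waste_pct SIZE FRAG: integer percentage of the allocation left unused.
# SIZE must be greater than zero.
waste_pct() {
    alloc=$(alloc_bytes "$1" "$2")
    echo $(( (alloc - $1) * 100 / alloc ))
}

# Compare the 2K fragments from the example above with 512-byte fragments.
for frag in 2048 512; do
    echo "200-byte file, ${frag}-byte fragments: $(alloc_bytes 200 "$frag") bytes allocated, $(waste_pct 200 "$frag")% unused"
done
```

If the numbers favor smaller fragments, something like "newfs -b 4096 -f 512" keeps the usual 8:1 block-to-fragment ratio, but check newfs(8) on your release for the allowed sizes before relying on that.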


> 
> Do someone got some experience in a similar situation, and it's willing to give some advice on which fs I should choose and how to tune it for this particular scenario?
> 

Choose your hardware wisely. After you have chosen your hardware wisely, set it up even more wisely.

For example, we go through a multi-day burn-in process on RAIDs that have double-digit numbers of disks.

Be smart about how you allocate the logical versus physical media in a way that reduces bottlenecks.

Go through any/all failure/recovery test procedures before putting data on the device if you don't already trust the hardware. Trust in the hardware is very important. If you don't trust your hardware's battery-backed DIMM for write-back cache (for example), then I have one very important recommendation when it comes to UFS: disable the SoftUpdates feature.

Disabling SoftUpdates on a UFS filesystem causes a huge performance hit, but it will allow you to sleep at night. In 15 years, UFS has never barfed on us except for maybe 3 memorable events, which entire groups of individuals can recount with amazing clarity: debugging horked filesystems late into the night after SoftUpdates ate the kid's homework (leaving tens- to hundreds-of-thousands of files in lost+found). We routinely use SoftUpdates on _other_ UFS filesystems (like system partitions including "/var" and "/usr"), but _never_ on the RAIDs housing those millions-of-little-files.

Others' mileage may vary.
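
For reference, a sketch of how SoftUpdates can be inspected and turned off on an existing filesystem with tunefs(8). The device name is the one from the newfs example earlier; the wrapper function and its RUN variable are invented here so that the commands are only printed unless you explicitly ask for them to be executed. tunefs wants the filesystem unmounted (or mounted read-only) when changing this.

```sh
#!/bin/sh
# softupdates_off DEV: print the tunefs(8) commands that show the current
# settings and disable soft updates; execute them only when RUN=1 is set.
# Hypothetical wrapper; unmount the filesystem before running for real.
softupdates_off() {
    dev=$1
    for cmd in "tunefs -p $dev" "tunefs -n disable $dev"; do
        if [ "${RUN:-0}" = "1" ]; then
            $cmd
        else
            echo "$cmd"
        fi
    done
}

# Dry run: only prints the two commands.
softupdates_off /dev/da0s1a
```

For a fresh filesystem, leaving the -U flag off the newfs(8) command line should have the same effect, since -U is what enables soft updates at creation time.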

> 
> Thank you very much, appreciated!
> 
> 

No problem.


> ps
> 
> I know that probably a database will fit better in this situation, but in my case I can't take that route :(
> 

Not necessarily. A database has the immediate-and-clear down-side that if one bit in the database changes, a backup tool like bacula has to back up the entire database again.

…and the database administrator is not necessarily the same person as the backup administrator (just sayin').
-- 
Devin



More information about the freebsd-fs mailing list