UFS2 tuning for heterogeneous 4TB file system

Sun Jul 26 14:20:05 UTC 2009

On Sun, Jul 26, 2009 at 3:56 AM, b. f.<bf1783 at googlemail.com> wrote:
>>The file system in question will not have a common file size (which is
>>what, as I understand, bytes per inode should be tuned for). There
>>will be many small files (< 10 KB) and many large ones (> 500 MB). A
>>similar, in terms of content, 2TB NTFS file system on another server
>>has an average file size of about 26 MB with 59,246 files.
>
> Ordinarily, it may have a large variation in file sizes, but can you
> intervene, and segregate large and small files in separate
> filesystems, so that you can optimize the settings for each
> independently?

That's a good idea, but the problem is that this RAID array will grow
in the future as I add additional drives. As far as I know, a
partition can be expanded using growfs, but it cannot be moved to a
higher address (with any "standard" tools). So if I create two
separate partitions for different file types, the first partition will
have to remain a fixed size. That would be problematic, since I cannot
easily predict how much space it would need initially and for all
future purposes (enough to store all the files, yet not waste space
that could otherwise be used for the second partition).
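
Since a single file system would then have to serve both the small and
the large files, the trade-off can be put into rough numbers. The
following is only a sketch under a simplified model: it assumes a
small file's data is rounded up to whole fragments, that "4TB" means
4 TiB, and that the inode count is simply the file system size divided
by the bytes-per-inode value. newfs(8) adjusts the real figures per
cylinder group, so treat the output as an estimate. The function names
are made up for illustration.

```python
# Rough sizing sketch for a UFS2 file system.
# Simplified model; newfs(8) will not produce exactly these figures.

UFS2_INODE_BYTES = 256  # on-disk size of one UFS2 inode


def allocated_bytes(file_size, frag_size):
    """Data space for a small file, rounded up to whole fragments."""
    return -(-file_size // frag_size) * frag_size  # ceiling division


def slack_bytes(file_size, frag_size):
    """Bytes allocated to the file but not used by its contents."""
    return allocated_bytes(file_size, frag_size) - file_size


def inode_count(fs_bytes, bytes_per_inode):
    """Approximate number of inodes for a given inode density."""
    return fs_bytes // bytes_per_inode


def inode_table_bytes(fs_bytes, bytes_per_inode):
    """Approximate space consumed by the inodes themselves."""
    return inode_count(fs_bytes, bytes_per_inode) * UFS2_INODE_BYTES


if __name__ == "__main__":
    fs_bytes = 4 * 2**40  # assumption: "4TB" taken as 4 TiB

    # Slack for a hypothetical 1000-byte file at two fragment sizes.
    for frag in (2048, 4096):
        print("frag", frag, "slack", slack_bytes(1000, frag))

    # Inode count and inode space at two bytes-per-inode densities.
    for density in (64 * 1024, 1024 * 1024):
        print("density", density,
              "inodes", inode_count(fs_bytes, density),
              "inode bytes", inode_table_bytes(fs_bytes, density))
```

Under these assumptions a 4 KB fragment keeps the worst-case slack per
small file below 4 KB, while a larger bytes-per-inode value cuts the
number of inodes, which should generally leave fsck with less to scan.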

>>Ideally, I would prefer that small files do not waste more than 4 KB
>>of space, which is what you have with NTFS. At the same time, having
>>fsck running for days after an unclean shutdown is also not a good
>>option (I always disable background checking). From what I've gathered
>>so far, the two requirements are at the opposite ends in terms of file
>>system optimization.
>
> I gather you are trying to be conservative, but have you considered
> using gjournal(8)?  At least for the filesystems with many small
> files?  In that way, you could safely avoid the need for most if not
> all use of fsck(8), and, as an adjunct benefit, you would be able to
> operate on the small files more quickly:
>
> http://lists.freebsd.org/pipermail/freebsd-current/2006-June/064043.html
> http://www.freebsd.org/doc/en_US.ISO8859-1/articles/gjournal-desktop/article.html
>
> gjournal has a lower overhead than ZFS, and has proven to be fairly
> reliable.  Also, you can always unhook it and revert to plain UFS
> mounts easily.
>
> b.
>

Just fairly reliable? :)

I've done a bit of reading on gjournal, and the main thing that's
preventing me from using it is how recent the implementation is. I've
had a number of FreeBSD servers go down in the past due to power
outages, and SoftUpdates with foreground fsck has never failed me. I
have never had a corrupt UFS2 partition, which is not something I can
say about a few Linux servers with ext3.

Have there been any serious studies into how gjournal and SU deal with
power outages? By that I mean taking two identical machines, issuing
write operations, yanking the power cords, and then watching both
systems recover? I'm sure that gjournal will take less time to reboot,
but if this experiment is repeated a few hundred times I wonder what
the corruption statistics would be. Is there ever a case, for
instance, when the journal itself becomes corrupt because the power
was pulled in the middle of a metadata flush?
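
To put a number on "a few hundred times": if every pull-the-plug trial
is treated as independent (an assumption, and probably an optimistic
one) and no corruption is seen in any of them, the observed result
still only bounds the true corruption rate from above. A minimal
sketch of that bound, with a made-up function name:

```python
def zero_failure_upper_bound(trials, confidence=0.95):
    """Upper bound on the per-trial failure probability p when
    `trials` independent runs showed no failures at all.

    Solves (1 - p) ** trials = 1 - confidence for p.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


if __name__ == "__main__":
    # Hypothetical trial counts, not measured data.
    for n in (100, 300, 1000):
        print(n, zero_failure_upper_bound(n))
```

For 300 clean trials this gives a bound just under 1% per power loss,
so a few hundred runs could not distinguish two systems that each
corrupt data, say, once in a thousand outages.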

Basically, I have no experience with gjournal, poor experience with
other journaled file systems, and no real comparison between the
reliability characteristics of gjournal and those of SoftUpdates,
which has served me very well in the past.

- Max