The journalling file system saga
Peter Schuller
peter.schuller at infidyne.com
Thu May 13 13:45:06 PDT 2004
Hello,
> I had to build a storage system this week with a capacity of 1.6TB.
> Regrettfully I decided to use Linux with XFS as the thought of waiting
> for fsck to complete in the event of a problem makes me wince. I
> experimented with FreeBSD, using two 800GB partitions and things like
> that, but in the end it comes back to the fsck if for any reason the
> machine goes down uncleanly.
I share your reaction to the thought of fsck-after-crash, though I have come
to appreciate softupdates lately after an obscene amount of googling.
IMO the primary advantage of soft updates over journaling is that write
operations can be deferred, so one can get good performance even with write
caching disabled on the drive/RAID. Journaling, by contrast, will be either
slower with write caching turned off, or unsafe with it turned on.
The question is whether that applies to data as well as meta-data. I have not
yet found any information as to whether soft updates guarantees the order of
non-meta-data writes (or: "Is it safe to run PostgreSQL with soft updates?").
If anyone reading this has a clue, I'd love to hear it.
Unfortunately there are problems with soft updates, for me as a user. One
problem is degraded performance with bgfsck, which you have already mentioned.
Another problem is that bgfsck seems to be unsupported on the root filesystem
(something which I am trying to fix, but it's going slowly due to lack of
knowledge of FreeBSD as well as lack of time).
Yet another problem is that an fsync() no longer guarantees that data is on
disk, even with write caching disabled on the media. This doesn't break
things like PostgreSQL provided that the order of writes is preserved, but it
does break things like MTAs that want to guarantee that critical data has
been committed to persistent storage before signaling success to an external
entity (SMTP client).
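
To make the MTA requirement concrete, here is a minimal sketch of the
"store durably, then acknowledge" pattern. It is an illustration under stated
assumptions, not qmail's actual code: the function name store_then_ack, the
".tmp" suffix and the "250 OK" return value are all hypothetical, and it
assumes a POSIX system where fsync() on a directory descriptor is permitted.
The guarantee is of course only as strong as what fsync() really does on the
filesystem underneath.

```python
import os


def store_then_ack(spool_dir, name, data):
    """Write data to spool_dir/name, force it to disk, then return a reply.

    Hypothetical helper for illustration; the reply string stands in for
    whatever an MTA would send to the SMTP client.
    """
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    final_path = os.path.join(spool_dir, name)

    # "xb" fails if the temporary file already exists.
    with open(tmp_path, "xb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push it to the disk

    # Make the message visible under its final name.
    os.rename(tmp_path, final_path)

    # The rename is a directory update, so sync the directory as well.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Only now is it reasonable to tell the client the message was accepted.
    return "250 OK"
```

The point is the position of the return statement: success is reported only
after every fsync() call has returned.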
A very big issue is that soft updates addresses multiple problems - but it's
an all-or-nothing choice. I can get good performance running "safely" (in
some circumstances) by using soft updates, but if I need safety for an MTA I
need to turn it off. But turning soft updates off does not only have the
effect of decreasing performance, it *ALSO* creates the need for a full fsck
after an unclean shutdown. But what if I need safety *AND* do not wish to
have a 30 minute boot-up time? (Or in your case with 1.6 TB, I would imagine
that's a LOT more than just 30 minutes...)
A good solution might be to support *both* some kind of journaling/logging and
soft updates. But to me that is still just a work-around for a broken
foundation.
I believe the fundamental problem lies in the ambiguity of fsync(). The same
syscall is used to achieve different effects. A database like PostgreSQL with
write-ahead logging (WAL) is concerned with making sure certain data is
written before additional modifications are made (though see below). So it
uses fsync() to make sure everything is written before proceeding - thus
causing a degradation in performance.
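
As a rough sketch of fsync() being used purely as an ordering barrier, the
following appends a log record, forces it out, and only then modifies the data
file in place. This is not PostgreSQL's actual code: the function wal_update
and the record layout (8-byte offset, 4-byte length, payload) are assumptions
made up for the example.

```python
import os


def wal_update(log_path, data_path, offset, new_bytes):
    """Log an intended change, sync the log, then apply the change.

    Hypothetical helper for illustration. Returns the size of the log
    record that was appended.
    """
    record = (
        offset.to_bytes(8, "big")
        + len(new_bytes).to_bytes(4, "big")
        + new_bytes
    )

    # Step 1: append the record to the log and force it to disk.
    # The fsync() here is wanted only for its ordering effect.
    with open(log_path, "ab") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno())

    # Step 2: only after the log is safe, overwrite the data in place.
    # The data file must already exist.
    with open(data_path, "r+b") as data:
        data.seek(offset)
        data.write(new_bytes)

    return len(record)
```

All the caller needs is "step 1 reaches the disk before step 2"; a filesystem
that guaranteed that ordering by itself would make the blocking fsync()
unnecessary for avoiding corruption.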
But then comes qmail which needs to guarantee the data in question is *on
disk*, and also uses fsync(). This time the intended effect is precisely the
stated purpose of fsync(). In the former case the intended effect (ordering)
was only an implicit side-effect.
PostgreSQL's needs can be met in terms of avoiding corruption (but not in
terms of guaranteeing a transaction is committed to persistent storage when it
returns) by soft updates, provided that both meta-data and all other data are
guaranteed to be written in the correct order (though again I don't know if
this is the case). But qmail is not served by this. A filesystem that fulfills
the requirements of qmail would also fulfill the requirements of PostgreSQL -
but it would also unnecessarily decrease performance.
> Is anyone remotely interested in this?
Yes, for the reasons mentioned below, and strictly for practical personal use
because I'd love to be able to share data between FreeBSD and Linux ;)
--
/ Peter Schuller, InfiDyne Technologies HB
PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org