The journalling file system saga

Peter Schuller peter.schuller at infidyne.com
Thu May 13 13:45:06 PDT 2004


Hello,

> I had to build a storage system this week with a capacity of 1.6TB.
> Regrettfully I decided to use Linux with XFS as the thought of waiting
> for fsck to complete in the event of a problem makes me wince. I
> experimented with FreeBSD, using two 800GB partitions and things like
> that, but in the end it comes back to the fsck if for any reason the
> machine goes down uncleanly.

I share your reaction to the thought of fsck-after-crash, though I have come
to appreciate soft updates lately after an obscene amount of googling.

IMO the primary advantage of soft updates over journaling is that write
operations can be deferred, so one can achieve good performance with write
caching disabled on the drive/RAID. Journaling will be either slower with
write caching turned off, or unsafe with it turned on.

The question is whether that applies to data as well as meta-data. I have not
yet found any information as to whether soft updates guarantees the ordering
of writes to non-meta-data (or: "Is it safe to run PostgreSQL with soft
updates?"). If anyone reading this has a clue, I'd love to hear it.

Unfortunately there are problems with soft updates, for me as a user. One
problem is degraded performance with bgfsck, which you have already mentioned.

Another problem is that bgfsck seems to be unsupported on the root filesystem
(something which I am trying to fix, but it's going slowly due to lack of
knowledge of FreeBSD as well as lack of time).

Yet another problem is that an fsync() no longer guarantees that data is on
disk, even with write caching disabled on the media. This doesn't break
things like PostgreSQL provided that the order of writes is preserved, but it
does break things like MTAs that want to guarantee that critical data has
been committed to persistent storage before signaling success to an external
entity (SMTP client).
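
To make the MTA requirement concrete, here is a minimal Python sketch of what
such a program expects from fsync(). The function name store_message_durably
and the spool layout are hypothetical, not taken from any real MTA, and it
assumes a POSIX system where a directory can be opened and fsync()ed. Whether
the data really is on disk when fsync() returns depends on exactly the
filesystem behaviour discussed here.

```python
import os


def store_message_durably(spool_dir, name, data):
    """Write data to spool_dir/name and return only after fsync() has been
    called on both the file and the directory that contains it."""
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    final_path = os.path.join(spool_dir, name)

    # "xb" fails if the temporary file already exists, so a half-written
    # leftover from an earlier crash is never silently reused.
    with open(tmp_path, "xb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push the data to the device

    # Make the message visible under its final name.
    os.rename(tmp_path, final_path)

    # The rename is a directory update, so sync the directory as well.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Only now would the caller tell the SMTP client that the message
    # has been accepted.
    return final_path
```

The point is the position of the fsync() calls: the success reply is sent
only after they return, so the application is relying on them for
durability and not just for ordering.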

A very big issue is that soft updates addresses multiple problems, but it's
an all-or-nothing choice. I can get good performance running "safely" (in
some circumstances) by using soft updates, but if I need safety for an MTA I
need to turn it off. But turning soft updates off not only decreases
performance, it *ALSO* creates the need for a full fsck after an unclean
shutdown. But what if I need safety *AND* do not wish to have a 30 minute
boot-up time? (Or in your case with 1.6 TB, I would imagine that's a LOT more
than just 30 minutes...)

A good solution might be to support *both* some kind of journaling/logging and 
soft updates. But to me that is still just a work-around for a broken 
foundation.

I believe the fundamental problem lies in the ambiguity of fsync(). The same
syscall is used to achieve different effects. A database like PostgreSQL with
write-ahead logging (WAL) is concerned with making sure certain data is
written before additional modifications are made (though see below). So it
uses fsync() to make sure everything is written before proceeding, thus
causing a degradation in performance.

But then comes qmail, which needs to guarantee the data in question is *on
disk*, and also uses fsync(). This time the intended effect is specifically
the goal of fsync(). In the former case the intended effect was an implicit
side-effect.
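
The database-style use can be sketched as well. This is a toy illustration in
Python, not PostgreSQL's actual WAL code; apply_with_wal and the record format
are made up for the example. The fsync() on the log is there only so that the
log record reaches disk before the data file is touched.

```python
import os


def apply_with_wal(log_path, data_path, offset, new_bytes):
    """Record an intended change in a log, sync the log, then apply the
    change to the data file in place."""
    # Toy record format: "<offset> <hex payload>\n".
    record = f"{offset} {new_bytes.hex()}\n".encode()

    with open(log_path, "ab") as log:
        log.write(record)
        log.flush()
        # Used as an ordering barrier: the log record must be written
        # before the data file is modified.
        os.fsync(log.fileno())

    # The data file must already exist; "r+b" updates it in place.
    with open(data_path, "r+b") as data:
        data.seek(offset)
        data.write(new_bytes)
        # No fsync() here: after a crash the change can be redone from the
        # log, so this write may be deferred.
```

Here all the application needs is "log before data". If the filesystem could
promise that ordering by itself, the fsync() would be unnecessary for
avoiding corruption, which is not true of the MTA case.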

Soft updates can satisfy PostgreSQL in terms of avoiding corruption (but not
in terms of guaranteeing a transaction is committed to persistent storage when
it returns) provided that both meta-data and all other data are guaranteed
to be written in the correct order (though again I don't know if this is the
case). But qmail is not served by this. A filesystem that fulfills the
requirements of qmail would also fulfill the requirements of PostgreSQL, but
it would also unnecessarily decrease performance for PostgreSQL.

> Is anyone remotely interested in this?

Yes, for the reasons mentioned above, and strictly for practical personal use
because I'd love to be able to share data between FreeBSD and Linux ;)

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org



More information about the freebsd-questions mailing list