Journaling UFS with gjournal.

Mon Jun 19 13:13:59 UTC 2006

Hello.

For the last few months I have been working on the gjournal project.
To avoid confusion up front: this project is not related to the
gjournal project that Ivan Voras worked on during the last SoC (2005).

The lack of a journaled file system in FreeBSD has been an Achilles'
heel for many years. We do have many file systems, but none with
journaling:
- ext2fs (journaling is in ext3fs),
- XFS (read-only),
- ReiserFS (read-only),
- HFS+ (read-write, but without journaling),
- NTFS (read-only).

GJournal was designed to journal GEOM providers, so it actually works
below the file system layer, but it has hooks that allow it to work
with file systems. In other words, gjournal is not file
system-dependent; it can probably work with any file system, given
minimal knowledge about it. I implemented only UFS support.

The patches are here:

	http://people.freebsd.org/~pjd/patches/gjournal.patch (for HEAD)
	http://people.freebsd.org/~pjd/patches/gjournal6.patch (for RELENG_6)

To patch your sources you need to:

	# cd /usr/src
	# mkdir sbin/geom/class/journal sys/geom/journal sys/modules/geom/geom_journal
	# patch < /path/to/gjournal.patch

Add 'options UFS_GJOURNAL' to your kernel configuration file and
rebuild the kernel and world.

How it works (in short). You may define one or two providers which
gjournal will use. If one provider is given, it will be used for both
data and journal. If two providers are given, one will be used for data
and one for the journal.
Every few seconds (you may define how many) the journal is terminated
and marked as consistent, and gjournal starts to copy data from it to
the data provider. At the same time, new data are stored in a new
journal. Let's call the moment at which the journal is terminated a
"journal switch". A journal switch looks as follows:
1. Start a journal switch when the timeout expires or when we run out
   of cache. Don't perform a journal switch if there were no write
   requests.
2. If we have a file system, synchronize it.
3. Mark the file system as clean.
4. Block all write requests to the file system.
5. Terminate the journal.
6. If needed, wait until copying of the previous journal has
   finished.
7. Send a BIO_FLUSH request (if the given provider supports it).
8. Mark the new journal position on the journal provider.
9. Unblock write requests.
10. Start copying data from the terminated journal to the data provider.
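
As an illustration only (this is not the gjournal code), the switch
sequence above can be modeled with a toy in-memory journal in Python.
The class ToyJournal and all of its method names are hypothetical; file
system synchronization, write blocking and BIO_FLUSH are reduced to log
entries.

```python
class ToyJournal:
    """Toy in-memory model of a journal switch; not the real GEOM code."""

    def __init__(self):
        self.data = {}       # data provider: offset -> payload
        self.active = []     # journal currently receiving writes
        self.pending = None  # terminated journal not yet copied
        self.log = []        # record of the steps taken

    def write(self, offset, payload):
        # New writes always land in the active journal.
        self.active.append((offset, payload))

    def copy_pending(self):
        # Step 10: copy the terminated journal to the data provider.
        if self.pending is not None:
            for offset, payload in self.pending:
                self.data[offset] = payload
            self.pending = None
            self.log.append("copy")

    def switch(self):
        # Step 1: no switch if there were no write requests.
        if not self.active:
            return False
        self.log.append("sync and mark clean")     # steps 2-3
        self.log.append("block writes")            # step 4
        terminated, self.active = self.active, []  # step 5
        self.copy_pending()                        # step 6
        self.log.append("flush")                   # step 7
        self.pending = terminated                  # step 8
        self.log.append("unblock writes")          # step 9
        return True


journal = ToyJournal()
journal.write(0, b"old")
journal.write(512, b"x")
journal.switch()
journal.write(0, b"new")  # goes to the new journal, not to the data yet
journal.copy_pending()    # data provider now holds the first journal
```

After the last call the data provider holds the contents of the first
journal, while the write of b"new" still sits in the new journal until
the next switch.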

There were a few things I needed to implement outside gjournal to make
it work reliably:

- The BIO_FLUSH request. Currently we have three I/O requests: BIO_READ,
BIO_WRITE and BIO_DELETE. I added BIO_FLUSH, which means "flush your
write cache". The request is always sent with the biggest possible
bio_offset (the mediasize of the destination provider), so it works
properly with bioq_disksort(). The caller needs to hold back further
I/O requests until BIO_FLUSH returns, so we don't get a starvation
effect.
The hard part is that it has to be implemented in every disk driver,
because flushing the cache is a driver-dependent operation. I
implemented it for ata(4) disks and amr(4). The good news is that it's
easy.
GJournal can also work with providers that don't support BIO_FLUSH, and
in my power-failure tests it worked well (no problems), but that
depends on the gjournal cache being bigger than the controller cache,
so it is hard to call it reliable.
The documentation of many journaled file systems tells you to turn off
the write cache if you want to use them. This is not the case for
gjournal (especially when your disk driver supports BIO_FLUSH).
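
To see why the largest offset matters, here is a small Python sketch.
It assumes a queue that is simply kept ordered by offset, which is a
simplified stand-in for an elevator sort and not bioq_disksort() itself;
MEDIASIZE and enqueue_sorted() are hypothetical names.

```python
import bisect

MEDIASIZE = 1 << 20  # hypothetical provider size in bytes


def enqueue_sorted(queue, request):
    """Insert (offset, kind) so that the queue stays ordered by offset."""
    offsets = [offset for offset, _ in queue]
    queue.insert(bisect.bisect_right(offsets, request[0]), request)


queue = []
for offset in (4096, 512, 65536):
    enqueue_sorted(queue, (offset, "write"))

# Every write offset is below the mediasize, so a flush tagged with
# the mediasize sorts behind all writes that are already queued.
enqueue_sorted(queue, (MEDIASIZE, "flush"))
```

In this model the flush is the last entry of the queue, so it is served
only after the writes it is supposed to push to the media.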

- The 'gjournal' mount option. To implement gjournal support in UFS I
needed to change the way deleted, but still open, objects are handled.
Currently, when a file or directory is open and we delete the last name
that references it, it remains usable by those who keep it open. When
the last consumer closes it, the inode and blocks are freed.
On a journal switch I cannot leave such objects around, because after a
crash fsck(8) is not used to check the file system, so the inode and
blocks would never be freed. When a file system is mounted with the
'gjournal' mount option, such objects are not removed while they are
open. When the last name is deleted, the file/directory is moved to the
.deleted/ directory and removed from there on last close.
This way, I can simply clean the .deleted/ directory at mount time
after a crash.
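
The life cycle of such an object can be sketched with a toy model in
Python. This is only an illustration under simplifying assumptions (one
name per inode, no real disk structures); ToyFS and its methods are
hypothetical and do not mirror the UFS code.

```python
class ToyFS:
    """Toy model of the .deleted/ handling; one name per inode."""

    def __init__(self):
        self.names = {}       # visible name -> inode number
        self.inodes = set()   # allocated inodes
        self.deleted = set()  # inodes parked in .deleted/
        self.open_count = {}  # inode number -> open references

    def create(self, name, ino):
        self.names[name] = ino
        self.inodes.add(ino)

    def open(self, name):
        ino = self.names[name]
        self.open_count[ino] = self.open_count.get(ino, 0) + 1
        return ino

    def unlink(self, name):
        ino = self.names.pop(name)
        if self.open_count.get(ino, 0) > 0:
            self.deleted.add(ino)     # still open: park in .deleted/
        else:
            self.inodes.discard(ino)  # not open: free immediately

    def close(self, ino):
        self.open_count[ino] -= 1
        if self.open_count[ino] == 0 and ino in self.deleted:
            self.deleted.discard(ino)  # last close: really remove it
            self.inodes.discard(ino)

    def mount_after_crash(self):
        # All open references are gone; empty .deleted/ instead of fsck.
        self.open_count.clear()
        self.inodes -= self.deleted
        self.deleted.clear()


fs = ToyFS()
fs.create("a", 1)
fs.open("a")
fs.unlink("a")  # inode 1 is now parked in .deleted/
parked = set(fs.deleted)
fs.mount_after_crash()
```

After the simulated crash and mount, no inode is left allocated, which
is the effect that would otherwise require fsck(8).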

Quick start:

	# gjournal label /dev/ad0
	# gjournal load
	# newfs /dev/ad0.journal
	# mount -o async,gjournal /dev/ad0.journal /mnt
	(yes, with gjournal 'async' is safe)

Now, after a power failure or system crash no fsck is needed (yay!).

There are two hacks in the current implementation which I'd like to
reimplement. The first is how the 'gjournal' mount option is
implemented. There is a garbage collector thread which is responsible
for deleting objects from the .deleted/ directory, and it uses full
paths. Because of this, when your mount point is /foo/bar/baz and you
rename 'bar' to something else, it will not work. This is not done
often, but it definitely should be fixed and I'm working on it. The
second hack is related to communication between gjournal and the file
system. GJournal decides when to make the switch and has to find the
file system which is mounted on it. Looking for this file system is not
nice and should be reimplemented.

There are some additional benefits which come with gjournal. For
example, if gjournal is configured on top of gmirror or graid3, then
even after a power failure or system crash there is no need to
synchronize the mirror/raid3 device, because the data will be
consistent.

I spent a lot of time working on gjournal optimization. Because I have
a few seconds before the data hit the data provider, I can do things
like combining smaller write requests into larger ones, ignoring data
written twice to the same place, etc.
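
A minimal Python sketch of these two optimizations, assuming writes are
plain (offset, data) pairs held in memory; coalesce() is a hypothetical
helper and works byte by byte for clarity, not for speed.

```python
def coalesce(writes):
    """Collapse (offset, data) writes, oldest first, into sorted,
    non-overlapping extents. Later writes win where ranges overlap."""
    image = {}
    for offset, data in writes:
        for i, byte in enumerate(data):
            image[offset + i] = byte  # newer data replaces older data
    extents = []
    for pos in sorted(image):
        if extents and extents[-1][0] + len(extents[-1][1]) == pos:
            extents[-1][1].append(image[pos])  # contiguous: extend
        else:
            extents.append((pos, bytearray([image[pos]])))
    return [(offset, bytes(buf)) for offset, buf in extents]


merged = coalesce([(0, b"aaaa"), (4, b"bbbb"), (2, b"XX"), (100, b"z")])
```

Here four write requests become two: the adjacent ranges are joined
into one larger write and the overwritten bytes are written only once.
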
Thanks to these optimizations, operations on small files are quite
fast. On the other hand, operations on large files are slower, because
I need to write the data twice and there is no room for optimization.
Here are some numbers.
gjournal(1) - the data provider and the journal provider on the same disk
gjournal(2) - the data provider and the journal provider on separate
	disks

Copying one large file:
UFS:		8s
UFS+SU:		8s
gjournal(1):	16s
gjournal(2):	14s

Copying eight large files in parallel:
UFS:		120s
UFS+SU:		120s
gjournal(1):	184s
gjournal(2):	165s

Untarring eight src.tgz in parallel:
UFS:		791s
UFS+SU:		650s
gjournal(1):	333s
gjournal(2):	309s

Reading. grep -r on two src/ directories in parallel:
UFS:		84s
UFS+SU:		138s
gjournal(1):	102s
gjournal(2):	89s

As you can see, even on one disk, untarring eight src.tgz is two times
faster than with UFS+SU. I've no idea why gjournal is faster at reading
than UFS+SU.
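
The ratios behind these statements can be recomputed from the tables
above; the small helper speedup() below is hypothetical, and the times
are copied from the untar and single large file results.

```python
# Wall-clock times in seconds, taken from the tables above.
untar = {"UFS": 791, "UFS+SU": 650, "gjournal(1)": 333, "gjournal(2)": 309}
large_file = {"UFS": 8, "UFS+SU": 8, "gjournal(1)": 16, "gjournal(2)": 14}


def speedup(table, baseline, candidate):
    """Return how many times faster the candidate is than the baseline."""
    return table[baseline] / table[candidate]


untar_gain = speedup(untar, "UFS+SU", "gjournal(1)")       # about 1.95
large_gain = speedup(large_file, "UFS+SU", "gjournal(1)")  # exactly 0.5
```

A value above 1 means gjournal was faster; the 0.5 for one large file
reflects the data being written twice.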

There is a bunch of sysctls to tune gjournal (the kern.geom.journal
tree).

When only one provider is given for both data and journal, the journal
part is placed at the end of the provider, so one can still use the
file system without journaling. If you use such a configuration (one
disk), it is better for performance to place the journal before the
data, so you may want to create two partitions (e.g. 2GB for ad0a and
the rest for ad0d) and create gjournal this way:

	# gjournal label ad0d ad0a

Enjoy!

The work was sponsored by home.pl (http://home.pl).

The work was made by Wheel LTD (http://www.wheel.pl).
The work was tested in the netperf cluster.

I want to thank Alexander Kabaev (kan@) for the help with VFS and
Mike Tancsa for test hardware.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!