Journaling UFS with gjournal.

Mon Jun 19 18:32:28 UTC 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pawel Jakub Dawidek wrote:
> Hello.
> 
> For the last few months I have been working on the gjournal project.
> To avoid confusion up front: this project is not related to the
> gjournal project that Ivan Voras worked on during the last SoC (2005).
> 
> The lack of a journaled file system has been FreeBSD's Achilles' heel
> for many years. We do have many file systems, but none with journaling:
> - ext2fs (journaling is in ext3fs),
> - XFS (read-only),
> - ReiserFS (read-only),
> - HFS+ (read-write, but without journaling),
> - NTFS (read-only).
> 
> GJournal was designed to journal GEOM providers, so it actually works
> below the file system layer, but it has hooks that let it cooperate
> with file systems. In other words, gjournal is not file
> system-dependent; it can probably work with any file system, given
> minimal knowledge about it. I implemented only UFS support.
> 
> The patches are here:
> 
> 	http://people.freebsd.org/~pjd/patches/gjournal.patch (for HEAD)
> 	http://people.freebsd.org/~pjd/patches/gjournal6.patch (for RELENG_6)
> 
> To patch your sources you need to:
> 
> 	# cd /usr/src
> 	# mkdir sbin/geom/class/journal sys/geom/journal sys/modules/geom/geom_journal
> 	# patch < /path/to/gjournal.patch
> 
> Add 'options UFS_GJOURNAL' to your kernel configuration file and
> rebuild the kernel and world.
> 
> How it works (in short). You may define one or two providers for
> gjournal to use. If one provider is given, it is used for both data
> and journal. If two providers are given, one is used for data and
> the other for the journal.
> Every few seconds (you may define how many) the journal is terminated
> and marked as consistent, and gjournal starts to copy data from it to
> the data provider. At the same time, new data is stored in a new
> journal. Let's call the moment at which the journal is terminated a
> "journal switch". A journal switch looks as follows:
> 1. Start a journal switch when the timeout expires or when we run out
>    of cache. Don't perform a journal switch if there were no write
>    requests.
> 2. If there is a file system, synchronize it.
> 3. Mark the file system as clean.
> 4. Block all write requests to the file system.
> 5. Terminate the journal.
> 6. If necessary, wait until copying of the previous journal has
>    finished.
> 7. Send a BIO_FLUSH request (if the given provider supports it).
> 8. Mark the new journal position on the journal provider.
> 9. Unblock write requests.
> 10. Start copying data from the terminated journal to the data provider.
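
The switch steps above can be sketched as a toy in-memory model. This is
only an illustration in Python, not the kernel code: the class name
JournalSim, its fields, and the log strings are all made up for this
example, and steps that touch the file system or hardware are reduced to
log entries.

```python
from dataclasses import dataclass, field


@dataclass
class JournalSim:
    """Toy model of the journal switch (hypothetical, userland only)."""
    data: dict = field(default_factory=dict)    # data provider: offset -> bytes
    active: list = field(default_factory=list)  # journal currently written to
    closed: list = field(default_factory=list)  # terminated, consistent journal
    log: list = field(default_factory=list)     # trace of the steps taken

    def write(self, offset, payload):
        # New data always lands in the active journal first.
        self.active.append((offset, payload))

    def copy_closed(self):
        # Step 10 (and the wait in step 6): replay the terminated journal
        # onto the data provider, in the order the writes arrived.
        for offset, payload in self.closed:
            self.data[offset] = payload
        self.closed = []

    def switch(self):
        # Step 1: no switch if there were no write requests.
        if not self.active:
            return False
        self.log.append("sync and mark clean")      # steps 2-3
        self.log.append("block writes")             # step 4
        terminated, self.active = self.active, []   # step 5
        self.copy_closed()                          # step 6
        self.log.append("flush")                    # step 7
        self.closed = terminated                    # step 8
        self.log.append("unblock writes")           # step 9
        return True


j = JournalSim()
j.write(0, b"old")
j.write(0, b"new")
switched = j.switch()   # terminates the journal holding both writes
j.copy_closed()         # step 10, done synchronously in this model
```

In this model the copy is a plain function call; in the description
above it runs in the background while new writes go to the new journal.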
> 
> There were a few things I needed to implement outside gjournal to make
> it work reliably:
> 
> - The BIO_FLUSH request. Currently we have three I/O requests: BIO_READ,
> BIO_WRITE and BIO_DELETE. I added BIO_FLUSH, which means "flush your
> write cache". The request is always sent with the biggest bio_offset
> set (the mediasize of the destination provider), so it works properly
> with bioq_disksort(). The caller needs to stop further I/O requests
> until BIO_FLUSH returns, so we don't get a starvation effect.
> The hard part is that it has to be implemented in every disk driver,
> because flushing the cache is a driver-dependent operation. I
> implemented it for ata(4) disks and amr(4). The good news is that it's
> easy.
> GJournal can also work with providers that don't support BIO_FLUSH, and
> in my power-failure tests it worked well (no problems), but that depends
> on the gjournal cache being bigger than the controller cache, so it is
> hard to call it reliable.
> The documentation for many journaled file systems says that you should
> turn off the write cache if you want to use them. This is not the case
> for gjournal (especially when your disk driver supports BIO_FLUSH).
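
To illustrate why the flush carries the biggest offset, here is a
simplified Python model. It assumes the queue is ordered by plain
ascending offset, which is only a stand-in for bioq_disksort() (the real
function is more involved). The Bio tuple and both helper functions are
hypothetical names for this sketch.

```python
from typing import NamedTuple


class Bio(NamedTuple):
    cmd: str      # "BIO_READ", "BIO_WRITE", "BIO_DELETE" or "BIO_FLUSH"
    offset: int   # byte offset on the provider


def make_flush(mediasize):
    # The flush is tagged with the provider's mediasize, which is larger
    # than the offset of any valid read or write on that provider.
    return Bio("BIO_FLUSH", mediasize)


def sort_queue(requests):
    # Assumption: ascending-offset ordering as a stand-in for the
    # driver's queue sort.
    return sorted(requests, key=lambda bio: bio.offset)


mediasize = 1 << 20
queue = [
    Bio("BIO_WRITE", 4096),
    make_flush(mediasize),
    Bio("BIO_WRITE", 0),
    Bio("BIO_WRITE", 512),
]
ordered = sort_queue(queue)
```

Under that assumption the flush ends up behind every write that was
queued with it, regardless of where it was inserted.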
> 
> - The 'gjournal' mount option. To implement gjournal support in UFS I
> needed to change how deleted but still open objects are handled.
> Currently, when a file or directory is open and we delete the last
> name that references it, it remains usable by those who keep it
> open. When the last consumer closes it, the inode and blocks are freed.
> On a journal switch I cannot leave such objects around, because after
> a crash fsck(8) is not used to check the file system, so the inode and
> blocks would never be freed. When a file system is mounted with the
> 'gjournal' mount option, such objects are not removed while they are
> open. When the last name is deleted, the file/directory is moved to
> the .deleted/ directory and removed from there on last close.
> This way, I can just clean the .deleted/ directory at mount time after
> a crash.
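
The .deleted/ idea can be mimicked in userland with ordinary directories.
The Python sketch below is an emulation only and says nothing about how
UFS implements it: the function names are invented, it handles plain
files only, and it assumes parked names never collide.

```python
import os
import tempfile

DELETED = ".deleted"


def unlink_while_open(root, name):
    """Last name removed while still open: park the file in .deleted/."""
    parked_dir = os.path.join(root, DELETED)
    os.makedirs(parked_dir, exist_ok=True)
    os.rename(os.path.join(root, name), os.path.join(parked_dir, name))


def last_close(root, name):
    """Last consumer closed the file: now it is really removed."""
    os.remove(os.path.join(root, DELETED, name))


def cleanup_at_mount(root):
    """After a crash, drop whatever is still parked in .deleted/."""
    parked_dir = os.path.join(root, DELETED)
    if not os.path.isdir(parked_dir):
        return []
    removed = sorted(os.listdir(parked_dir))
    for entry in removed:
        os.remove(os.path.join(parked_dir, entry))
    return removed


def demo():
    with tempfile.TemporaryDirectory() as root:
        for name in ("a.txt", "b.txt"):
            with open(os.path.join(root, name), "w") as fh:
                fh.write("x")
        unlink_while_open(root, "a.txt")
        unlink_while_open(root, "b.txt")
        last_close(root, "a.txt")       # normal path: closed before a crash
        # Pretend the system crashed here with b.txt still open.
        return cleanup_at_mount(root)
```

Calling demo() returns the names that the mount-time cleanup had to
remove, which here is only the file that was never closed.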
> 
> Quick start:
> 
> 	# gjournal label /dev/ad0
> 	# gjournal load
> 	# newfs /dev/ad0.journal
> 	# mount -o async,gjournal /dev/ad0.journal /mnt
> 	(yes, with gjournal 'async' is safe)
> 
> Now, after a power failure or system crash no fsck is needed (yay!).
> 
> There are two hacks in the current implementation, which I'd like to
> reimplement. The first is how the 'gjournal' mount option is
> implemented. There is a garbage collector thread which is responsible
> for deleting objects from the .deleted/ directory, and it uses full
> paths. Because of this, when your mount point is /foo/bar/baz and you
> rename 'bar' to something else, it will not work. This is not done
> often, but it definitely should be fixed and I'm working on it. The
> second hack is related to communication between gjournal and the file
> system. GJournal decides when to make the switch and has to find the
> file system which is mounted on it. Looking for this file system is
> not nice and should be reimplemented.
> 
> There are some additional benefits that come with gjournal. For example,
> if gjournal is configured over gmirror or graid3, there is no need to
> synchronize the mirror/raid3 device even after a power failure or
> system crash, because the data will be consistent.
> 
> I spent a lot of time working on gjournal optimization. Because I have
> a few seconds before the data hits the data provider, I can do things
> like combining smaller write requests into larger ones, ignoring data
> written twice to the same place, etc.
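
A minimal sketch of those two optimizations, assuming writes are just
(offset, bytes) pairs in arrival order. The function coalesce() is a
hypothetical helper written for this example; it works byte by byte for
clarity and is not how gjournal does it internally.

```python
def coalesce(writes):
    """Collapse journaled writes: later data wins, adjacent ranges merge."""
    latest = {}
    for offset, payload in writes:
        for i, byte in enumerate(payload):
            # A later write to the same place replaces the older data,
            # so the older data never has to reach the data provider.
            latest[offset + i] = byte

    merged = []
    for pos in sorted(latest):
        if merged and merged[-1][0] + len(merged[-1][1]) == pos:
            merged[-1][1].append(latest[pos])   # extend the current run
        else:
            merged.append([pos, bytearray([latest[pos]])])
    return [(start, bytes(buf)) for start, buf in merged]


result = coalesce([(0, b"aaaa"), (4, b"bbbb"), (0, b"cc"), (100, b"z")])
# result == [(0, b"ccaabbbb"), (100, b"z")]
```

Four journaled writes turn into two requests to the data provider, and
the two bytes that were overwritten are written only once.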
> Because of this, operations on small files are quite fast. On the other
> hand, operations on large files are slower, because I need to write the
> data twice and there is no room for optimization. Here are some numbers.
> gjournal(1) - the data provider and the journal provider on the same disk
> gjournal(2) - the data provider and the journal provider on separate
> 	disks
> 
> Copying one large file:
> UFS:		8s
> UFS+SU:		8s
> gjournal(1):	16s
> gjournal(2):	14s
> 
> Copying eight large files in parallel:
> UFS:		120s
> UFS+SU:		120s
> gjournal(1):	184s
> gjournal(2):	165s
> 
> Untarring eight src.tgz in parallel:
> UFS:		791s
> UFS+SU:		650s
> gjournal(1):	333s
> gjournal(2):	309s
> 
> Reading. grep -r on two src/ directories in parallel:
> UFS:		84s
> UFS+SU:		138s
> gjournal(1):	102s
> gjournal(2):	89s
> 
> As you can see, even on one disk, untarring eight src.tgz is about two
> times faster than with UFS+SU. I've no idea why gjournal is faster than
> UFS+SU at reading.
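
For reference, the ratios behind that summary can be computed directly
from the timings quoted above. The dictionaries just restate the posted
numbers; speedup() is a throwaway helper for this example.

```python
# Wall-clock seconds, copied from the tables above.
LARGE_FILE = {"UFS": 8, "UFS+SU": 8, "gjournal(1)": 16, "gjournal(2)": 14}
UNTAR = {"UFS": 791, "UFS+SU": 650, "gjournal(1)": 333, "gjournal(2)": 309}
GREP = {"UFS": 84, "UFS+SU": 138, "gjournal(1)": 102, "gjournal(2)": 89}


def speedup(times, baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return times[baseline] / times[candidate]


untar_one_disk = speedup(UNTAR, "UFS+SU", "gjournal(1)")       # about 1.95
large_one_disk = speedup(LARGE_FILE, "UFS", "gjournal(1)")     # exactly 0.5
grep_vs_su = speedup(GREP, "UFS+SU", "gjournal(2)")            # about 1.55
grep_vs_ufs = speedup(GREP, "UFS", "gjournal(2)")              # below 1.0
```

So the read result is faster than UFS+SU but still slightly slower than
plain UFS, and the single large file copy takes twice as long as on UFS.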
> 
> There are a bunch of sysctls to tune gjournal (kern.geom.journal tree).
> 
> When only one provider is given for both data and journal, the journal
> part is placed at the end of the provider, so one can still use the
> file system without journaling. If you use such a configuration (one
> disk), it is better for performance to place the journal before the
> data, so you may want to create two partitions (e.g. 2GB for ad0a and
> the rest for ad0d) and create gjournal this way:
> 
> 	# gjournal label ad0d ad0a
> 
> Enjoy!
> 
> The work was sponsored by home.pl (http://home.pl).
> 
> The work was done by Wheel LTD (http://www.wheel.pl).
> The work was tested in the netperf cluster.
> 
> I want to thank Alexander Kabaev (kan@) for help with VFS and
> Mike Tancsa for the test hardware.
> 

Wow, this looks pretty cool!

I wonder if it's possible to use gjournal on an existing file system
with the journal on a vnode/(swap?) backed md(4) device?
(I want to test on an existing installation without free unpartitioned
space.)

And if it is possible, how can I do this for the root file system? I'll
need the md(4) device before the root fs is mounted, which seems
hard/impossible?
What's going to happen if my root mount is gjournal labeled and has the
gjournal option in fstab, but at boot time the journal GEOM provider
does not exist?

Thanks for the great work!
When finished, this will certainly make FreeBSD much more competitive :)

- --niki
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFElu2yHNAJ/fLbfrkRAsVBAKChRFMVLuivXYR1NM3b0u9iVe72uwCfdzH0
DvdjEZwOKjuZu4UV+toVpwo=
=+qj/
-----END PGP SIGNATURE-----