JUFS update, and questions.

Wed Mar 10 15:21:58 PST 2004

Journaled UFS Technology Description

As many are aware, we have been keenly interested in Journaling for the
UFS filesystem.  This message is intended to bring people up to date on
the design decisions that we have made and on our progress, and to
solicit help for problems that we are facing.

In the design of this system we consulted many different implementations
of journaled filesystems, including ext3fs, reiser, XFS, and JFS.  We
also received an incomplete but highly functional journaled UFS
implementation.  From these we have attempted to construct a
"best-of-breed" solution.

From our review we selected methods based on those used by JFS and XFS
due to their relative simplicity and performance and similarity to the
journaled UFS implementation that we have.  A brief description of this
is as follows:

On disk, in the root of each filesystem, there is a file called
.journal.  Upon r/w mount this file is verified to have the following
characteristics:
1) mode: -r--------
2) user: 0, group: 0
3) flags: noschg
4) all blocks allocated (no sparse blocks, no frags)  (1)
5) That the journal is empty, and that the first entry is a checkpoint.
   (empty meaning NOT null, but there are no operations that need to 
    be committed)

The system then saves the vnode/inode reference to this file and has a 
hook in chflags that prevents modification to that vnode/inode during
operation.  The code prevents r/w mounting of the filesystem unless the
above conditions are met.

The format of the journal is roughly as follows:

Each block (FS blocksize) has this format.

Block {
	Header {
		Magic Number
		Version
		Transaction ID of this block
		Last transaction ID committed
		Length of Header
		# of transactions in block
		Options Field
		Checksum
	}
	Transaction {
		Opcode
		operand
	} (repeat for number in header)
}
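
As a rough illustration of the block header above, here is a minimal
sketch in C.  The field widths, the magic value, the jufs_/jh_ names and
the rotate-and-XOR checksum are all assumptions made for this example;
the actual on-disk widths and checksum algorithm are not specified
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JUFS_MAGIC   0x4a554653u	/* "JUFS"; hypothetical value */
#define JUFS_VERSION 1u			/* hypothetical value */

/* Hypothetical fixed-width layout of the per-block header. */
struct jufs_block_hdr {
	uint32_t jh_magic;	/* Magic Number */
	uint32_t jh_version;	/* Version */
	uint64_t jh_tid;	/* Transaction ID of this block */
	uint64_t jh_last_tid;	/* Last transaction ID committed */
	uint32_t jh_hdrlen;	/* Length of Header, in bytes */
	uint32_t jh_count;	/* # of transactions in block */
	uint32_t jh_options;	/* Options Field */
	uint32_t jh_cksum;	/* Checksum (assumed to cover the block) */
};

/*
 * Placeholder checksum: rotate-and-XOR over every byte of the block
 * except the checksum field itself.  Not the real algorithm.
 */
static uint32_t
jufs_block_cksum(const uint8_t *blk, size_t blksize)
{
	const size_t off = offsetof(struct jufs_block_hdr, jh_cksum);
	uint32_t sum = 0;
	size_t i;

	assert(blksize >= sizeof(struct jufs_block_hdr));
	for (i = 0; i < blksize; i++) {
		if (i >= off && i < off + sizeof(uint32_t))
			continue;	/* skip the checksum field */
		sum = ((sum << 1) | (sum >> 31)) ^ blk[i];
	}
	return (sum);
}

/* Returns 1 if the block header looks sane and the checksum matches. */
static int
jufs_block_valid(const uint8_t *blk, size_t blksize)
{
	struct jufs_block_hdr h;

	if (blksize < sizeof(h))
		return (0);
	memcpy(&h, blk, sizeof(h));
	if (h.jh_magic != JUFS_MAGIC || h.jh_version != JUFS_VERSION)
		return (0);
	if (h.jh_hdrlen < sizeof(h) || h.jh_hdrlen > blksize)
		return (0);
	return (h.jh_cksum == jufs_block_cksum(blk, blksize));
}
```

Excluding the checksum field from its own computation lets the writer
fill in everything else first and store the checksum last.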

In addition to this on-disk representation the system maintains an
in-core journal.  The in-core provides a buffer mechanism such that each
operation does not force a sync write.  The format of the in-core is
roughly as follows:

journal {
	current transaction ID
	Last Transaction committed ID
	first in-core entry
	last in-core entry
	first on-disk entry
	last on-disk entry
	mutex-pointer
	buffer
}

Every operation then has its information placed in the buffer.  When the
buffer becomes full it is flushed to disk; when the on-disk journal is
full it is read back and committed.  During periods of light disk I/O a
heartbeat kernel process will periodically force commits of all
buffered data, on disk and in core.
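
The buffering behaviour can be sketched as follows.  This is a
userland toy under stated assumptions: the structure is trimmed to the
fields needed here, the buffer is deliberately tiny, locking is omitted,
and jufs_flush() only counts flushes where a real implementation would
write to the .journal file.  All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JBUF_SIZE 64	/* deliberately tiny, for illustration only */

/* Trimmed-down, hypothetical version of the in-core journal. */
struct jufs_incore {
	uint64_t ji_tid;		/* current transaction ID */
	uint64_t ji_last_tid;		/* last transaction committed */
	size_t   ji_used;		/* bytes of ji_buf in use */
	unsigned ji_flushes;		/* flush count, for illustration */
	uint8_t  ji_buf[JBUF_SIZE];	/* buffered journal records */
};

/* Stand-in for the synchronous write of the buffer to the journal. */
static void
jufs_flush(struct jufs_incore *j)
{
	if (j->ji_used == 0)
		return;
	/* A real implementation would write ji_buf to disk here. */
	j->ji_used = 0;
	j->ji_flushes++;
}

/* Buffer one record, flushing first if it would not fit. */
static void
jufs_append(struct jufs_incore *j, const void *rec, size_t len)
{
	assert(len <= JBUF_SIZE);
	if (j->ji_used + len > JBUF_SIZE)
		jufs_flush(j);
	memcpy(j->ji_buf + j->ji_used, rec, len);
	j->ji_used += len;
}
```

In this sketch the heartbeat process would simply call jufs_flush()
when the disk is idle, so a flush on an empty buffer is a no-op.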

One of the opcodes defined is the NOP.  Its format is:

Opcode  Operand  
0x0000  length(16bit), data(arbitrary)

Aside from debugging, this is used as a checkpoint function.  After a
commit the journal will write a blank journal entry out stating that
this is transaction "N", and transaction "N-1" was the last committed.
This is also done on umount.
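
For illustration, a NOP record in the format above could be encoded
like this.  The 16-bit opcode width and the little-endian byte order are
assumptions made for the example, as is the function name; only the
opcode value 0x0000 and the 16-bit length come from the format above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define JUFS_OP_NOP 0x0000u

/*
 * Encode a NOP record: opcode, 16-bit length, then the data bytes.
 * Returns the number of bytes written, or 0 if the record would not
 * fit in the output buffer.
 */
static size_t
jufs_encode_nop(uint8_t *out, size_t outlen, const void *data,
    uint16_t len)
{
	size_t need = 4 + (size_t)len;

	assert(out != NULL);
	if (need > outlen)
		return (0);
	out[0] = JUFS_OP_NOP & 0xff;		/* opcode, low byte */
	out[1] = (JUFS_OP_NOP >> 8) & 0xff;	/* opcode, high byte */
	out[2] = len & 0xff;			/* length, low byte */
	out[3] = (len >> 8) & 0xff;		/* length, high byte */
	if (len != 0)
		memcpy(out + 4, data, len);
	return (need);
}
```

A checkpoint entry would then be a block whose only record is a NOP
with length 0, the "N" and "N-1" values living in the block header.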

Journaling will be a mount option, and has so far been defined as
MNT_JOURNAL 0x00800000 (2).  This flag will trigger the checks mentioned
at the beginning.  The kernel will _not_ replay the journal in the event
of an unclean mount; this will be handled by fsck, for at least the
following situation:

   Handle moving between the journaled and non-journaled options, due to
   either (lack of) specifying mount flags, or different compiled 
   options.
   For example: 
       Admin mounts /usr/home with "-o journal", system crashes, system 
       comes back up and /etc/fstab has not been updated to include the 
       "journal" flag, admin later realizes this and remounts /usr/home 
       with the appropriate flag.  If fsck did not handle the journal
       syncing then the FS would be "repaired" by fsck on the reboot
       after the crash,  and the kernel would then attempt to re-repair
       the data from the journal log and be referencing a potentially
       MUCH older version of the filesystem database.

(3)

fsck will also ensure that the journal file meets the requirements
listed; specifically, it will update the journal file itself to include
the checkpoint if needed.  fsck's operation in brief will be as follows:
1) scan the journal file for the highest numbered transaction ID
2) Read in number of the last completed transaction from that block
3) Rescan the journal for the lowest transaction ID after that one.
4) begin replaying in order until highest transaction ID is reached.
5) write the checkpoint transaction and mark the filesystem clean.
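
Steps 1 through 4 can be sketched as a small planning function over
the block headers read from the journal.  This is only a sketch: the
struct and function names are hypothetical, TID wraparound is ignored,
and the actual replay of opcodes is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two header fields fsck needs from each journal block. */
struct jblk {
	uint64_t tid;		/* Transaction ID of this block */
	uint64_t last_tid;	/* Last transaction ID committed */
};

/*
 * Fill order[] with the indices of the blocks to replay, sorted by
 * ascending TID, and return how many there are.  order[] must have
 * room for n entries.
 */
static size_t
jufs_replay_plan(const struct jblk *blks, size_t n, size_t *order)
{
	size_t i, j, hi = 0, cnt = 0;
	uint64_t last;

	assert(blks != NULL && order != NULL);
	if (n == 0)
		return (0);
	/* Step 1: find the block with the highest TID. */
	for (i = 1; i < n; i++)
		if (blks[i].tid > blks[hi].tid)
			hi = i;
	/* Step 2: read the last committed TID from that block. */
	last = blks[hi].last_tid;
	/* Steps 3-4: collect every later block, kept sorted by TID. */
	for (i = 0; i < n; i++) {
		if (blks[i].tid <= last)
			continue;
		for (j = cnt; j > 0 &&
		    blks[order[j - 1]].tid > blks[i].tid; j--)
			order[j] = order[j - 1];
		order[j] = i;
		cnt++;
	}
	return (cnt);
}
```

On a clean journal the plan contains only the checkpoint block itself,
which holds a NOP and is therefore harmless to replay.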

Unmounting of the filesystem will include a full commit of the journal 
(in-core and on-disk), and a write of the checkpoint opcode to the first
journal block.

Given the nature of what we are doing (and how), it is incompatible to
mount a filesystem both journaled and softdep-ed.  Our code will prevent
an admin/user from trying to do both at once with a deny message; it
will not just silently fail.
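
A minimal sketch of such a check follows.  The JMNT_ names are local
stand-ins so that the example is self-contained: JMNT_JOURNAL uses the
value proposed above, and the JMNT_SOFTDEP value is an assumption that
should be checked against sys/mount.h.  The choice of EINVAL and the
message text are likewise only illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define JMNT_JOURNAL 0x00800000u	/* value proposed in this message */
#define JMNT_SOFTDEP 0x00200000u	/* assumed; check sys/mount.h */

/*
 * Reject a mount request that asks for both journaling and soft
 * updates.  On refusal, *why points at a message for the admin.
 */
static int
jufs_check_mntflags(unsigned int flags, const char **why)
{
	assert(why != NULL);
	*why = NULL;
	if ((flags & JMNT_JOURNAL) != 0 && (flags & JMNT_SOFTDEP) != 0) {
		*why = "journaling and soft updates are mutually exclusive";
		return (EINVAL);
	}
	return (0);
}
```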

Issues that we are having now include how and when to increment the
transaction ID.  The transaction IDs are used to group operations
together such that related operations are completed together, and to
guarantee replay-safeness.  For example a rename(2) is a combination of
a link and an unlink.  So it works something like this:

TID=5
rename(2) call made
TID++
 link   (opcode tagged with TID 6)
 unlink (opcode tagged with TID 6)
TID=6

Later, when this is flushed to disk the system will make sure that all
opcodes with the same TID are written together, and not split across
blocks.  The TID in the header of the block will be the TID of the last
opcode in that block.  The block then becomes a super-transaction of all
of them (potentially thousands of smaller transactions).
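
The "do not split a TID across blocks" rule can be sketched as
follows.  Here the buffered opcodes are represented only by their TIDs,
in order, and block capacity is counted in opcodes rather than bytes;
both are simplifications for the example, and the function name is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decide how many of the n buffered opcodes go into the next block.
 * Only whole TID groups are taken.  Returns the opcode count for the
 * block and stores the TID of the last opcode taken in *hdr_tid.
 * Returns 0 if not even the first group fits.
 */
static size_t
jufs_fill_block(const uint64_t *tids, size_t n, size_t cap,
    uint64_t *hdr_tid)
{
	size_t i = 0, end = 0;

	assert(tids != NULL && hdr_tid != NULL);
	while (i < n) {
		size_t g = i + 1;

		/* Advance g to one past the end of this TID group. */
		while (g < n && tids[g] == tids[i])
			g++;
		if (g > cap)
			break;		/* group would overflow the block */
		end = g;
		i = g;
	}
	if (end > 0)
		*hdr_tid = tids[end - 1];
	return (end);
}
```

A return value of 0 means a single transaction is larger than a block,
a case this sketch does not attempt to solve.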

An unlink would be similar to this (assuming no processes holding the
file open, and a link count of 1)

TID=6
unlink(2) call made
TID++
 unlink
 inode update (link_cnt--)
 inode update (free)
 truncate
TID=7

Assuming a flush to disk now would have the following:

Header { TID = 7 , count=6, lastTID=5 }
opcodes {
 link
 unlink   <--- these were the rename(2)
 unlink
 inode update
 inode update
 truncate
}

This block could then be safely replayed multiple times.  (Think of a
crash where this had been committed but the checkpoint not written;
fsck would then replay it, since it could not know that it was already
done.)

These examples are relatively easy.  What we are running into problems
with is things that bypass the vfs layer.  An example is mmapping of a
sparse file: a write access to the middle of the file could trigger a
large number of updates, including inode changes, direct block
allocations, indirect block allocations, and fragment promotions.  In
this situation, and in our model, how and where would we increment the
transaction ID?

Notes:
(1) I do not know how to actually do this within the kernel, pointers
    here would be appreciated.

(2) This currently conflicts with MNT_IGNORE.  Is this a problem?
    What should we use?

(3) There is another problem here: files that were held open when the
    system crashed.  They could have a reference count of zero, but
    still have allocated data.  It seems that an fsck would still be
    required to walk the inode tables and put these files "somewhere",
    or just free the blocks they were using.  Can anyone think of a
    better way to do this?

-- 
David E. Cross