JUFS update, and questions.
David E. Cross
crossd at cs.rpi.edu
Wed Mar 10 15:21:58 PST 2004
Journaled UFS Technology Description
As many are aware, we have been keenly interested in journaling for the
UFS filesystem. This message is intended to bring people up to date on
the design decisions we have made and our progress, and to solicit help
with the problems we are facing.
In the design of this system we consulted many different
implementations of journaled filesystems, including ext3fs, reiser,
XFS, and JFS. We also received an incomplete but highly functional
journaled UFS implementation. From these we have attempted to construct
a "best-of-breed" solution.
From our review we selected methods based on those used by JFS and
XFS, due to their relative simplicity and performance, and to their
similarity to the journaled UFS implementation that we have. A brief
description of this is as follows:
There exists on disk, in the root of each filesystem, a file called
.journal
Upon r/w mount this file is verified to have the following
characteristics:
1) mode: -r--------
2) user: 0, group: 0
3) flags: noschg
4) all blocks allocated (no sparse blocks, no frags) (1)
5) That the journal is empty, and that the first entry is a checkpoint
("empty" meaning not that the file is null, but that there are no
operations that need to be committed)
The system then saves the vnode/inode reference to this file and has a
hook in chflags that prevents modification to that vnode/inode during
operation. The code prevents r/w mounting of the filesystem unless the
above conditions are met.
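As a rough illustration, checks 1 and 2 above could be expressed as a
predicate like the following. This is only a userland sketch with
hypothetical names; the real check would run against the vnode's
attributes inside the kernel, and the flag and block-allocation checks
are omitted here.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical sketch: verify the .journal attribute checks 1-2
 * (mode -r--------, owner uid 0 / gid 0).  Returns nonzero if the
 * attributes are acceptable.
 */
int
journal_attrs_ok(mode_t mode, uid_t uid, gid_t gid)
{
	/* mode must be exactly 0400 (-r--------) */
	return ((mode & 07777) == S_IRUSR && uid == 0 && gid == 0);
}
```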
The format of the journal is roughly as follows:
Each block (FS blocksize) has this format.
Block {
Header {
Magic Number
Version
Transaction ID of this block
Last transaction ID committed
Length of Header
# of transactions in block
Options Field
Checksum
}
Transaction {
Opcode
operand
} (repeat for number in header)
}
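In C, the on-disk layout above might be declared along these lines.
The field widths, names, and the fixed-width types are assumptions for
illustration; the actual JUFS code may lay this out differently.

```c
#include <stdint.h>

/* Hypothetical on-disk journal block header, per the description above. */
struct jufs_blkhdr {
	uint32_t jh_magic;	/* identifies a journal block */
	uint32_t jh_version;
	uint64_t jh_tid;	/* transaction ID of this block */
	uint64_t jh_lasttid;	/* last transaction ID committed */
	uint16_t jh_hdrlen;	/* length of this header */
	uint16_t jh_ntxn;	/* # of transaction records in block */
	uint32_t jh_options;
	uint32_t jh_cksum;
};

/* One transaction record; jt_oplen bytes of operand follow it. */
struct jufs_txn {
	uint16_t jt_opcode;
	uint16_t jt_oplen;
};
```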
In addition to this on-disk representation the system maintains an
in-core journal. The in-core provides a buffer mechanism such that each
operation does not force a sync write. The format of the in-core is
roughly as follows:
journal {
current transaction ID
Last Transaction committed ID
first in-core entry
last in-core entry
first on-disk entry
last on-disk entry
mutex-pointer
buffer
}
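The in-core journal described above could be sketched as a C struct
plus an append helper. Again, the names, the buffer size, and the
void pointer standing in for a kernel mutex are assumptions, not the
actual implementation.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-core journal state, per the description above. */
struct jufs_journal {
	uint64_t j_curtid;	/* current transaction ID */
	uint64_t j_committid;	/* last transaction ID committed */
	uint32_t j_core_first;	/* first in-core entry */
	uint32_t j_core_last;	/* last in-core entry */
	uint32_t j_disk_first;	/* first on-disk entry */
	uint32_t j_disk_last;	/* last on-disk entry */
	void	*j_mtx;		/* would be a struct mtx * in the kernel */
	uint8_t	 j_buf[16384];	/* staging buffer, flushed when full */
	size_t	 j_buflen;	/* bytes currently buffered */
};

/*
 * Append a record to the in-core buffer so each operation avoids a
 * synchronous write.  Returns 0 on success, -1 when the buffer is
 * full and the caller must flush to disk first.
 */
int
jufs_append(struct jufs_journal *jp, const void *rec, size_t len)
{
	if (jp->j_buflen + len > sizeof(jp->j_buf))
		return (-1);
	memcpy(jp->j_buf + jp->j_buflen, rec, len);
	jp->j_buflen += len;
	return (0);
}
```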
Every operation then has its information placed in the buffer; when
the buffer becomes full it is flushed to disk, and when the on-disk
journal is full it is read back and committed. Periodically, during
periods of light disk IO, a heartbeat kernel process will force
commits of all buffered data, both on disk and in core.
One of the opcodes defined is the NOP. Its format is:
Opcode Operand
0x0000 length(16bit), data(arbitrary)
Aside from debugging, this is used as a checkpoint function: after a
commit, the journal will write a blank journal entry out stating that
this is transaction "N" and that transaction "N-1" was the last
committed. This is also done on umount.
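Serializing the NOP record described above might look like this. The
little-endian byte order and the helper name are assumptions for the
sake of a concrete example.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical sketch: build a NOP record (opcode 0x0000, a 16-bit
 * length, then arbitrary data), assuming little-endian encoding.
 * Returns the total record length written into buf.
 */
size_t
nop_record(uint8_t *buf, const uint8_t *data, uint16_t len)
{
	buf[0] = 0x00;			/* opcode 0x0000 ... */
	buf[1] = 0x00;
	buf[2] = len & 0xff;		/* ... then 16-bit length ... */
	buf[3] = len >> 8;
	for (uint16_t i = 0; i < len; i++)
		buf[4 + i] = data[i];	/* ... then the data bytes */
	return (4 + (size_t)len);
}
```

A checkpoint would be this record written with the block header's TID
fields stating "this is N, N-1 was the last committed".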
Journaling will be a mount option, so far defined as MNT_JOURNAL
0x00800000 (2); this flag will trigger the checks mentioned at the
beginning. The kernel will _not_ replay the journal in the event of an
unclean mount; this will be handled by fsck, for at least the
following situation:
Moving between the journaled and non-journaled options, due either
to (un)specified mount flags or to different compile options.
For example:
Admin mounts /usr/home with "-o journal", system crashes, system
comes back up and /etc/fstab has not been updated to include the
"journal" flag, admin later realizes this and remounts /usr/home
with the appropriate flag. If fsck did not handle the journal
syncing then the FS would be "repaired" by fsck on the reboot
after the crash, and the kernel would then attempt to re-repair
the data from the journal log and be referencing a potentially
MUCH older version of the filesystem database.
(3)
fsck will also ensure that the journal file meets the requirements
listed; specifically, it will update the journal file itself to
include the checkpoint if needed. fsck's operation, in brief, will be
as follows:
1) Scan the journal file for the highest-numbered transaction ID.
2) Read the number of the last completed transaction from that block.
3) Rescan the journal for the lowest transaction ID after that one.
4) Begin replaying in order until the highest transaction ID is reached.
5) Write the checkpoint transaction and mark the filesystem clean.
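Steps 1-3 of the scan can be modeled with a small function; here a
"block" is reduced to just its two TID fields, TIDs are assumed to
start at 1, and all names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal model of a journal block: just its header's TID fields. */
struct blk {
	uint64_t tid;		/* transaction ID of this block */
	uint64_t lasttid;	/* last transaction ID committed */
};

/*
 * Return the TID from which replay must start: find the highest TID
 * (step 1), read its lasttid (step 2), then find the lowest TID
 * strictly after it (step 3).  Returns 0 if nothing needs replay
 * (assumes valid TIDs are nonzero).
 */
uint64_t
replay_start(const struct blk *b, size_t n)
{
	size_t hi = 0;
	uint64_t start = 0;

	for (size_t i = 1; i < n; i++)
		if (b[i].tid > b[hi].tid)
			hi = i;
	for (size_t i = 0; i < n; i++)
		if (b[i].tid > b[hi].lasttid &&
		    (start == 0 || b[i].tid < start))
			start = b[i].tid;
	return (start);
}
```

Step 4 would then walk blocks in TID order from that starting point,
and step 5 writes the checkpoint.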
Unmounting of the filesystem will include a full commit of the journal
(in-core and on-disk), and a write of the checkpoint opcode to the first
journal block.
Given the nature of what we are doing (and how), it is incompatible to
mount a filesystem both journaled and softdep-ed; our code will
prevent an admin/user from trying to do both at once, with a deny
message rather than a silent failure.
Issues that we are having now include how and when to increment the
transaction ID. Transaction IDs are used to group operations together,
such that related operations are completed together, and to guarantee
replay safety. For example, a rename(2) is a combination of a link and
an unlink. It works something like this:
TID=5
rename(2) call made
TID++
link (opcode tagged with TID 6)
unlink (opcode tagged with TID 6)
TID=6
Later, when this is flushed to disk, the system will make sure that
all opcodes with the same TID are written together and not split
across blocks. The TID in the header of the block will be the TID of
the last opcode in that block, so the block becomes a
super-transaction of all of them (potentially thousands of smaller
transactions).
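The rename walkthrough above (bump the TID once, tag every constituent
opcode with it) can be sketched as follows; the opcode values and
function name are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical opcode numbers and in-core opcode record. */
enum { OP_LINK = 1, OP_UNLINK = 2 };
struct op {
	uint16_t opcode;
	uint64_t tid;
};

/*
 * Journal a rename(2): increment the TID once, then tag both the
 * link and the unlink with that same TID so they are replayed (or
 * dropped) as a unit.
 */
void
journal_rename(uint64_t *tidp, struct op out[2])
{
	uint64_t tid = ++*tidp;

	out[0] = (struct op){ OP_LINK, tid };
	out[1] = (struct op){ OP_UNLINK, tid };
}
```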
An unlink would be similar to this (assuming no processes holding the
file open, and a link count of 1):
TID=6
unlink(2) call made
TID++
unlink
inode update (link_cnt--)
inode update (free)
truncate
TID=7
A flush to disk at this point would produce the following:
Header { TID = 7 , count=6, lastTID=5 }
opcodes {
link
unlink <--- these were the rename(2)
unlink
inode update
inode update
truncate
}
This block could then be safely replayed multiple times. (Consider a
crash where this block had been committed but the checkpoint had not
yet been written; fsck would then replay it, since it could not know
that it had already been done.)
These examples are relatively easy; what we are running into problems
with are operations that bypass the VFS layer. An example is mmap(2)ing
a sparse file: a write access to the middle of the file could trigger
a large number of updates (inode changes, direct block allocations,
indirect block allocations, and fragment promotions). In this
situation, and in our model, how and where would we increment the
transaction ID?
Notes:
(1) I do not know how to actually do this within the kernel, pointers
here would be appreciated.
(2) This currently conflicts with MNT_IGNORE. Is this a problem?
What should we use?
(3) There is another problem here, files that were held open when the
system crashed. They could have a reference count of zero, but
still have allocated data. It seems that an fsck would still be
required to walk the inode tables and put these files "somewhere",
or just free the blocks they were using. Can anyone think of a
better way to do this?
--
David E. Cross
More information about the freebsd-fs
mailing list