UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY
dillon at apollo.backplane.com
Mon Sep 29 17:44:12 UTC 2008
A couple of things to note here. Well, many things actually.
* Turning off write caching, assuming the drive even looks at the bit,
will destroy write performance for any driver which does not support
command queueing. So, for example, SCSI typically has command
queueing (as long as the underlying drive firmware actually implements
it properly), 3Ware cards have it (underlying drives, if SATA, may not,
but 3Ware's firmware itself might do the right thing).
The FreeBSD ATA driver does not, not even in AHCI mode. The RAID
code does not as far as I can tell. You don't want to turn this off.
* Filesystems like ZFS and HAMMER make no assumptions on write
ordering to disk for completed write I/O vs future write I/O
and use BIO_FLUSH to enforce ordering on-disk. These filesystems
are able to queue up large numbers of parallel writes in between
each BIO_FLUSH, so the flush operation has only a very small
effect on actual performance.
Numerous Linux filesystems also use the flush command and do
not make assumptions on BIO-completion/future-BIO ordering.
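
To make the flush-as-barrier idea concrete, here is a toy Python model. It is only a sketch under stated assumptions, not how ZFS or HAMMER are actually implemented: `ToyDisk`, `commit` and the block names are invented for illustration, and `flush()` merely stands in for BIO_FLUSH. The point it shows is that a large batch of writes can be issued in any order and still costs only two flushes.

```python
class ToyDisk:
    """Toy drive with a volatile write cache (hypothetical model, not a driver)."""

    def __init__(self):
        self.cache = {}    # acknowledged writes that are not yet on the media
        self.media = {}    # writes that would survive a power loss
        self.flushes = 0

    def write(self, block, data):
        # The drive acknowledges as soon as the data is in its cache.
        self.cache[block] = data

    def flush(self):
        # Stand-in for BIO_FLUSH: everything acknowledged so far becomes durable.
        self.media.update(self.cache)
        self.cache.clear()
        self.flushes += 1

    def power_loss(self):
        # Anything that only lived in the cache is gone.
        self.cache.clear()


def commit(disk, blocks, commit_block):
    """Write a batch in any order, flush, then write a commit record and flush."""
    for block, data in blocks.items():
        disk.write(block, data)
    disk.flush()                               # barrier 1: the batch is on the media
    disk.write(commit_block, sorted(blocks))   # record which blocks the batch held
    disk.flush()                               # barrier 2: the commit record is too


disk = ToyDisk()
commit(disk, {n: f"data-{n}" for n in range(1000)}, commit_block="commit-0")
```

In this model a thousand writes are covered by two flushes, and a power loss between the two barriers leaves the batch on the media without a commit record, never a commit record without its batch.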
* UFS + softupdates assumes write ordering between completed BIOs
and future BIOs. This doesn't hold true on a modern drive (with
write caching turned on). Unfortunately it is ALSO not really
the cause behind most of the inconsistency reports.
UFS was *never* designed to deal with disk flushing. Softupdates
was never designed with a BIO_FLUSH command in mind. They were
designed for formally ordered I/O (bowrite) which fell out of
favor about a decade ago and has since been removed from most
* Don't get stuck in a rut and blame DMA/Drive/firmware for all the
troubles. It just doesn't happen often enough to even come close
to being responsible for the number of bug reports.
With some work UFS can be modified to do it, but performance will
probably degrade considerably because the only way to do it is to
hold the completed write BIOs (not biodone() them) until something
gets stuck, or enough build up, then issue a BIO_FLUSH and, after
it returns, finish completing the BIOs (call the biodone()) for the
prior write I/Os. This will cause softupdates to work properly.
Softupdates orders I/Os based on BIO completions.
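
The hold-then-flush scheme described above can be sketched as a toy Python model. This is not kernel code: `HeldCompletions`, `max_held` and the callbacks are hypothetical names, the `flush` callable stands in for BIO_FLUSH, and calling a held callback stands in for biodone().

```python
class HeldCompletions:
    """Hold write completions until a flush has made them durable (toy model)."""

    def __init__(self, flush, max_held=64):
        self.flush = flush          # callable standing in for BIO_FLUSH
        self.max_held = max_held    # "enough build up" threshold (assumed value)
        self.held = []              # completion callbacks not yet delivered

    def write_acknowledged(self, on_done):
        # The drive says the write is complete; do not tell the filesystem yet.
        self.held.append(on_done)
        if len(self.held) >= self.max_held:
            self.release()

    def release(self):
        # Flush first, then deliver only the completions that the flush covers.
        if not self.held:
            return
        batch, self.held = self.held, []
        self.flush()
        for on_done in batch:
            on_done()


events = []
q = HeldCompletions(flush=lambda: events.append("flush"), max_held=3)
for n in range(3):
    q.write_acknowledged(lambda n=n: events.append(f"done-{n}"))
```

After the third write the queue flushes once and only then reports the three completions. A caller that is stuck waiting on a held write would call `release()` directly, which is where the performance cost comes from.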
Another option would be to complete the BIOs but do major surgery on
softupdates itself to mark the dependencies as waiting for a flush,
then flush proactively and re-sync.
Unfortunately, this will not solve the whole problem. IF THE DRIVE
DOESN'T LOSE POWER IT WILL FLUSH THE BIOs IT SAID WERE COMPLETED.
In other words, unless you have an actual power failure the assumptions
softupdates makes will hold. A kernel crash does NOT prevent the actual
drive from flushing the IOs in its cache. The disk can wind up with
unexpected softupdates inconsistencies on reboot anyway. Thus the
source of most of the inconsistency reports will not be fixed by adding
this feature. So more work is needed on top of that.
Nearly ALL of the unexpected softupdates inconsistencies you see *ARE*
for the case where the drive DOES in fact get all the BIO data it
returned as completed onto the disk media. This has happened to me
many, many times with UFS. I'm repeating this: Short of an actual
power failure, any I/Os sent to and acknowledged by the drive are
flushed to the media before the drive resets. A FreeBSD crash does
not magically prevent the drive from flushing out its internal queues.
This means that there are bugs in softupdates & the kernel which can
result in unexpected inconsistencies on reboot. Nobody has ever
life-tested softupdates to try to locate and fix the issues. Though I
do occasionally see commits that try to fix various issues, they tend
to be more for live-side non-crash cases than for crash cases.
Some easy areas which can be worked on:
* Don't flush the buffer cache on a crash. Some of you already do this
for other reasons (it makes it more likely that you can get a crash
The kernel's flushing of the buffer cache is likely a cause of a
good chunk of the inconsistency reports by fsck, because unless
someone worked on the buffer flushing code it likely bypasses
softupdates. I know when working on HAMMER I had to add a bioop
explicitly to allow the kernel flush-buffers-on-crash code to query
whether it was actually ok to flush a dirty buffer or not. Until I
did that DragonFly was flushing HAMMER buffers on crash which
it had absolutely no business flushing.
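
A minimal sketch of that kind of hook, as a toy Python model. The names `Buffer`, `check_write` and `flush_on_crash` are hypothetical and are not the DragonFly or FreeBSD interfaces; the sketch only shows the shape of "ask the owning filesystem before writing a dirty buffer during a crash".

```python
class Buffer:
    def __init__(self, name, dirty=True, check_write=None):
        self.name = name
        self.dirty = dirty
        # Hypothetical per-filesystem hook: returns True if the buffer
        # may safely be written right now, False to veto the write.
        self.check_write = check_write


def flush_on_crash(buffers, write):
    """Write only the dirty buffers whose owner agrees; return the names skipped."""
    skipped = []
    for buf in buffers:
        if not buf.dirty:
            continue
        if buf.check_write is not None and not buf.check_write(buf):
            skipped.append(buf.name)    # owner vetoed: leave it dirty
            continue
        write(buf)
        buf.dirty = False
    return skipped


written = []
buffers = [
    Buffer("no-hook"),                                   # no hook: flushed as before
    Buffer("vetoed", check_write=lambda buf: False),     # owner says no
    Buffer("clean", dirty=False),                        # nothing to do
]
skipped = flush_on_crash(buffers, lambda buf: written.append(buf.name))
```

A buffer without a hook is flushed exactly as before, which is the situation the paragraph above suspects for softupdates-managed buffers.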
* Implement active dependency flushing in softupdates. Instead of
having it just adjust the dependencies for later flushes softupdates
needs to actively initiate I/O for the dependencies as they are
resolved. To do this will require implementing a flush queue,
you can't just recurse (you will blow out the kernel stack).
If you don't do this then you have to sync about a dozen times,
with short delays between each sync, to ensure that all the
dependencies are flushed. The only time this is done automatically
is during a nominal umount during shutdown.
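
The flush-queue idea can be sketched in Python as an explicit work queue that replaces recursion. This is a toy model under stated assumptions: `flush_with_queue`, `must_follow` and the integer block names are invented here, and softupdates' real dependency structures are far richer than a simple "write A before B" graph.

```python
from collections import deque


def flush_with_queue(must_follow, start, write):
    """Write `start`, then each block as soon as all of its prerequisites are written.

    must_follow maps a block to the blocks that may only be written after it.
    """
    waiting = {}                                # block -> unwritten prerequisites
    for dependents in must_follow.values():
        for dep in dependents:
            waiting[dep] = waiting.get(dep, 0) + 1

    queue = deque(start)                        # blocks that are ready for I/O
    order = []
    while queue:
        block = queue.popleft()
        write(block)
        order.append(block)
        for dep in must_follow.get(block, ()):
            waiting[dep] -= 1
            if waiting[dep] == 0:               # dependency resolved: queue I/O now
                queue.append(dep)
    return order


# A dependency chain 100,000 blocks deep, flushed in a single pass.
chain = {n: [n + 1] for n in range(100_000)}
order = flush_with_queue(chain, [0], lambda block: None)
```

A naive recursive walk of the same chain would exceed Python's default recursion limit, which is the user-space analogue of blowing out the kernel stack. With the queue, one pass writes everything in dependency order instead of needing repeated syncs.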
* Once the above two are fixed start testing within virtual environments
by virtually pulling the plug, and virtually crashing the kernel.
Then fsck to determine whether an unexpected softupdates inconsistency
occurred. There are probably numerous cases that remain.
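
As a rough illustration of the virtual plug-pull test, the toy Python harness below replays every prefix of a write log and runs a consistency check on each resulting image. Everything here is an assumption made for the sketch: `crash_points`, the block names and the one-rule checker are hypothetical, and the model assumes writes reach the media in log order, which a real drive with write caching does not guarantee.

```python
def crash_points(write_log, is_consistent):
    """Cut the write log at every point and report the cuts that fail the check."""
    failures = []
    for cut in range(len(write_log) + 1):
        media = {}
        for block, data in write_log[:cut]:     # only writes before the "crash"
            media[block] = data
        if not is_consistent(media):            # stand-in for running fsck
            failures.append(cut)
    return failures


def entry_has_inode(media):
    # Toy rule: a directory entry must never reach the disk before its inode.
    return "dir" not in media or "inode-7" in media


good_log = [("inode-7", "initialized"), ("dir", "entry -> inode-7")]
bad_log = [("dir", "entry -> inode-7"), ("inode-7", "initialized")]

good_failures = crash_points(good_log, entry_has_inode)
bad_failures = crash_points(bad_log, entry_has_inode)
```

The correctly ordered log survives a crash at every cut, while the misordered one fails at the cut where only the directory entry has landed. A real test would snapshot a virtual disk image and run fsck on it rather than a one-line rule.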
Of course, what you guys decide to do with your background fsck is up
to you, but it seems to be a thorn in the side of FreeBSD from the day
it was introduced, along with snapshots. I very explicitly avoided
porting both the background fsck and softupdates snapshot code to DFly
due to their lack of stability.
The simple fact of the matter is that UFS just does not recover well
on a large disk. Anything over 30-40 million inodes and you risk
not being able to fsck the drive at all, not even in 64-bit mode (you
will run out of swap). You get one inconsistency and the filesystem
is broken forever. Anything over 200GB and your background fsck can
wind up taking hours, seriously degrading the performance of the system
in the process. It can take 6 hours to fsck a full 1TB HD. It can
take over a day to fsck larger setups. Putting in a few sleeps here
and there just makes the run time even longer and perpetuates the pain.
My recommendation? Default UFS back to a synchronous fsck and stop
treating ZFS (your only real alternative) as being so ultra-alpha that
it shouldn't be used. Start recommending it for any filesystem larger
than 200GB. Clean up the various UI issues that can lead to self
immolation and foot stomping. Fix the defaults so they don't blow out
kernel malloc areas, etc etc. Fix whatever bugs pop up. UFS is
already unsuitable for 'common' 1TB consumer drives even WITH the
background fsck. ZFS is ALREADY far safer to use than UFS for
large disks, given reasonable constraints on feature selection.