UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY

Tue Sep 30 19:00:31 UTC 2008

:The topic of BIO_FLUSH is something I got to thinking about last night
:at work; the only condition where a disk with write caching enabled
:*would not* fully write the data to the platter would in fact be power
:loss.  All other conditions (specifically soft reset and panic) should
:not require explicit flushing.
:
:I wonder why this is being done, especially on shutdown of FreeBSD.
:Assuming I understand it correctly, I'm talking about this:
:
:Waiting (max 60 seconds) for system process `bufdaemon' to stop...done
:Waiting (max 60 seconds) for system process `syncer' to stop...
:Syncing disks, vnodes remaining...3 3 3 2 2 0 0 done
:All buffers synced.
:
:-- 
:| Jeremy Chadwick                                jdc at parodius.com |

    BIO_FLUSH and "Syncing disks, vnodes ..." are two different things,
    so I'm not sure of the context but I will describe issues with both.

    --

    BIO_FLUSH commands the disk firmware to flush out any dirty buffers in
    its drive cache.   That is, writes that you have *already* issued to
    the drive and which returned completion, but which have not actually
    made it to the physical media yet.  This is different from dirty buffers
    still being maintained by the kernel which have not yet been sent to
    the drive.  (Just repeating this so the definition is clear to all
    the readers).

    So, yes, you would want to do a BIO_FLUSH before powering down a 
    machine (halt -p) to ensure that all the dirty data you sent to the 
    disk actually gets to the platter.
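
    To make the two layers concrete, here is a toy userland model in C.
    It is only a sketch and not kernel code: every name in it (toy_disk,
    toy_sync_buffers, toy_bio_flush, toy_durable_blocks) is made up for
    illustration, and it assumes each block lives in exactly one of three
    places at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code: where one block of data currently lives. */
enum where { IN_KERNEL_CACHE, IN_DRIVE_CACHE, ON_PLATTER };

#define NBLOCKS 4

struct toy_disk {
    enum where blk[NBLOCKS];
};

/*
 * Kernel-side sync ("Syncing disks, vnodes ..."): hand every dirty kernel
 * buffer to the drive.  In this model the drive reports completion as soon
 * as the data sits in its own cache.
 */
void toy_sync_buffers(struct toy_disk *d)
{
    assert(d != NULL);
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (d->blk[i] == IN_KERNEL_CACHE)
            d->blk[i] = IN_DRIVE_CACHE;
    }
}

/*
 * Drive-side flush (the role BIO_FLUSH plays): ask the drive to push
 * everything in its cache to the physical media.
 */
void toy_bio_flush(struct toy_disk *d)
{
    assert(d != NULL);
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (d->blk[i] == IN_DRIVE_CACHE)
            d->blk[i] = ON_PLATTER;
    }
}

/* Count the blocks that would survive an immediate loss of power. */
size_t toy_durable_blocks(const struct toy_disk *d)
{
    size_t n = 0;

    assert(d != NULL);
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (d->blk[i] == ON_PLATTER)
            n++;
    }
    return n;
}
```

    In this model toy_sync_buffers() on its own never changes what
    toy_durable_blocks() returns; only a toy_bio_flush() issued afterwards
    does, which is why the flush belongs at the end of a power-down.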

    I think you also want to issue it for a soft reset.  It would not
    affect a SATA drive but it certainly would affect a USB drive powered
    from the computer.  USB ports will be powered down during a soft
    reset.  BIO_FLUSH isn't likely to cause problems during a crash, unlike
    flushing the buffer cache.

    Some people may remember that earlier versions of Windows XP often
    powered the machine down before the hard drive managed to write all of
    its data to the platter.  Sometimes that would even destroy sectors on
    the drive.

    We know bad things happen if we don't issue the command, so best not to
    take chances by making assumptions.

    --

    The "Syncing disks, vnodes ..." is the kernel flushing out any dirty
    data in the buffer cache which has not yet been sent to the disk
    driver.

    This is more problematic.  Filesystems such as HAMMER (and presumably
    ZFS) absolutely do NOT want the system to flush dirty buffers unless
    they explicitly give permission to do so, because the dirty buffers
    might represent data for which the recovery information has not yet
    been written out, and thus can corrupt the filesystem on-media if a
    crash were to occur right then.

    In HAMMER's case I enhanced the bioops a bit to allow HAMMER to veto
    write-outs initiated by the system.  sync_on_panic is irrelevant:
    the buffers will not be synced without HAMMER's permission, and it
    won't give it.
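
    The veto idea can be sketched roughly as follows.  This is a toy
    model and not the actual HAMMER or bioops code: the structure, the
    checkwrite callback, and the recovery_info_on_disk flag are all
    hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_buf;

/* Hypothetical per-filesystem hook table, loosely modelled on bioops. */
struct toy_bioops {
    /* Return false to veto a write-out the system wants to start. */
    bool (*checkwrite)(const struct toy_buf *bp);
};

struct toy_buf {
    bool dirty;                   /* holds data not yet sent to the driver */
    bool recovery_info_on_disk;   /* recovery data for this buffer is safe */
    const struct toy_bioops *ops; /* owning filesystem's hooks, may be NULL */
};

/* Example policy: refuse the write until the recovery data is on disk. */
bool toy_fs_checkwrite(const struct toy_buf *bp)
{
    return bp->recovery_info_on_disk;
}

/*
 * System-initiated sync.  Every dirty buffer is offered to its filesystem
 * first; vetoed buffers stay dirty.  Returns the number actually written.
 */
size_t toy_system_sync(struct toy_buf *bufs, size_t n)
{
    size_t written = 0;

    assert(bufs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct toy_buf *bp = &bufs[i];

        if (!bp->dirty)
            continue;
        if (bp->ops != NULL && bp->ops->checkwrite != NULL &&
            !bp->ops->checkwrite(bp))
            continue;           /* vetoed by the filesystem */
        bp->dirty = false;      /* stands in for the actual write */
        written++;
    }
    return written;
}
```

    The point of the design is that the generic sync path never has to
    know why a buffer is unsafe to write; it only has to ask first.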

    There is also the very real general case where a traditional filesystem
    such as UFS must perform multiple buffer cache ops, dirtying multiple
    buffer cache buffers, in order to complete an operation.  If a crash
    were to occur right in the middle of such a sequence the kernel would
    wind up writing dirty buffers related to incomplete operations to the
    media, resulting in corruption.

    In the case of softupdates one is presented with a conundrum.  If you
    don't write out the buffer cache during a crash you stand to lose a lot
    more than 60 seconds worth of changes due to deep dependency chains.
    One 'sync' doesn't do the job.  Even though it is supposed to get all
    the primary data and meta-data onto the disk, leaving just the bitmap
    updates for background operations, it doesn't always seem to do that.
    The softupdates code is very fragile.

    On the other hand, if you *DO* try to write out the buffer cache during
    a crash you have a good chance of deadlocking the system or
    double-panicking, resulting in inconsistencies on the media, and you
    risk doing a partial write-out, also resulting in inconsistencies on
    the media.

    Here is an example:  How does the crash code deal with dirty but locked
    buffer cache buffers?  Say you have a softupdates filesystem and through
    the course of operations you dirty a dozen buffers.  Then a crash occurs
    while you are in the middle of ANOTHER softupdates operation, one which
    has locked several of the buffers already dirtied by previous operations.

    What happens now if the crash code tries to sync the buffer cache?  Will
    it sync the previously dirtied buffers that are currently locked?   Will
    it sync the ones that haven't been locked but skip the ones that are
    locked?  You lose both ways.  There is no way to safely sync ANYTHING,
    whether locked or not, without risking unexpected softupdates
    inconsistencies on-media.  This alone makes background fsck problematic
    and risky.
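
    The lose-both-ways argument can be played out in a toy model.  This
    is a deliberate simplification and not the softupdates code: it
    assumes all dirty buffers belong to one dependency chain that must
    reach the media as a whole, and the names (toy_cbuf, toy_crash_sync,
    toy_media_consistent) are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cbuf {
    bool dirty;     /* modified by an earlier, completed operation */
    bool locked;    /* held by the operation running when the crash hit */
    bool half_done; /* contents reflect only part of that operation */
    bool on_media;  /* set once the crash-time sync has written it */
};

enum toy_policy { SKIP_LOCKED, WRITE_LOCKED };

/* Crash-time sync of the dirty buffers under one of the two policies. */
void toy_crash_sync(struct toy_cbuf *b, size_t n, enum toy_policy p)
{
    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!b[i].dirty)
            continue;
        if (b[i].locked && p == SKIP_LOCKED)
            continue;           /* leaves a hole in the chain */
        b[i].on_media = true;   /* may write half-finished contents */
    }
}

/*
 * Simplifying assumption: the media is consistent only if every dirty
 * buffer was written and none of the written ones was half-done.
 */
bool toy_media_consistent(const struct toy_cbuf *b, size_t n)
{
    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (b[i].dirty && !b[i].on_media)
            return false;       /* partial write-out */
        if (b[i].on_media && b[i].half_done)
            return false;       /* incomplete operation hit the media */
    }
    return true;
}
```

    With one unlocked dirty buffer and one dirty buffer that is locked
    and half-updated, toy_media_consistent() returns false under either
    policy: skipping the locked buffer yields a partial write-out, and
    writing it puts an incomplete operation on the media.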

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>