UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY

Mon Sep 29 17:44:12 UTC 2008

    A couple of things to note here.  Well, many things actually.

    * Turning off write caching, assuming the drive even looks at the bit,
      will destroy write performance for any driver which does not support
      command queueing.  So, for example, SCSI typically has command
      queueing (as long as the underlying drive firmware actually implements
      it properly), 3Ware cards have it (underlying drives, if SATA, may not,
      but 3Ware's firmware itself might do the right thing). 

      The FreeBSD ATA driver does not, not even in AHCI mode.  The RAID
      code does not as far as I can tell.  You don't want to turn write
      caching off for those.

    * Filesystems like ZFS and HAMMER make no assumptions on write
      ordering to disk for completed write I/O vs future write I/O
      and use BIO_FLUSH to enforce ordering on-disk.  These filesystems
      are able to queue up large numbers of parallel writes in between
      each BIO_FLUSH, so the flush operation has only a very small
      effect on actual performance.

      Numerous Linux filesystems also use the flush command and do
      not make assumptions on BIO-completion/future-BIO ordering.

    * UFS + softupdates assumes write ordering between completed BIOs
      and future BIOs.  This doesn't hold true on a modern drive (with
      write caching turned on).  Unfortunately it is ALSO not really
      the cause behind most of the inconsistency reports.

      UFS was *never* designed to deal with disk flushing.  Softupdates
      was never designed with a BIO_FLUSH command in mind.  They were
      designed for formally ordered I/O (bowrite) which fell out of
      favor about a decade ago and has since been removed from most 
      operating systems.

    * Don't get stuck in a rut and blame DMA/Drive/firmware for all the
      troubles.  It just doesn't happen often enough to even come close
      to being responsible for the number of bug reports.

    With some work UFS can be modified to use BIO_FLUSH, but performance
    will probably degrade considerably.  Softupdates orders I/Os based on
    BIO completions, so the only way to do it is to hold the completed
    write BIOs (not biodone() them) until something gets stuck, or enough
    build up, then issue a BIO_FLUSH and, after it returns, finish
    completing the BIOs (call biodone()) for the prior write I/Os.  This
    will cause softupdates to work properly.
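
    The hold-then-flush scheme described above can be sketched as a toy
    model in Python.  This is only an illustration, not kernel code:
    the class name, flush_fn, max_held and the callbacks are all
    hypothetical stand-ins (flush_fn for issuing BIO_FLUSH, done_cb for
    biodone()), and a real implementation would also need a trigger for
    the "something gets stuck" case.

```python
from collections import deque


class DeferredCompletionQueue:
    """Toy model: hold write completions until a cache flush has returned."""

    def __init__(self, flush_fn, max_held=4):
        self.flush_fn = flush_fn  # hypothetical stand-in for issuing BIO_FLUSH
        self.max_held = max_held  # hypothetical "enough build up" threshold
        self.held = deque()       # writes the drive acked, not yet completed

    def write_acked(self, bio_id, done_cb):
        """The drive reported the write done; defer telling the filesystem."""
        self.held.append((bio_id, done_cb))
        if len(self.held) >= self.max_held:
            self.flush_and_complete()

    def flush_and_complete(self):
        """Flush the drive cache, then complete every held write in order."""
        if not self.held:
            return
        self.flush_fn()           # assumed to return once data is on the media
        while self.held:
            bio_id, done_cb = self.held.popleft()
            done_cb(bio_id)       # stand-in for calling biodone()
```

    In this model the filesystem never sees a completion for a write
    that might still be sitting in the drive's cache, which is the
    property softupdates depends on.  The cost is visible too: every
    completion waits for the next flush.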

    Another option would be to complete the BIOs but do major surgery on
    softupdates itself to mark the dependencies as waiting for a flush,
    then flush proactively and re-sync.

    Unfortunately, this will not solve the whole problem.  IF THE DRIVE
    DOESN'T LOSE POWER IT WILL FLUSH THE BIOs IT SAID WERE COMPLETED.
    In other words, unless you have an actual power failure the assumptions
    softupdates makes will hold.  A kernel crash does NOT prevent the actual
    drive from flushing the IOs in its cache.  The disk can wind up with
    unexpected softupdates inconsistencies on reboot anyway.  Thus the
    source of most of the inconsistency reports will not be fixed by adding
    this feature.  So more work is needed on top of that.
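
    To make the distinction between a kernel crash and a power failure
    concrete, here is a toy Python model of a drive with a volatile
    write cache.  It is a sketch under simplifying assumptions, not a
    model of any real firmware; the class and method names are made up
    for illustration.

```python
class ToyDrive:
    """Toy drive with a volatile write cache (not a model of real firmware)."""

    def __init__(self):
        self.cache = {}  # acknowledged writes that are not yet on the media
        self.media = {}  # what is actually there after a reboot

    def write(self, block, data):
        """Acknowledge the write as soon as it is in the cache."""
        self.cache[block] = data

    def flush(self):
        """Move every cached write to the media (what BIO_FLUSH asks for)."""
        self.media.update(self.cache)
        self.cache.clear()

    def host_crash(self):
        """The kernel dies but the drive stays powered: the cache drains."""
        self.flush()

    def power_loss(self):
        """Power is cut: anything still in the cache is gone."""
        self.cache.clear()
```

    After host_crash() every acknowledged write is on the media, so any
    inconsistency found on reboot has to come from what the kernel chose
    to write, not from the drive.  Only power_loss() can drop
    acknowledged writes.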

    --

    Nearly ALL of the unexpected softupdates inconsistencies you see *ARE*
    for the case where the drive DOES in fact get all the BIO data it
    returned as completed onto the disk media.  This has happened to me
    many, many times with UFS.  I'm repeating this:  Short of an actual
    power failure, any I/Os sent to and acknowledged by the drive are
    flushed to the media before the drive resets.  A FreeBSD crash does
    not magically prevent the drive from flushing out its internal queues.

    This means that there are bugs in softupdates & the kernel which can
    result in unexpected inconsistencies on reboot.  Nobody has ever
    life-tested softupdates to try to locate and fix the issues.  Though I
    do occasionally see commits that try to fix various issues, they tend
    to be more for live-side non-crash cases than for crash cases.

    Some easy areas which can be worked on:

    * Don't flush the buffer cache on a crash.   Some of you already do this
      for other reasons (it makes it more likely that you can get a crash
      dump).

      The kernel's flushing of the buffer cache is likely a cause of a
      good chunk of the inconsistency reports by fsck, because unless
      someone worked on the buffer flushing code it likely bypasses
      softupdates.  I know when working on HAMMER I had to add a bioop
      explicitly to allow the kernel flush-buffers-on-crash code to query
      whether it was actually ok to flush a dirty buffer or not.  Until I
      did that DragonFly was flushing HAMMER buffers on crash which
      it had absolutely no business flushing.

    * Implement active dependency flushing in softupdates.  Instead of
      having it just adjust the dependencies for later flushes softupdates
      needs to actively initiate I/O for the dependencies as they are
      resolved.  To do this will require implementing a flush queue;
      you can't just recurse (you will blow out the kernel stack).

      If you don't do this then you have to sync about a dozen times,
      with short delays between each sync, to ensure that all the
      dependencies are flushed.  The only time this is done automatically
      is during a nominal umount during shutdown.

    * Once the above two are fixed start testing within virtual environments
      by virtually pulling the plug, and virtually crashing the kernel,
      then running fsck to determine if an unexpected softupdates
      inconsistency occurred.  There are probably numerous cases that remain.
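
    The flush queue from the second item above, as opposed to
    recursion, can be sketched roughly like this in Python.  It is a
    toy under heavy assumptions: the deps map, write_fn and the block
    names are hypothetical, and real softupdates dependencies are far
    richer than a simple "A before B" graph.

```python
from collections import deque


def flush_with_queue(deps, write_fn):
    """Write every block, each only after the blocks it depends on.

    deps maps a block to the set of blocks that must be written first.
    Returns the write order.  Uses an explicit queue, never recursion.
    """
    remaining = {blk: set(d) for blk, d in deps.items()}
    for d in deps.values():
        for blk in d:
            remaining.setdefault(blk, set())  # blocks named only as a dep

    dependents = {blk: [] for blk in remaining}
    for blk, d in remaining.items():
        for dep in d:
            dependents[dep].append(blk)

    # Start with the blocks that have nothing outstanding.
    queue = deque(sorted(b for b, d in remaining.items() if not d))
    order = []
    while queue:
        blk = queue.popleft()
        write_fn(blk)  # stand-in for initiating the I/O
        order.append(blk)
        for nxt in dependents[blk]:
            remaining[nxt].discard(blk)
            if not remaining[nxt]:  # last dependency resolved
                queue.append(nxt)   # queue it instead of recursing
    if len(order) != len(remaining):
        raise ValueError("dependency cycle: some blocks were never written")
    return order
```

    For example, with deps = {"dirent": {"inode"}, "inode": {"bitmap"},
    "bitmap": set()} the blocks come out in the order bitmap, inode,
    dirent in a single pass.  A chain thousands of blocks deep costs
    queue memory rather than stack depth, which is the point of not
    recursing.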

    Of course, what you guys decide to do with your background fsck is up
    to you, but it seems to have been a thorn in the side of FreeBSD from
    the day it was introduced, along with snapshots.  I very explicitly avoided
    porting both the background fsck and softupdates snapshot code to DFly
    due to their lack of stability.

    The simple fact of the matter is that UFS just does not recover well
    on a large disk.  Anything over 30-40 million inodes and you risk
    not being able to fsck the drive at all, not even in 64-bit mode (you
    will run out of swap).  You get one inconsistency and the filesystem
    is broken forever.  Anything over 200GB and your background fsck can
    wind up taking hours, seriously degrading the performance of the system
    in the process.  It can take 6 hours to fsck a full 1TB HD.  It can
    take over a day to fsck larger setups.  Putting in a few sleeps here
    and there just makes the run time even longer and perpetuates the pain.

    My recommendation?  Default UFS back to a synchronous fsck and stop
    treating ZFS (your only real alternative) as being so ultra-alpha that
    it shouldn't be used.  Start recommending it for any filesystem larger
    than 200GB.  Clean up the various UI issues that can lead to self
    immolation and foot stomping.  Fix the defaults so they don't blow out
    kernel malloc areas, etc etc.  Fix whatever bugs pop up.  UFS is
    already unsuitable for 'common' 1TB consumer drives even WITH the
    background fsck.  ZFS is ALREADY far safer to use than UFS for
    large disks, given reasonable constraints on feature selection.

						-Matt