Constant rebooting after power loss

Sat Apr 2 03:35:52 UTC 2011

    The core of the issue here comes down to two things:

    First, a power loss to the drive will cause the drive's dirty write cache
    to be lost; that data will not make it to disk.  Nor do you really want
    to turn off write caching on the physical drive.  Well, you CAN turn it
    off, but if you do, performance will become so bad that there's no point.
    So turning off the write caching is really a non-starter.

    The solution to this first item is for the OS/filesystem to issue a
    disk flush command to the drive at appropriate times.  If I recall
    correctly, the ZFS implementation in FreeBSD *DOES* do this for
    transaction groups, which guarantees that a prior transaction group is
    fully synced before a new one starts running (HAMMER in DragonFly also
    does this).
    (Just getting an 'ack' from the write transaction over the SATA bus only
    means the data made it to the drive's cache, not that it made it to
    the platter).
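
    From the application side, the usual way to ask for this is fsync().
    Below is a minimal sketch in Python, not anything taken from ZFS or
    HAMMER.  The function name durable_write is made up for illustration,
    and whether fsync() actually results in a cache flush command reaching
    the drive depends on the OS and filesystem (some platforms need an
    extra step, such as F_FULLFSYNC on macOS).

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, then ask the OS to push it to stable storage.

    Hypothetical helper for illustration. os.fsync() asks the kernel to
    flush the file; whether the drive's own cache is flushed as well is
    up to the OS and filesystem.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may sit in OS buffers (and then in
        # the drive cache) for an unbounded amount of time.
        os.fsync(fd)
    finally:
        os.close(fd)
```

    A return from durable_write() only means the OS reported the flush as
    complete; it is still worth confirming what your filesystem does with
    that request.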

    I'm not sure about UFS vis-a-vis the recent UFS logging features...
    it might be an option but I don't know if it is a default.  Perhaps
    someone can comment on that.

    One last note here.  Many modern drives have very large RAM caches.
    OCZ's SSDs have something like 256MB write caches and many modern HDs
    now come with 32MB and 64MB caches.  Aged drives with lots of relocated
    sectors and bit errors can also take a very long time to perform writes
    on certain sectors.  So these large caches take time to drain and one
    can't really assume that an acknowledged write to disk will actually
    make it to the disk under adverse circumstances any more.  All sorts
    of bad things can happen.
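
    To put rough numbers on the drain time, here is a back-of-the-envelope
    sketch.  The cache sizes are the ones mentioned above; the write rates
    are purely hypothetical figures chosen for illustration, not
    measurements of any drive.

```python
def drain_seconds(cache_mib: float, write_rate_mib_s: float) -> float:
    """Estimate how long a full write cache takes to reach the media.

    cache_mib:        amount of dirty data sitting in the drive cache
    write_rate_mib_s: ASSUMED effective rate at which the drive can
                      commit that data (hypothetical, not measured)
    """
    if write_rate_mib_s <= 0:
        raise ValueError("write rate must be positive")
    return cache_mib / write_rate_mib_s


# A 64MB cache on a drive that, in this made-up scenario, is only
# managing 1 MiB/s because of relocated sectors and retries:
slow = drain_seconds(64, 1.0)

# A 256MB cache draining at an assumed healthy 100 MiB/s:
fast = drain_seconds(256, 100.0)

print(f"degraded drive: {slow:.1f} s, healthy drive: {fast:.2f} s")
```

    Under those assumptions the degraded drive would be holding
    acknowledged but uncommitted data for about a minute, which is a long
    window in which to lose power.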

    Finally, the drives don't order their writes to the platter (you can
    set a bit to tell them to, but like many similar bits in the past there
    is no real guarantee that the drives will honor it).  So if two
    transactions do not have a disk flush command in between them, it is
    possible for data from the second transaction to commit to the platter
    before all the data from the first transaction commits to the platter.
    Or worse, for the non-transactional data to update out of order relative
    to the transactional data which was supposed to commit first.

    Hence IMHO the OS/filesystem must use the disk flush command in such
    situations for good reliability.
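
    The ordering requirement can be sketched from user space as a flush
    acting as a barrier between two transactions.  This is only an
    illustration under assumptions: commit_in_order and _write_all are
    hypothetical names, and it relies on the OS turning fsync() into a
    real flush of the drive cache, which is not guaranteed everywhere.

```python
import os


def _write_all(fd: int, data: bytes) -> None:
    """Write every byte of data, looping over short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def commit_in_order(path: str, first: bytes, second: bytes) -> None:
    """Append two transactions with a flush between them.

    Hypothetical helper: the fsync() after `first` is the barrier.
    Without it, nothing stops data from `second` reaching the media
    before all of `first` has.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        _write_all(fd, first)
        os.fsync(fd)   # barrier: first transaction flushed before second starts
        _write_all(fd, second)
        os.fsync(fd)   # flush the second transaction as well
    finally:
        os.close(fd)
```

    The cost is one extra flush per transaction boundary, which is far
    cheaper than running with the drive's write cache turned off.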

    --

    The second problem is that a physical loss of power to the drive can
    cause the drive to physically lose one or more sectors, and can even
    effectively destroy the drive (even with the fancy auto-park)... if the
    drive happens to be in the middle of a track write-back when power is
    lost it is possible to lose far more than a single sector, including
    sectors unrelated to recent filesystem operations.

    The only solution to #2 is to make sure your machines (or at least the
    drives if they happen to be in external enclosures) are connected to
    a UPS and that the machines are communicating with the UPS via
    something like the "apcupsd" port.  AND also that you test to make
    sure the machines properly shut themselves down when AC is lost before
    the UPS itself runs out of battery time.  After all, a UPS won't help
    if the machines don't at least idle their drives before power is lost!!!

    I learned this lesson the hard way about 3 years ago.  I had something
    like a dozen drives in two RAID arrays doing heavy write activity and
    lost physical power.  Several of the drives were totally destroyed,
    with thousands of sector errors.  Not just one or two... thousands.

    (It is unclear how SSDs react to physical loss of power during heavy
    writing activity.  Theoretically, while they will certainly lose their
    write cache, they shouldn't wind up with any read errors).

						-Matt