Constant rebooting after power loss
Matthew Dillon
dillon at apollo.backplane.com
Sat Apr 2 03:35:52 UTC 2011
The core of the issue here comes down to two things:
First, a power loss to the drive will cause the drive's dirty write cache
to be lost; that data will not make it to disk. Nor do you really want
to turn off write caching on the physical drive. Well, you CAN turn it
off, but if you do, performance will become so bad that there's no point.
So turning off the write caching is really a non-starter.
The solution to this first item is for the OS/filesystem to issue a
disk flush command to the drive at appropriate times. If I recall
correctly, the ZFS implementation in FreeBSD *DOES* do this for
transaction groups, which guarantees that a prior transaction group is
fully synced before a new one starts running (HAMMER in DragonFly also
does this).
(Just getting an 'ack' from the write transaction over the SATA bus only
means the data made it to the drive's cache, not that it made it to
the platter).
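
As a rough illustration of what "flush at appropriate times" looks like
from user space, here is a minimal Python sketch. The helper name
durable_write is hypothetical. os.fsync() asks the OS to push the file's
data to stable storage; whether that request actually turns into a cache
flush command on the drive depends on the OS, the filesystem and its
settings, so treat that part as an assumption rather than a guarantee.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and request that it reach stable storage.

    Hypothetical helper for illustration only. Returning from os.write()
    means the data is in the OS cache; os.fsync() is the explicit request
    to push it further down the stack.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            # os.write() may write fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
        # Ask the OS to flush this file's dirty data to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller would use durable_write("journal.bin", record) at a commit
point, not after every small write, since each flush is expensive.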
I'm not sure about UFS vis-a-vis the recent UFS logging features...
it might be an option but I don't know if it is a default. Perhaps
someone can comment on that.
One last note here. Many modern drives have very large RAM caches.
OCZ's SSDs have something like 256MB write caches and many modern HDs
now come with 32MB and 64MB caches. Aged drives with lots of relocated
sectors and bit errors can also take a very long time to perform writes
on certain sectors. So these large caches take time to drain and one
can't really assume that an acknowledged write to disk will actually
make it to the disk under adverse circumstances any more. All sorts
of bad things can happen.
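
To put a rough number on "these large caches take time to drain", here
is a back-of-the-envelope sketch in Python. The function name and every
figure in the example (4 KiB writes, 10 ms per write) are assumptions
picked for illustration, not measurements of any particular drive. It
models a pessimistic case where every cached write costs a full
seek-bound operation; real drives coalesce and reorder writes.

```python
def drain_time_seconds(cache_bytes: int, write_size_bytes: int,
                       seconds_per_write: float) -> float:
    """Estimate how long a full write cache takes to reach the media.

    Assumes the cache is full of independent writes of write_size_bytes
    each, and that each one costs seconds_per_write to commit.
    """
    # Ceiling division: a partial trailing write still costs one operation.
    writes = -(-cache_bytes // write_size_bytes)
    return writes * seconds_per_write


# Assumed example: a 64 MiB cache full of scattered 4 KiB writes at
# 10 ms each is 16384 writes, i.e. on the order of a few minutes.
estimate = drain_time_seconds(64 * 1024 * 1024, 4096, 0.010)
```

On a drive that needs retries on bad sectors, seconds_per_write can be
much larger, which is the adverse case described above.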
Finally, the drives don't order their writes to the platter (you can
set a bit to tell them to, but like many similar bits in the past there
is no real guarantee that the drives will honor it). So if two
transactions do not have a disk flush command in between them it is
possible for data from the second transaction to commit to the platter
before all the data from the first transaction commits to the platter.
Or worse, for the non-transactional data to update out of order relative
to the transactional data which was supposed to commit first.
Hence IMHO the OS/filesystem must use the disk flush command in such
situations for good reliability.
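
The ordering problem can be shown with a toy simulation. This is only a
sketch: SimulatedDrive and count_torn_commits are hypothetical names,
and the model (each cached write independently has a 50% chance of
having reached the platter when power is lost) is an assumption made for
illustration, not a description of real drive firmware.

```python
import random


class SimulatedDrive:
    """Toy drive: acknowledged writes sit in a volatile cache until flushed."""

    def __init__(self, seed: int) -> None:
        self.cache = []      # acknowledged writes, not yet durable
        self.platter = {}    # durable contents: sector name -> value
        self.rng = random.Random(seed)

    def write(self, sector: str, value: str) -> None:
        # The 'ack' only means the data reached the drive's cache.
        self.cache.append((sector, value))

    def flush(self) -> None:
        # Flush command: everything cached is durable before we return.
        for sector, value in self.cache:
            self.platter[sector] = value
        self.cache.clear()

    def power_loss(self) -> None:
        # An arbitrary subset of cached writes made it out; the rest is lost.
        for sector, value in self.cache:
            if self.rng.random() < 0.5:
                self.platter[sector] = value
        self.cache.clear()


def count_torn_commits(use_flush: bool, trials: int = 200) -> int:
    """Count runs where the commit record is durable but its data is not."""
    torn = 0
    for seed in range(trials):
        drive = SimulatedDrive(seed)
        drive.write("data", "payload")
        if use_flush:
            drive.flush()            # barrier between data and commit record
        drive.write("commit", "tx1")
        drive.power_loss()           # power fails before any later flush
        if "commit" in drive.platter and "data" not in drive.platter:
            torn += 1
    return torn
```

With the flush in place, count_torn_commits(True) is always zero in this
model, because the data is durable before the commit record is even
issued. Without it, some runs end with a commit record pointing at data
that never reached the platter.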
--
The second problem is that a physical loss of power to the drive can
cause the drive to physically lose one or more sectors, and can even
effectively destroy the drive (even with the fancy auto-park)... if the
drive happens to be in the middle of a track write-back when power is
lost it is possible to lose far more than a single sector, including
sectors unrelated to recent filesystem operations.
The only solution to #2 is to make sure your machines (or at least the
drives if they happen to be in external enclosures) are connected to
a UPS and that the machines are communicating with the UPS via
something like the "apcupsd" port. AND also that you test to make
sure the machines properly shut themselves down when AC is lost before
the UPS itself runs out of battery time. After all, a UPS won't help
if the machines don't at least idle their drives before power is lost!!!
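
The shutdown rule being tested here can be sketched in a few lines of
Python. This is not apcupsd's logic or API; UpsStatus, should_shut_down
and the 60 second safety margin are hypothetical names and values used
only to show the decision: start shutting down while the remaining
battery runtime still covers the time the machine needs to stop cleanly.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of what a UPS reports to the host."""
    on_battery: bool
    runtime_left_s: float  # UPS estimate of remaining battery runtime


def should_shut_down(status: UpsStatus, shutdown_duration_s: float,
                     safety_margin_s: float = 60.0) -> bool:
    """Return True once a clean shutdown can no longer be postponed.

    shutdown_duration_s is how long the machine takes to sync and idle
    its drives; it should come from an actual test, not a guess.
    """
    if not status.on_battery:
        return False
    return status.runtime_left_s <= shutdown_duration_s + safety_margin_s
```

For example, with a measured shutdown time of 120 seconds, a machine on
battery with 150 seconds of runtime left should already be shutting
down, while one with 600 seconds left can keep waiting for AC to return.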
I learned this lesson the hard way about 3 years ago. I had something
like a dozen drives in two raid arrays doing heavy write activity and
lost physical power and several of the drives were totally destroyed,
with thousands of sector errors. Not just one or two... thousands.
(It is unclear how SSDs react to physical loss of power during heavy
writing activity. Theoretically, while they will certainly lose their
write cache, they shouldn't wind up with any read errors.)
-Matt