Is ZFS production ready?

Ross Cameron ross.cameron at linuxpro.co.za
Sun Jun 24 23:41:21 UTC 2012


On Thu, Jun 21, 2012 at 4:44 PM, Wojciech Puchar
<wojtek at wojtek.tensor.gdynia.pl> wrote:

>> One interesting feature of ZFS is its block checksum: all reads and
>> writes include a block checksum, so it can easily detect situations
>> where, for example, data is quietly corrupted by RAM.
>>
>
> You may be shocked, but you are sometimes wrong. I already demonstrated
> this: checksumming doesn't catch the errors, and ZFS does write wrong
> data with right checksums :)
>
> It's quite easy to explain if one understands the hardware details.
>
> Checksumming will protect you from:
>
> - a failed SATA/SAS port or on-disk controller that returns bad data as
> good. This is actually a really rare case; I have never seen it, but
> maybe it happens.
>
> - some types of DRAM failure - but not all. Actually just a small
> fraction, because a DRAM failure like that would crash your system so
> quickly that you are unlikely to get big data corruption.
>
> The common case with DRAM is that after you write to it, it keeps the
> right data for some time and RARELY flips a bit later, in spite of
> refresh.
>
> With this type of failure you may run your machine for hours, even days
> or longer. And ZFS will calculate a proper checksum of the wrong data and
> write it to disk.
>
>
> This is the reason I keep a few failed DIMMs - for testing how different
> software behaves on a broken machine.
>
> UFS resulted in a few corrupted files after half a day of heavy work and
> 4 crashes. fsck always recovered things well (with, of course,
> "unexpected softupdate inconsistency...").
>
> ZFS survived 2 crashes. After the third, it panicked on startup.
>
> Of course - there is no zfs_fsck.
> And no possibility of making a really good zfs_fsck because of the data
> layout; at least, not an easy one.
>
>
>
>> This feature is very important for databases.
>
> Is data integrity not important for the rest? :)
>
> Still, disks themselves perform quite heavy ECC, and so do the SATA and
> SAS links.
>

While I don't dispute your test's findings, I would like to point out that
you are SPECIFICALLY testing for something that the original designers of
ZFS (Sun, now Oracle) point out VERY clearly as an issue that you should
avoid in your deployed environments.

The filesystem is designed to protect the ON DISK data and, being a highly
memory-intensive filesystem, should ALWAYS be deployed on hardware with
memory error correction built in (i.e. ECC RAM deployed across multiple
banks).

The filesystem comes from a hardware/OS environment that is HEAVILY BIASED
towards "self healing", as they put it, and as a result things like memory
module issues would:
    1) Either be corrected by the ECC modules, or
    2) Be reported to the administrator of said system as soon as they
occur (well, on a system where you have such reporting set up correctly).

As a result your argument is moot... whilst your findings are indeed
still valid.

UFS2, being MUCH lighter on RAM requirements, is, well frankly, quite
possibly not even interacting with the damaged sections of the memory
modules in your test, and I am almost certain that if we were to ask
around on this mailing list, plenty of examples of UFS/UFS2 corruption due
to faulty RAM would turn up.
    No filesystem (or other code, for that matter) would be able to detect
RAM content corruption (as this is NOT a filesystem's job) and correct it
for you, as frankly the kernel wouldn't know whether the data in the
buffers is correct without the application that stores said data being
coded to check for these conditions. (I know of a patch to the Linux
kernel that does indeed look for faulty RAM segments and work around them,
but I am *mostly* positive that no general-purpose OS in current
deployment does so, as I have noticed that this behavior is VERY CPU
intensive.)
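
To make that concrete, here is a minimal Python sketch (my own
illustration, not ZFS code; SHA-256 simply stands in for the block
checksum) of why a checksum cannot catch a bit flip that happens in RAM
BEFORE the checksum is computed:

    import hashlib

    def write_block(data):
        # the checksum is computed over whatever is in RAM at write time
        return data, hashlib.sha256(data).digest()

    def verify_block(data, checksum):
        return hashlib.sha256(data).digest() == checksum

    block = bytearray(b"important payload")
    block[3] ^= 0x10                  # simulated DRAM bit flip BEFORE write
    data, cksum = write_block(bytes(block))
    print(verify_block(data, cksum))  # True: wrong data, "right" checksum

The corrupted buffer verifies cleanly, exactly as Wojciech observed: the
checksum only protects the data from the moment it is computed onwards.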

Also (debate encouraged here), due to the COW nature of ZFS a zfs_fsck
command is basically unnecessary, because:
    1) The last successfully completed write to the file will be intact,
and
    2) Scrubbing the on-disk content performs much better filesystem
maintenance than an fsck does, and it can be done online, without
impacting the uptime of your systems or data availability.
    On my systems I specifically trigger a scrub (via the ZFS init script)
whenever my systems are uncleanly shut down, as I am willing to tolerate a
slightly slower but available system in such conditions.
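
For the curious, the logic amounts to something like this Python sketch
(the marker path and pool name are hypothetical placeholders; my actual
init script is shell, and "zpool scrub" is the real command it runs):

    import os
    import subprocess

    MARKER = "/var/run/clean_shutdown"  # hypothetical: created on clean halt
    POOL = "tank"                       # placeholder pool name

    if not os.path.exists(MARKER):
        # unclean shutdown: start an online scrub; the pool stays imported
        # and usable (if a bit slower) while the scrub runs
        subprocess.check_call(["zpool", "scrub", POOL])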

While UFS2 is indeed a wonderfully reliable filesystem, it (as with all
things) is not suited to all tasks; there are many instances where I can
see the features of ZFS far outweighing the drawbacks (as I can also see
for the converse state of affairs).

While all of the above is based purely on my understanding of ZFS (and I
am one of the people working on a port to GNU/Linux - admittedly not
directly, but I spend a LOT of my time reading and cleaning up the code
fork that I do use) and Sun's (now Oracle's) design/deployment documents,
it is still my opinion, and I would encourage debate on these opinions.

