ZFS i/o errors - which disk is the problem?

Bernd Walter ticso at cicely12.cicely.de
Mon Jan 7 11:22:05 PST 2008


On Tue, Jan 08, 2008 at 01:17:03AM +0800, Tz-Huan Huang wrote:
> 2008/1/7, Bernd Walter <ticso at cicely12.cicely.de>:
> > The data is corrupted by controller and/or disk subsystem.
> > You have no other data sources for the broken data, so it is lost.
> > The only guaranteed way is to get it back from backup.
> > Maybe older snapshots/clones are still readable - I don't know.
> > Nevertheless the data is corrupted, and guarding against that is the
> > purpose of alternative data sources such as raidz/mirror and, as a
> > last resort, backup.
> > You shouldn't have ignored those errors at first, because you are
> > running with faulty hardware.
> > Without ZFS checksumming the system would just process the broken
> > data with unpredictable results.
> > If all those errors are fresh then you likely used a broken RAID
> > controller below ZFS, which silently corrupted the synchronisation
> > and then blew up when the disk state changed.
> > Unfortunately many RAID controllers are broken and therefore useless.
> 
> Hi,
> 
> Thank you very much for your answer.
> 
> We have run the self-test on all RAID controllers and they all reported OK.
> Do you mean that many RAID controllers are broken (buggy?) even if the
> self-test passes? If all the disks are passed through to ZFS, is that
> a safe way to use the buggy controllers?

If the controller were so buggy that even its own self-test failed,
it would be even worse.
But a self-test can't tell whether the controller has corrupted data;
it can only check the current state and the synchronisation.
It could do massive read/write tests, but that would mean overwriting
the current data.
If you export single disks, ZFS can handle this using its own redundancy:
when it encounters an error, it can use the other disks
to recover the data.
In your case ZFS doesn't know about any redundancy and your controller
returns faulty data, so there is no attempt to recover.
Your RAID controller can't help either, because it isn't aware of its
own mess: it is not using CRC itself, and even if it were, the data
could still get corrupted in transit to or from the host, or by a
driver problem.
But relying on ZFS is not safe either, just a bit less critical.
The only safe option is not to use a buggy controller at all.
The good thing about ZFS checksumming is that you are aware of the
problem even in the case of corrupted file data.
In your case, unfortunately, it seems too much has been broken.
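
To make the difference concrete, here is a minimal Python sketch of the
idea, not ZFS's actual implementation. It assumes CRC32 as a stand-in
checksum, and the names read_verified and ChecksumError are made up for
this illustration. With two copies (a mirror) a bad read can be repaired
from the good copy; with a single device the corruption is detected but
cannot be recovered.

```python
import zlib


class ChecksumError(Exception):
    """Raised when no copy of a block matches its stored checksum."""


def checksum(data):
    # CRC32 is only a stand-in here; it is an assumption of this sketch.
    return zlib.crc32(bytes(data))


def read_verified(copies, expected):
    """Return the block contents, repairing bad copies from a good one.

    `copies` is a list of bytearrays, one per (hypothetical) disk.
    `expected` is the checksum recorded when the block was written.
    """
    good = None
    for copy in copies:
        if checksum(copy) == expected:
            good = bytes(copy)
            break
    if good is None:
        # Corruption is detected, but there is nothing to recover from.
        raise ChecksumError("all copies are corrupt; restore from backup")
    # Self-heal: overwrite every copy that failed verification.
    for copy in copies:
        if checksum(copy) != expected:
            copy[:] = good
    return good


payload = b"important file data"
expected = checksum(payload)

# Mirror: two disks hold the same block, one is silently corrupted.
mirror = [bytearray(payload), bytearray(payload)]
mirror[0][0] ^= 0xFF
assert read_verified(mirror, expected) == payload
assert bytes(mirror[0]) == payload  # the bad copy was rewritten

# Single device, e.g. one volume exported by a hardware RAID controller.
single = [bytearray(payload)]
single[0][0] ^= 0xFF
try:
    read_verified(single, expected)
except ChecksumError as err:
    print("unrecoverable:", err)
```

The second case is the situation described above: the checksum tells you
the data is bad, but with only one source there is no copy to repair it
from.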

-- 
B.Walter                http://www.bwct.de      http://www.fizon.de
bernd at bwct.de           info at bwct.de            support at fizon.de


More information about the freebsd-fs mailing list