redundant zfs pool, system traps and tons of corrupted files

Thu Jun 29 13:43:12 UTC 2017

On Thu, Jun 29, 2017 at 6:04 AM, Eugene M. Zheganin <emz at norma.perm.ru> wrote:
> Hi,
>
> On 29.06.2017 16:37, Eugene M. Zheganin wrote:
>>
>> Hi.
>>
>>
>> Say I have a server that traps more and more often (different panics:
>> zfs panics, GPFs, fatal traps while in kernel mode etc), and then I realize
>> it has tons of permanent errors on all of its pools that scrub is unable
>> to heal. Does this situation mean it's a bad memory case? Unfortunately I
>> switched the hardware to an identical server before discovering that the
>> zpools have errors, so I'm not sure when they appeared. Right now I'm about
>> to run a memtest on the old hardware.
>>
>>
>> So, what do you say: does it point at memory as the root problem?

Certainly a good guess.

>>
>
> I also don't quite understand how I can have errors at the vdev level
> but 0 errors at the lower device layer (could someone please explain this):

ZFS checksums whole records at a time.  On RAIDZ, each record is
spread over multiple disks, usually the entire RAID stripe.  So when
ZFS detects a checksum error on a record stored in RAIDZ, it doesn't
know which individual disk was actually responsible.  Instead, it
blames the RAIDZ vdev.  That's why you have thousands of checksum
errors on your raidz vdevs.  The few checksum errors you have on
individual disks might have come from the labels or uberblocks, which
are not raided.
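
As a rough illustration of the paragraph above, here is a toy Python
sketch. It is not ZFS code and makes several assumptions: SHA-256 stands
in for whatever checksum the pool actually uses, the record is split
evenly across four hypothetical "disks", and parity is left out entirely.
The helper names (split_record, verify, corrupt_chunk) are made up for
this example. The point is only that a whole-record checksum gives one
yes/no answer, and that answer looks the same no matter which disk
returned bad data.

```python
import hashlib


def split_record(record, ndisks):
    """Split one record into ndisks equal-sized chunks (toy striping, no parity)."""
    size = -(-len(record) // ndisks)  # ceiling division
    return [record[i * size:(i + 1) * size] for i in range(ndisks)]


def record_checksum(chunks):
    """Checksum the whole reassembled record, not the individual chunks."""
    return hashlib.sha256(b"".join(chunks)).hexdigest()


def verify(chunks, stored):
    """Return True if the reassembled record matches the stored checksum."""
    return record_checksum(chunks) == stored


def corrupt_chunk(chunks, disk):
    """Return a copy of chunks with one bit flipped in the given disk's chunk."""
    damaged = list(chunks)
    bad = bytearray(damaged[disk])
    bad[0] ^= 0x01
    damaged[disk] = bytes(bad)
    return damaged


record = bytes(range(256)) * 4        # one 1024-byte "record"
chunks = split_record(record, 4)      # spread over 4 hypothetical disks
stored = record_checksum(chunks)      # checksum kept with the block pointer

# Whichever disk is damaged, the only signal is "record checksum failed".
results = [verify(corrupt_chunk(chunks, d), stored) for d in range(4)]
print(verify(chunks, stored), results)
```

In this model the failure can only be charged to the group of disks as a
whole, which corresponds to the error count landing on the raidz vdev
rather than on a single member disk.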

-Alan