ZFS raidz2, errors in file?

Thu Oct 18 05:20:21 UTC 2012

On 10/17/2012 12:39 PM, Heikki Suonsivu wrote:
> SMART data indicates problems on two other disks, but no indication of
> those are seen in logs (the disks work, but SMART information
> indicates problems).

The problems may be in areas ZFS has not tried to read.

> One disk indeed has pending sector, not unusual and should be survivable:
>
> ------------------------------------------------------------------------
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age  Always       -       1
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age  Offline      -       1

That error means one sector is unreadable and a replacement is pending;
the replacement will happen the next time the sector is overwritten.  The
contents of that sector are lost (unless some future read succeeds).

> In addition, there seems to be ICRC DMA errors on da0.  Looks nasty,
> but only show up in SMART log, not in /var/log/messages.
>
> ------------------------------------------------------------------------
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age  Always       -       112

I believe that both of these messages refer to errors in transfers
between the disk and host, not to errors within the disk.  Test your
cabling and enclosures.
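
To make the distinction concrete, here is a minimal Python sketch that
sorts attribute rows into "media" problems (within the disk) and "link"
problems (between disk and host).  The function names and the grouping
of IDs are my own illustration, not part of smartmontools; the sample
rows are copied from the output quoted above and may come from
different disks.

```python
# Hypothetical helper, not part of smartmontools: groups nonzero SMART
# attributes by where the fault most likely lies.
MEDIA_IDS = {197, 198}  # pending sector, offline uncorrectable
LINK_IDS = {199}        # UDMA CRC errors (disk <-> host transfers)

def parse_attributes(text):
    """Return {id: (name, raw_value)} from attribute-table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least 10 columns;
        # the header row is skipped because "ID#" is not numeric.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs

def classify(attrs):
    """Split nonzero attributes into media-side and link-side lists."""
    report = {"media": [], "link": []}
    for attr_id, (name, raw) in sorted(attrs.items()):
        if raw == 0:
            continue
        if attr_id in MEDIA_IDS:
            report["media"].append((name, raw))
        elif attr_id in LINK_IDS:
            report["link"].append((name, raw))
    return report

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age  Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age  Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age  Always       -       112
"""

if __name__ == "__main__":
    print(classify(parse_attributes(SAMPLE)))
```

Anything that lands in the "link" list points at cables, backplane or
controller rather than at the platters.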

> SMART Error Log Version: 1
> ATA Error Count: 112 (device log contains only the most recent five
> errors)

I don't like these at all.  Consider replacing that disk.

> If the da0 ICRC errors would have been seen by ZFS, it should have
> made a) note of that in some log?  b) retried write?  c) Something
> else?  If we assume that the disk firmware is broken and does not
> report these to OS, so da0 might be corrupt.  But that should still be
> ok with raidz2.

These errors should trigger retries in layers beneath ZFS.

> We do have 3 random SCSI timeouts, which were seen by FreeBSD, and
> thus should have prompted ZFS do handle the errors, and one read error
> on data, which is not reported as read error in any log, other than
> disk's SMART info says so.

The retries may have happened at a layer below ZFS.

ZFS does not call the disk driver directly.  Other layers play a role in
error handling.
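
As a toy illustration of why an error can show up in the disk's own
counters and yet never reach ZFS, consider the Python sketch below.  It
is not how FreeBSD or ZFS is implemented; the class, the function names
and the retry count are all made up for the example.

```python
class Disk:
    """Toy device that fails the first `failures` reads, then succeeds."""
    def __init__(self, failures):
        self.failures = failures
        self.errors_seen = 0  # what the disk's own log would count

    def read(self):
        if self.failures > 0:
            self.failures -= 1
            self.errors_seen += 1
            raise IOError("transient transfer error")
        return b"data"

def lower_layer_read(disk, retries=3):
    """Stand-in for a driver layer: retry before reporting failure."""
    last = None
    for _ in range(retries + 1):
        try:
            return disk.read()
        except IOError as exc:
            last = exc
    raise last

def upper_layer_read(disk):
    """Stand-in for the filesystem: sees only errors that reach it."""
    try:
        return lower_layer_read(disk), 0
    except IOError:
        return None, 1

if __name__ == "__main__":
    disk = Disk(failures=2)
    data, errors_reported = upper_layer_read(disk)
    print(data, errors_reported, disk.errors_seen)
```

With two transient failures, the disk counts two errors but the upper
layer sees none, because the retries succeeded before anything was
passed up.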