ZFS errors on the array but not the disk.

Zaphod Beeblebrox zbeeble at gmail.com
Sat Oct 25 03:47:44 UTC 2014


Thanks for the heads up.  I'm following releng/10.1, and r271683 seems to be
part of that already, but it was a good catch/guess.
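
For reference, a quick way to double-check that a running system includes
that MFC (a rough sketch only, assuming /usr/src is a subversion checkout
and the kernel was built from it):

    # revision of the checked-out source tree
    svnlite info /usr/src | grep Revision
    # the kernel ident string also embeds the build revision (rNNNNNN)
    uname -a

Anything at or after r271683 on stable/10, or a releng/10.1 build cut after
that MFC, should carry the fix.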

On Fri, Oct 24, 2014 at 11:02 PM, Steven Hartland <smh at freebsd.org> wrote:

> There was an issue which would cause resilver restarts, fixed by r265253
> <https://svnweb.freebsd.org/base?view=revision&revision=265253>, which was
> MFC'ed to stable/10 by r271683
> <https://svnweb.freebsd.org/base?view=revision&revision=271683>, so you'll
> want to make sure you're later than that.
>
>
> On 24/10/2014 19:42, Zaphod Beeblebrox wrote:
>
>> I manually replaced a disk... and the array was scrubbed recently.
>> Interestingly, I seem to be in the "endless loop" of resilvering problem.
>> I can't find much on it, but the resilver will complete and I can then
>> run another scrub, which also completes.  Then rebooting kicks off yet
>> another resilver.
>>
>> Another odd data point: the things that show up as "errors" seem to
>> change from one resilver to the next.
>>
>> One bug, it would seem, is that once ZFS has detected an error, another
>> scrub can reset it, but no attempt is made to read through the error if
>> you access the object directly.
>>
>> On Fri, Oct 24, 2014 at 11:33 AM, Alan Somers <asomers at freebsd.org>
>> wrote:
>>
>>  On Thu, Oct 23, 2014 at 11:37 PM, Zaphod Beeblebrox <zbeeble at gmail.com>
>>> wrote:
>>>
>>>> What does it mean when checksum errors appear on the array (and the
>>>> vdev) but not on any of the disks?  See the paste below.  One would
>>>> think there isn't some ephemeral data stored somewhere other than on
>>>> the disks, yet the "cksum" errors show up only on the vdev and the
>>>> array lines.
>>>>
>>>> Help?
>>>
>>>> [2:17:316]root at virtual:/vr2/torrent/in> zpool status
>>>>    pool: vr2
>>>>   state: ONLINE
>>>> status: One or more devices is currently being resilvered.  The pool will
>>>>          continue to function, possibly in a degraded state.
>>>> action: Wait for the resilver to complete.
>>>>    scan: resilver in progress since Thu Oct 23 23:11:29 2014
>>>>          1.53T scanned out of 22.6T at 62.4M/s, 98h23m to go
>>>>          119G resilvered, 6.79% done
>>>> config:
>>>>
>>>>          NAME               STATE     READ WRITE CKSUM
>>>>          vr2                ONLINE       0     0    36
>>>>            raidz1-0         ONLINE       0     0    72
>>>>              label/vr2-d0   ONLINE       0     0     0
>>>>              label/vr2-d1   ONLINE       0     0     0
>>>>              gpt/vr2-d2c    ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
>>>>              gpt/vr2-d3b    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              gpt/vr2-d4a    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              ada14          ONLINE       0     0     0
>>>>              label/vr2-d6   ONLINE       0     0     0
>>>>              label/vr2-d7c  ONLINE       0     0     0
>>>>              label/vr2-d8   ONLINE       0     0     0
>>>>            raidz1-1         ONLINE       0     0     0
>>>>              gpt/vr2-e0     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              gpt/vr2-e1     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              gpt/vr2-e2     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              gpt/vr2-e3     ONLINE       0     0     0
>>>>              gpt/vr2-e4     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              gpt/vr2-e5     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              gpt/vr2-e6     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>              gpt/vr2-e7     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>
>>>> errors: 43 data errors, use '-v' for a list
>>>>
>>> The checksum errors will appear on the raidz vdev instead of a leaf if
>>> vdev_raidz.c can't determine which leaf vdev was responsible.  This
>>> could happen if two or more leaf vdevs return bad data for the same
>>> block, which would also lead to unrecoverable data errors.  I see that
>>> you have some unrecoverable data errors, so maybe that's what happened
>>> to you.
>>>
>>> Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable
>>> to determine which child was responsible for a checksum error.
>>> However, I've only seen that happen when a raidz vdev has a mirror
>>> child.  That can only happen if the child is a spare or replacing
>>> vdev.  Did you activate any spares, or did you manually replace a
>>> vdev?
>>>
>>> -Alan
>>>
>>
>>
>>
>
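
As a follow-up for anyone who finds this thread later: the sequence I'd try
for the permanent data errors is roughly the following (a sketch only; the
file list comes from the -v output, and clearing is no substitute for
finding the underlying cause):

    # list the files affected by the 43 permanent errors
    zpool status -v vr2
    # after restoring or deleting the affected files,
    # clear the error counters and verify with a fresh scrub
    zpool clear vr2
    zpool scrub vr2

The permanent-error list normally only drains once the damaged blocks are no
longer referenced and a subsequent scrub completes, so it may take a scrub
or two before the count goes to zero.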

