When a ZFS error is not an error.

Mon Nov 24 01:00:52 UTC 2014

So... I recovered a second file, this time a rar file.  It fails the
checksum, but I'm unsure whether I'm extracting the file correctly.  The
array consists of two vdevs: one of 9 disks and the other of 8 disks, both
raidz1.  I'm pretty sure that the 9 drive vdev is '0' in this output, as the
9 drive vdev is listed first in the zpool status output (and was created
first).

Anyway, here are two lines from the verbose zdb output:

          100000   L0 1:720cfa33400:24c00 20000L/20000P F=1 B=11756828567/11756828567
          120000   L0 0:94048dc9000:24000 20000L/20000P F=1 B=11756828567/11756828567

... What I'm not clear on is why the 8 drive vdev is writing 24c00 bytes
and the 9 drive vdev is writing 24000 bytes.
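
One possible explanation, offered as a sketch rather than a definitive
answer: the third DVA field is the allocated size, which includes raidz
parity and padding on top of the 20000 (hex) bytes of data.  The arithmetic
below follows the raidz allocation-size calculation as I understand it, and
it assumes ashift=9 (512-byte sectors), which this post does not state.  The
function name raidz_asize is made up for illustration.

```shell
# Sketch of the raidz allocated-size arithmetic.
# Assumption: ashift=9 (512-byte sectors); the pool's real ashift is not given.
raidz_asize() {
    psize=$1 ndisks=$2 nparity=$3 ashift=$4
    # Number of data sectors, rounded up.
    sectors=$(( ((psize - 1) >> ashift) + 1 ))
    # Add nparity parity sectors per row of (ndisks - nparity) data sectors.
    sectors=$(( sectors + nparity * ((sectors + ndisks - nparity - 1) / (ndisks - nparity)) ))
    # Pad the total up to a multiple of (nparity + 1) sectors.
    m=$(( nparity + 1 ))
    sectors=$(( ((sectors + m - 1) / m) * m ))
    # Print the allocated size in hex bytes, as zdb does.
    printf '%x\n' $(( sectors << ashift ))
}

raidz_asize $((0x20000)) 9 1 9   # 9 drive raidz1: prints 24000
raidz_asize $((0x20000)) 8 1 9   # 8 drive raidz1: prints 24c00
```

Under that assumption the numbers match the zdb output: 256 data sectors
need 32 parity sectors across 8 data disks (288 sectors in total) but 37
parity sectors plus one padding sector across 7 data disks (294 sectors).
That would also be consistent with vdev 0 being the 9 drive vdev.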

And ... in either case, am I to fetch the first 20000 (hex) bytes?  i.e.:
zdb -R vr2 0:94048dc9000:20000 and zdb -R vr2 1:720cfa33400:20000  ?

If the filesystem reports checksum errors for these blocks, am I getting
the right data when I read them with zdb?  Do I need to process the
parity?

On Sat, Nov 22, 2014 at 6:56 PM, Zaphod Beeblebrox <zbeeble at gmail.com>
wrote:

> I have a file that ZFS claims is in error but that, when I go through all
> the effort to retrieve it, is not in error.  ZFS says 405 files are in
> error on this array, and since some are rather large and retrieving one
> block seems to take 30 seconds (i.e. hundreds of hours to recover some
> files), I'd like to ask if there's some way to finesse this... or to fix
> ZFS.
>
> To start, my array has errors like:
>
>         NAME               STATE     READ WRITE CKSUM
>         vr2                ONLINE       0     0   989
>           raidz1-0         ONLINE       0     0 1.93K
>             label/vr2-d0   ONLINE       0     0     0
>
> (I've omitted the other lines ... they are all '0').  I asked what this
> meant ... and the best I got was that the errors were not assigned to any
> particular device.  So I learned how to use zdb, and I have a patch for
> zdb.  Apparently the deadlist can have a null in it that crashes zdb.
>
> No matter.  We have this file in the output of zpool status -v:
>
> vr2/Audio@20080305-1450:/cds/service/02-Lord_Have_Mercy_Kyrie.mp3
>
> ... now even though it picks on the snapshot (not all of the -v reports
> do), the following fails:
>
> [1:170:470]root@virtual:/vr1/tmp/diag> cp
> /vr2/Audio/cds/service/02-Lord_Have_Mercy_Kyrie.mp3 .
> cp: foo.mp3: Bad address
>
> So I did this:
>
> count=0
> # Columns 22-34 of each L0 line hold the vdev:offset part of the DVA.
> for i in `grep L0 4351-dddddddd.txt | grep -v vr2/Audio | head -50 |
>     cut -c22-34`; do
>   cc=`printf %05d $count`
>   echo getting $i 4035/b$cc
>   time zdb -R vr2 $i:20000:r >4035/b$cc &
>   count=$((count+1))
> done
>
> --- basically, 4351-dddddddd.txt is the output of zdb for that file (see
> http://pastebin.com/tdqEJKJB) and the little script calls zdb to get the
> first 20000 (hex) of each block because the remaining 4000 is the parity (9
> disk array).
>
> Then I cat it into one file, then I truncate it to the specified length
> ....
>
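
The concatenate-and-truncate step described above could look roughly like
the following.  This is only a sketch: the helper name reassemble and its
arguments are hypothetical, and the byte count has to come from the file's
real length (for example, the size shown in the zdb output for the object).

```shell
# Hypothetical helper: join the block dumps in name order, then cut the
# result down to the file's real length.
reassemble() {
    dir=$1 size=$2 out=$3
    # b00000, b00001, ... glob in the right order because the counter
    # in the file names is zero-padded.
    cat "$dir"/b* > "$out.tmp" &&
    # Keep only the first $size bytes; the tail of the last block is padding.
    head -c "$size" "$out.tmp" > "$out" &&
    rm -f "$out.tmp"
}

# Example call (the size here is a placeholder, not the real file length):
#   reassemble 4035 1234567 recovered.mp3
```
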
> and lo and behold: The file is sound.
>
> So what's ZFS on about not wanting to read this file?  Help?
>