ZFS pool permanent error question -- errors: Permanent errors have been detected in the following files: storage: <0x0>

Anders Jensen-Waud anders at jensenwaud.com
Mon Jun 16 02:49:55 UTC 2014


On Sun, Jun 15, 2014 at 05:10:52PM -0400, kpneal at pobox.com wrote:
> On Sun, Jun 15, 2014 at 03:04:16PM +1000, Anders Jensen-Waud wrote:
> > Hi all,
> > 
> > My main zfs storage pool (named ``storage'') has recently started
> > displaying a very odd error:
> > 
> > root at beastie> zpool status -v
> > 
> >   pool: backup
> >  state: ONLINE
> >   scan: none requested
> > config:
> > 
> >   NAME        STATE     READ WRITE CKSUM
> >   backup      ONLINE       0     0     0
> >     da1       ONLINE       0     0     0
> > 
> > errors: No known data errors
> > 
> >   pool: storage
> >  state: ONLINE
> > status: One or more devices has experienced an error resulting in data
> > corruption.  Applications may be affected.
> > action: Restore the file in question if possible.  Otherwise restore the
> > entire pool from backup.
> >    see: http://illumos.org/msg/ZFS-8000-8A
> >   scan: scrub in progress since Sun Jun 15 14:18:45 2014
> >         34.3G scanned out of 839G at 19.3M/s, 11h50m to go
> >         72K repaired, 4.08% done
> > config:
> > 
> >   NAME        STATE     READ WRITE CKSUM
> >   storage     ONLINE       0     0     0
> >     da0       ONLINE       0     0     0  (repairing)
> > 
> > errors: Permanent errors have been detected in the following files:
> >         storage:<0x0>
> 
> I'm not sure what causes ZFS to lose the filename like this. I'll let
> someone else comment. I want to say you have a corrupt file in a snapshot,
> but don't hold me to that.
> 
> It looks like you are running ZFS with pools consisting of a single disk.
> In cases like this, if ZFS detects that a file has been corrupted, it is
> unable to fix it. Run with the option "copies=2" to keep two copies of
> every file if you want ZFS to be able to repair broken files.
> Of course, this doubles the amount of space you will use, so you have to
> think about how important your data is to you.

Thank you for the tip. I didn't know about copies=2, so I will
definitely consider that option. 
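
For my own notes, I gather it would be enabled roughly like this,
assuming the pool is named "storage" as above (copies is a dataset
property, and it only applies to data written after it is set; existing
files would need to be rewritten to gain the extra copy):

  # Set two copies per block on the pool's root dataset; child
  # datasets inherit the setting. Affects newly written data only.
  zfs set copies=2 storage

  # Verify the setting:
  zfs get copies storage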

I am running ZFS on a single disk -- a 1 TB USB drive -- attached to my
"server" at home. It is not exactly an enterprise server, but it suits
my home purposes well, namely backing up files from my various
computers. Each night I replicate and compress the datasets from
storage to another USB drive as a second copy. In this instance the
nightly backup script (zfs send/recv based) hadn't run properly, so I
had no backup to recover from.
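
For context, the nightly step is essentially of this shape (the
snapshot naming here is simplified for illustration; the real script
differs in detail):

  # Take a recursive snapshot of the storage pool, then replicate it
  # into the backup pool without mounting the received datasets.
  snap="nightly-$(date +%Y%m%d)"
  zfs snapshot -r storage@$snap
  zfs send -R "storage@$snap" | zfs receive -Fdu backup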

Given that my machine has only 3 GB of RAM, I was wondering if the
issue might be memory-related and whether I am better off converting
the volume back to UFS. I am keen to stay on ZFS to benefit from
snapshots, compression, security, etc. Any thoughts?
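
(I understand the ARC can be capped if plain memory pressure, rather
than bad RAM, turns out to be the concern -- e.g. a line like the one
below in /boot/loader.conf. The 1G value is purely illustrative, not a
tested recommendation.)

  # /boot/loader.conf -- cap the ZFS ARC so the rest of the system
  # keeps some headroom on a 3 GB machine (illustrative value):
  vfs.zfs.arc_max="1G"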


> 
> I don't know what caused the corrupt file. It could be random chance, or
> it could be that you accidentally did something to damage the pool. I say
> that because:
> 
> > da1 at umass-sim1 bus 1 scbus4 target 0 lun 0
> > da1: <Seagate FreeAgent Go 102D> Fixed Direct Access SCSI-4 device
> > da1: Serial Number 2GE1GTVM
> > da1: 40.000MB/s transfers
> > da1: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
> > da1: quirks=0x2<NO_6_BYTE>
> > GEOM: da1: the primary GPT table is corrupt or invalid.
> > GEOM: da1: using the secondary instead -- recovery strongly advised.
> > GEOM: diskid/DISK-2GE1GTVM: the primary GPT table is corrupt or invalid.
> > GEOM: diskid/DISK-2GE1GTVM: using the secondary instead -- recovery
> > strongly advised.
> 
> You've got something going on here. Did you GPT partition the disk? The
> zpool status you posted says you built your pools on the entire disk and
> not inside a partition. But GEOM is saying the disk has been partitioned.
> GPT stores data at both the beginning and end of the disk. ZFS may have
> trashed the beginning of the disk but not gotten to the end yet.

This disk is not the ``storage'' zpool -- it is my ``backup'' pool,
which is on a different drive: 

NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
backup    464G   235G   229G    50%  1.00x  ONLINE  -
storage   928G   841G  87.1G    90%  1.00x  ONLINE  -

Running 'gpt recover /dev/da1' fixes the error above, but after a
reboot it reappears. Would it be better to completely wipe the disk
and reinitialise it with ZFS?
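
If I do wipe it, I assume something along these lines would do it
(this destroys everything on da1, so the backup pool's contents would
have to be expendable or copied off first):

  # Destroy the pool, drop both GPT tables, and recreate the pool
  # on the whole disk so no stale partition data remains.
  zpool destroy backup
  gpart destroy -F da1
  zpool create backup da1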

Miraculously, an overnight 'zpool scrub storage' has cleared the errors
from yesterday, and I am puzzled as to why. As per the original zpool
status above, ZFS warned that I needed to restore the affected files
from backup:

aj at beastie> zpool status
  pool: backup
 state: ONLINE
  scan: none requested
config:

  NAME        STATE     READ WRITE CKSUM
  backup      ONLINE       0     0     0
    da1       ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: ONLINE
  scan: scrub repaired 984K in 11h37m with 0 errors on Mon Jun 16 01:55:48 2014
config:

  NAME        STATE     READ WRITE CKSUM
  storage     ONLINE       0     0     0
    da0       ONLINE       0     0     0

errors: No known data errors

> Running ZFS in a partition or on the entire disk is fine either way. But
> you have to be consistent. Partitioning a disk and then writing outside
> of the partition creates errors like the above GEOM one.

Agreed. In this instance it wasn't da0/storage, however.
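
For completeness, the consistent partitioned layout you describe would
presumably look something like this for da1 (the "backup0" label is
arbitrary):

  # Give the disk a proper GPT, carve out a freebsd-zfs partition,
  # and build the pool inside it so GEOM and ZFS agree on the layout.
  gpart create -s gpt da1
  gpart add -t freebsd-zfs -l backup0 da1
  zpool create backup gpt/backup0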

> -- 
> Kevin P. Neal                                http://www.pobox.com/~kpn/
> "Not even the dumbest terrorist would choose an encryption program that
>  allowed the U.S. government to hold the key." -- (Fortune magazine
>     is smarter than the US government, Oct 29 2001, page 196.)

-- 
Anders Jensen-Waud
E: anders at jensenwaud.com

