Errors on a file on a zpool: How to remove?

Wes Morgan morganw at chemikals.org
Sun Jan 24 13:25:16 UTC 2010


On Sun, 24 Jan 2010, jhell wrote:

>
> On Sun, 24 Jan 2010 00:28, rincebrain@ wrote:
> > On Sun, Jan 24, 2010 at 12:15 AM, jhell <jhell at dataix.net> wrote:
> > > From what I see, and what was already mentioned earlier in this
> > > thread, this is metadata corruption, but the checksum errors do not
> > > span the whole pool of vdevs. These are, correct me if I am wrong,
> > > USB mass storage devices? SSDs?
> >
> > 1.5T Seagate 7200RPM drives.
> >
> > > In the arrangement of devices on the system, are da2,4,5 on the same
> > > hub and da6,7 on another? If so, you may have narrowed your errors
> > > down to a USB problem, localized to where the drives are connected.
> >
> > ...no.
> >
> > All five are on the same SATA controller. These behaviors persist
> > independent of which SATA controller they are plugged into, and I've
> > tried all seven in the machine.
> >
> > > What happened to da1,3? Were these once connected to the system? If
> > > so, did you start noticing this problem at roughly the same time they
> > > were removed?
> >
> > da1,3 are being used in another disk pool, and were never a part of this
> > pool.
> >
> > This is not an issue of a faulty SATA controller or SATA drives.
> >
> > This is an issue of "there was a single faulty stick of RAM in the machine".
> >
>
> Yeah, I read this earlier. My apologies, it slipped while I was writing;
> my "mind went into multi-write, single-read mode".
>
> > I have sixteen disks in this machine. These three are having issues
> > only on these particular files, not on random portions of the disk.
> > The disks never report read errors - the ZFS layer is what reports
> > them. SMART is not reporting any difficulty reading any sectors on
> > these disks.
> >
> >
> > I could be mistaken, but I do not believe there to be a faulty
> > controller in play at this time. I've rotated the drives among the
> > spares of the 24 ports on the SATA controller in question, as well as
> > the on-motherboard controller, and this behavior has persisted.
> >
> > - Rich
> >
>
> As I was thinking earlier... you mentioned you scrubbed multiple times
> with no difference. When I mentioned attempting a remove/replace, I was
> thinking this would cause a "resilvering" of the drives, possibly fixing
> metadata for the affected disks if good metadata still exists somewhere.
>
> Might be worth a shot, but I would start by replacing the devices that
> are showing the errors, until you can clear the errors without them
> showing up again, or until you have replaced all the disks.

This is a non-redundant pool. The remove command will not work. Replace
will, but for that pool to function at all, *every* device must be
present. If the metadata were recoverable, I think the scrub would have
reported "xxx kb repaired".

From http://dlc.sun.com/osol/docs/content/ZFSADMIN/gbbwl.html:

    If the object number to a file path cannot be successfully translated,
    either due to an error or because the object doesn't have a real file
    path associated with it, as is the case for a dnode_t, then the
    dataset name followed by the object's number is displayed. For
    example:

     monkey/dnode:<0x0>

Which seems to be precisely your error. Continuing:

    Then, try removing the file with the rm command. If this command
    doesn't work, the corruption is within the file's metadata, and ZFS
    cannot determine which blocks belong to the file in order to remove
    the corruption.

    If the corruption is within a directory or a file's metadata, the only
    choice is to move the file elsewhere. You can safely move any file or
    directory to a less convenient location, allowing the original object
    to be restored in place.
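
As a sketch of the procedure the guide describes (the pool name and file
paths here are purely illustrative):

    # First, try deleting the damaged file outright.
    rm /tank/data/damaged-file

    # If rm fails because the corruption is in the file's metadata, move
    # the object aside so a clean copy can be restored in its place.
    mv /tank/data/damaged-dir /tank/damaged-dir.quarantine

    # Re-scrub and confirm the list of permanent errors shrinks.
    zpool scrub tank
    zpool status -v tank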

In other words, either move the files out of the way or restore the pool.
I'd wager that any other filesystem would have simply wiped out entire
directory trees or possibly just panicked with this kind of corruption.
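
If restoring the pool is the route taken, the blunt version is sketched
below; the device list mirrors the five drives discussed in the thread,
while the pool name and the backup stream are assumptions:

    # Recreate the pool from scratch and restore from a backup stream
    # previously produced by zfs send.
    zpool destroy tank
    zpool create tank da2 da4 da5 da6 da7
    zfs receive -d tank < /backup/tank-backup.zfs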

