ZFS I/O errors
Olaf Seibert
O.Seibert@cs.ru.nl
Tue May 31 09:26:02 UTC 2011
On Mon 30 May 2011 at 12:19:10 -0500, Dan Nelson wrote:
> The ZFS compression code will panic if it can't allocate the buffer needed
> to store the compressed data, so that's unlikely to be your problem. The
> only time I have seen an "illegal byte sequence" error was when trying to
> copy raw disk images containing ZFS pools to different disks, and the
> destination disk was a different size than the original. I wasn't even able
> to import the pool in that case, though.
Yet somehow some incorrect data got written, it seems. That never
happened before, fortunately, even though we had crashes before that
seemed to be related to ZFS running out of memory.
> The zfs IO code overloads the EILSEQ error code and uses it as a "checksum
> error" code. Returning that error for the same block on all disks is
> definitely weird. Could you have run a partitioning tool, or some other
> program that would have done direct writes to all of your component disks?
I hope I would remember doing that if I did!
> Your scrub is also a bit worrying - 24k checksum errors definitely shouldn't
> occur during normal usage.
It turns out that the errors are easy to provoke: they happen every time
I do an ls of the affected directories. There were processes running
that were likely trying to write to the same directories (the file
system is exported over NFS), so it is easy to imagine the numbers
racking up quickly.
I moved those directories to the side for the moment, but I haven't
been able to delete them yet. The data is a bit bigger than we're able
to back up, so "just restoring a backup" isn't an easy thing to do.
Possibly I could make a new filesystem in the same pool, if that would
do the trick; it isn't more than 50% full but the affected one is the
biggest filesystem in it.
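If it comes to that, the migration could look roughly like the sketch below. The target name and rsync flags are illustrative assumptions, not something I've run on this pool:

```sh
# Sketch only -- assumes the damaged filesystem is tank/vol-fourquid-1
# and the pool has room for a second copy of the surviving data.
zfs create tank/vol-fourquid-2

# Copy what is still readable; read errors on the bad directories are
# logged rather than aborting the whole run.
rsync -aH /tank/vol-fourquid-1/ /tank/vol-fourquid-2/ 2>copy-errors.log

# Once the copy is verified, the old filesystem (and its snapshots,
# which also reference the bad blocks) can be destroyed.
zfs destroy -r tank/vol-fourquid-1
```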
The end result of the scrub is as follows:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 12h56m with 3 errors on Mon May 30 23:56:47 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0 6.38K
          raidz2    ONLINE       0     0 25.4K
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        tank/vol-fourquid-1:<0x0>
        tank/vol-fourquid-1@saturday:<0x0>
        /tank/vol-fourquid-1/.zfs/snapshot/saturday/backups/dumps/dump_usr_friday.dump
        /tank/vol-fourquid-1/.zfs/snapshot/saturday/sverberne/CLEF-IP11/parts_abs+desc
        /tank/vol-fourquid-1/.zfs/snapshot/sunday/sverberne/CLEF-IP11/parts_abs+desc
        /tank/vol-fourquid-1/.zfs/snapshot/monday/sverberne/CLEF-IP11/parts_abs+desc
-Olaf.
--
Pipe rene = new PipePicture(); assert(Not rene.GetType().Equals(Pipe));