weird bug with ZFS and SLOG

Peter Maloney peter.maloney at brockmann-consult.de
Tue Dec 6 12:19:42 UTC 2011


On 12/05/2011 11:07 PM, Adam Stylinski wrote:
> The worst case scenario happened to me: my dedicated SLOG dropped off the controller and prevented me from importing my pool.  I quickly upgraded to FreeBSD 9.0-RC2 after testing this scenario in a VM.  It worked successfully in the VM, but it is not working on my hardware for whatever reason.  I rolled back the pool with zpool import -F share; it seems ok, the files are there, the scrub finishes, and there is very little corruption.  I upgraded the pool to v28 and the filesystems to v5.  I then do a:
> 	 zpool remove share 15752248745115926170
>
> 	It returns no errors and acts as if the operation worked; it even appends the command to my zpool history.  However, when I do a zpool status, this is what I get:
>
> [adam at nasbox ~]$ zpool status
>   pool: share
>  state: DEGRADED
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scan: scrub repaired 0 in 8h57m with 0 errors on Mon Dec  5 12:48:28 2011
> config:
>
>         NAME                    STATE     READ WRITE CKSUM
>         share                   DEGRADED     0     0     0
>           raidz1-0              ONLINE       0     0     0
>             ada4                ONLINE       0     0     0
>             ada1                ONLINE       0     0     0
>             *ada2*                ONLINE       0     0     0
>             ada3                ONLINE       0     0     0
>           raidz1-1              ONLINE       0     0     0
>             da3                 ONLINE       0     0     0
>             da0                 ONLINE       0     0     0
>             da2                 ONLINE       0     0     0
>             da1                 ONLINE       0     0     0
>           raidz1-2              ONLINE       0     0     0
>             aacd0               ONLINE       0     0     0
>             aacd1               ONLINE       0     0     0
>             aacd2               ONLINE       0     0     0
>             aacd3               ONLINE       0     0     0
>           raidz1-4              ONLINE       0     0     0
>             aacd4               ONLINE       0     0     0
>             aacd5               ONLINE       0     0     0
>             aacd6               ONLINE       0     0     0
>             aacd7               ONLINE       0     0     0
>         logs
>           15752248745115926170  UNAVAIL      0     0     0  was /dev/*ada2*
This looks like another case of not using labels. (Note that share has
ada2 listed in raidz1-0, but the log "was /dev/ada2"; the devices must
have switched places... maybe the pool also resilvered and your log
device was overwritten.)
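
If you want to confirm what actually ended up on which device, zdb can
read the ZFS label straight off a disk (I think it works even while the
pool is imported). Something like this (the device names are just the
ones from your status output) shows which pool member each disk really
is:

    # dump the on-disk vdev label and pick out the pool name and guids
    zdb -l /dev/ada2 | grep -E 'name|guid'
    zdb -l /dev/ada3 | grep -E 'name|guid'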

I did the same thing when I started with FreeBSD and ZFS... nobody
warned me either. When you reboot, the disks sometimes move around and
change numbers. In my experience they stay put with onboard SATA ports,
but with extra I/O cards, removable media, expanders, etc. they never
seem to stay put. For me, only the first disk on the back expander and
the first disk on the front expander ever keep the same number, and if
I add a new disk in the back, the front ones all go up by one. When a
data disk from my pool switched places with another data disk from the
same pool, zfs handled it automatically. But when a hotspare or
anything else switched places, it looked just like what you see in your
zpool status: "some big number ... UNAVAIL 0 0 0  was /dev/da#"
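
If you want to check what the kernel called everything on this
particular boot, this is the sort of thing I use (the aacd volumes
probably won't show up in camcontrol, since they sit behind the RAID
driver rather than CAM):

    # list the disks CAM sees and which bus/target each one landed on
    camcontrol devlist
    # show any existing GEOM labels and the device each one points at
    glabel status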

Here, I wrote you a howto that explains how to convert to labels:
http://forums.freebsd.org/showthread.php?p=157004
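
The rough idea, from memory (the label names below are only examples;
follow the howto for the real steps and the caveats), is to export the
pool, put a glabel on each disk so it has a name that never changes,
and then import using those labels instead of the raw device names:

    # with the pool exported, give each disk a permanent GEOM label,
    # ideally named after where it physically sits
    zpool export share
    glabel label -v back-slot-01 /dev/ada1
    glabel label -v back-slot-02 /dev/ada4
    # ...and so on, one label per disk...

    # re-import using the labels; zpool status then shows
    # label/back-slot-01 etc. instead of adaX
    zpool import -d /dev/label share

After that, a disk that moves from ada4 to ada5 still shows up as
label/back-slot-02, and zpool status keeps making sense.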

> errors: 3 data errors, use '-v' for a list
>
> Here is the ending output of zpool history:
>
> 2011-12-05.03:38:50 zpool upgrade -V 28 -a
> 2011-12-05.03:39:09 zpool export share
> 2011-12-05.03:39:33 zpool import -m share
> 2011-12-05.03:40:05 zpool remove share 15752248745115926170
> 2011-12-05.03:41:04 zpool remove share 15752248745115926170
> 2011-12-05.03:41:18 zpool export share
> 2011-12-05.03:41:56 zpool import -m share
> 2011-12-05.03:43:47 zpool remove share 15752248745115926170
> 2011-12-05.03:47:54 zpool remove share 15752248745115926170
> 2011-12-05.03:51:20 zpool scrub share
> 2011-12-05.16:33:01 zfs create share/vardb2
> 2011-12-05.16:33:32 zfs set compression=gzip-9 share/vardb2
> 2011-12-05.16:33:38 zfs set atime=off share/vardb2
> 2011-12-05.16:39:37 zfs destroy share/vardb
> 2011-12-05.16:39:47 zfs rename share/vardb2 share/vardb
> 2011-12-05.16:39:53 zfs set mountpoint=/var/db share/vardb
> 2011-12-05.16:47:24 zpool clear share
> 2011-12-05.16:48:41 zpool remove share 15752248745115926170
> 2011-12-05.16:53:15 zpool export -f share
> 2011-12-05.16:55:21 zpool import -m share
> 2011-12-05.16:55:52 zpool remove share 15752248745115926170
> 2011-12-05.16:56:56 zpool remove share -f 15752248745115926170
> 2011-12-05.17:04:07 zpool remove share 15752248745115926170
>
> What is going on here and how do I fix it?  
>


-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney at brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------
