ZFS "zpool replace" problems

Tue Jan 26 14:30:24 UTC 2010

I'm removing the In-Reply-To mail headers for this thread, as you've now
hijacked it for a different purpose.  Please don't do this; start a new
thread altogether.  :-)

On Tue, Jan 26, 2010 at 02:57:20PM +0100, Gerrit Kühn wrote:
> I am still busy replacing RE2-disks with updated drives. I came across a
> very strange thing with zfs. Actually I had the following pool layout:
> 
> mclane# zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad8     ONLINE       0     0     0
>             ad10    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>         spares
>           ad14      AVAIL   
> 
> errors: No known data errors
> 
> All disks still have the firmware bug, so I want to replace them with
> disks that I already fixed. I put in an updated drive as ad18 and
> wanted to replace ad12 to get the drive with the broken firmware out:
> 
> mclane# zpool replace tank /dev/ad12 /dev/ad18 
> mclane# zpool status
>   pool: tank
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
> config:
> 
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0     0
>           raidz1       ONLINE       0     0     0
>             ad8        ONLINE       0     0     0  7.21M resilvered
>             ad10       ONLINE       0     0     0  7.22M resilvered
>             replacing  ONLINE       0     0     0
>               ad12     ONLINE       0     0     0
>               ad18     ONLINE       0     0     0  10.7M resilvered
>         spares
>           ad14         AVAIL   
> 
> errors: No known data errors
> 
> However, something must have gone wrong during the resilvering process and
> it now looks like this:
> 
> mclane# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26 14:00:00 2010
> config:
> 
>         NAME           STATE     READ WRITE CKSUM
>         tank           DEGRADED     0     0     0
>           raidz1       DEGRADED     0     0     0
>             ad8        ONLINE       0     0     0  975M resilvered
>             ad10       ONLINE       0     0   142  974M resilvered
>             replacing  DEGRADED     0 7.25M     0
>               ad12     ONLINE       0     0     0
>               ad18     REMOVED      0     1     0  79.4M resilvered
>         spares
>           ad14         AVAIL   
> 
> errors: No known data errors
> 
> 
> What is going on here? ad18 obviously detached during the
> process. /var/log/messages just gives me
> 
> Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached
> 
> Additionally ad10 obviously produced chksum errors. What do I do about the
> degraded replacing process? Can I terminate it somehow and maybe replace
> ad10 first? Any other hints?

I'm not sure how the above is supposed to work (I haven't personally
tried it), but:

1) Why didn't you offline the ad12 disk (the one being replaced) first?
   zpool offline tank ad12

2) How did you attach ad18?  Did you tell the system about it using
   atacontrol?  If so, what commands did you use?

3) Can you please provide uname -a output, as well as relevant dmesg
   output to show what kind of SATA controller you have, what's
   attached to what, etc.?
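
If it helps, something along these lines would collect the details for
2) and 3) in one go.  Treat it as a rough sketch only: collect_diag is a
made-up helper name, the grep pattern is my guess at what your controller
and disks show up as in dmesg, and I'm assuming the stock atacontrol and
zpool utilities are in your PATH.

```shell
#!/bin/sh
# Hypothetical helper (not a stock tool): print the system details
# requested above as one report on stdout.
collect_diag() {
    echo "== uname -a =="
    uname -a

    echo "== dmesg (ATA/SATA lines) =="
    # The pattern is a guess; widen it if your controller uses another name.
    dmesg 2>/dev/null | grep -Ei 'ata|ahci|ad[0-9]+' \
        || echo "(no matching dmesg lines)"

    echo "== atacontrol list =="
    if command -v atacontrol >/dev/null 2>&1; then
        # Shows which device sits on which ATA channel.
        atacontrol list || true
    else
        echo "(atacontrol not found)"
    fi

    echo "== zpool status =="
    if command -v zpool >/dev/null 2>&1; then
        zpool status || true
    else
        echo "(zpool not found)"
    fi
}
```

Run it as "collect_diag > diag.txt" and paste the file into your reply.
Sections for tools that aren't installed are just marked as not found.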

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |