ZFS: Can't repair raidz2 (Cannot replace a replacing device)

Wes Morgan morganw at chemikals.org
Mon Dec 28 04:59:39 UTC 2009


On Sun, 27 Dec 2009, Steven Schlansker wrote:

>
> On Dec 24, 2009, at 5:17 AM, Wes Morgan wrote:
>
>> On Wed, 23 Dec 2009, Steven Schlansker wrote:
>>>
>>> Why has the replacing vdev not gone away?  I still can't detach -
>>> [steven at universe:~]% sudo zpool detach universe 6170688083648327969
>>> cannot detach 6170688083648327969: no valid replicas
>>> even though now there actually is a valid replica (ad26)
>>
>> Try detaching ad26. If it lets you do that, it will abort the replacement, and then you can just do another replacement with the real device. If it won't let you do that, you may be stuck having to do some metadata tricks.
>>
>
> Sadly, no go:
>
>  pool: universe
> state: DEGRADED
> scrub: none requested
> config:
>
>        NAME                       STATE     READ WRITE CKSUM
>        universe                   DEGRADED     0     0     0
>          raidz2                   DEGRADED     0     0     0
>            ad16                   ONLINE       0     0     0
>            replacing              DEGRADED     0     0 5.04K
>              ad26                 ONLINE       0     0     0
>              6170688083648327969  UNAVAIL      0 1.08M     0  was /dev/ad12
>            ad8                    ONLINE       0     0     0
>            concat/back2           ONLINE       0     0     0
>            ad10                   ONLINE       0     0     0
>            concat/ad4ex           ONLINE       0     0     0
>            ad24                   ONLINE       0     0     0
>            concat/ad6ex           ONLINE       0     0     0
>
> errors: No known data errors
> [steven at universe:~]% sudo zpool detach universe ad26
> cannot detach ad26: no valid replicas
> [steven at universe:~]% sudo zpool offline -t universe ad26
> cannot offline ad26: no valid replicas
>
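
For the record, the sequence I had in mind was roughly the following, with
the GUID taken from your status output (a sketch, not something I have run
against your pool):

zpool detach universe ad26
zpool replace universe 6170688083648327969 ad26

The detach would abort the half-finished replacement, leaving only the
long-gone ad12 entry in the raidz2, and the replace would then resilver
onto the real ad26. As your output shows, though, it is the first step
that is being refused.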

I just tried to re-create this scenario with some sparse files and I was
able to detach the stale device completely (below). There is one
difference, however: your pool is reporting checksum errors on the
replacing vdev that contains ad26. Perhaps that is making ZFS think that
no device inside the replacing vdev holds a complete copy of the data, so
it denies the detach, even though logically the data would be recovered by
a later scrub anyway. Interesting. If you can determine where the detach
is failing, that would help paint a more complete picture.
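
If the checksum errors really are what is blocking it, then once the
resilver and a scrub have completed it may be worth clearing the error
counters and retrying the detach. Something along these lines (untested
against your pool; the GUID is simply the one from your status output):

zpool scrub universe
# wait for the scrub to finish, then clear the accumulated error counters
zpool clear universe
# retry detaching the stale half of the replacing vdev
zpool detach universe 6170688083648327969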


[root at catalyst:~#]: zpool status testz2
   pool: testz2
  state: DEGRADED
  scrub: none requested
config:

         NAME                       STATE     READ WRITE CKSUM
         testz2                     DEGRADED     0     0     0
           raidz2                   DEGRADED     0     0     0
             md1                    ONLINE       0     0     0
             md2                    ONLINE       0     0     0
             replacing              DEGRADED     0     0     0
               md3                  ONLINE       0     0     0
               8502561034916233095  UNAVAIL      0   323     0  was /dev/md7
             md4                    ONLINE       0     0     0
             md5                    ONLINE       0     0     0
             md6                    ONLINE       0     0     0

errors: No known data errors
[root at catalyst:~#]: zpool detach testz2 8502561034916233095
[root at catalyst:~#]: zpool status testz2
   pool: testz2
  state: ONLINE
  scrub: none requested
config:

         NAME        STATE     READ WRITE CKSUM
         testz2      ONLINE       0     0     0
           raidz2    ONLINE       0     0     0
             md1     ONLINE       0     0     0
             md2     ONLINE       0     0     0
             md3     ONLINE       0     0     0
             md4     ONLINE       0     0     0
             md5     ONLINE       0     0     0
             md6     ONLINE       0     0     0

errors: No known data errors
[root at catalyst:~#]:
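
For reference, the test pool above was set up roughly like this, in case
you want to reproduce it with sparse files yourself (reconstructed from
memory, so the sizes, md unit numbers and the forced removal of md7 are
approximate rather than a verbatim transcript):

# create seven sparse backing files and attach them as md1..md7 (sh syntax)
for i in 1 2 3 4 5 6 7; do
    truncate -s 512m /tmp/zdisk$i
    mdconfig -a -t vnode -f /tmp/zdisk$i -u $i
done

# six-disk raidz2 test pool
zpool create testz2 raidz2 md1 md2 md3 md4 md5 md6

# start replacing md3 with md7, then yank md7 away mid-resilver so the
# pool is left holding a dangling "replacing" vdev, as in your case
zpool replace testz2 md3 md7
mdconfig -d -u 7 -o force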

