ZFS: Can't repair raidz2 (Cannot replace a replacing device)

Wes Morgan morganw at chemikals.org
Mon Dec 28 04:27:39 UTC 2009


On Sun, 27 Dec 2009, Steven Schlansker wrote:

>
> On Dec 24, 2009, at 5:17 AM, Wes Morgan wrote:
>
>> On Wed, 23 Dec 2009, Steven Schlansker wrote:
>>>
>>> Why has the replacing vdev not gone away?  I still can't detach -
>>> [steven@universe:~]% sudo zpool detach universe 6170688083648327969
>>> cannot detach 6170688083648327969: no valid replicas
>>> even though now there actually is a valid replica (ad26)
>>
>> Try detaching ad26. If it lets you do that it will abort the replacement and then you just do another replacement with the real device. If it won't let you do that, you may be stuck having to do some metadata tricks.
>>
>
> Sadly, no go:
>
>  pool: universe
> state: DEGRADED
> scrub: none requested
> config:
>
>        NAME                       STATE     READ WRITE CKSUM
>        universe                   DEGRADED     0     0     0
>          raidz2                   DEGRADED     0     0     0
>            ad16                   ONLINE       0     0     0
>            replacing              DEGRADED     0     0 5.04K
>              ad26                 ONLINE       0     0     0
>              6170688083648327969  UNAVAIL      0 1.08M     0  was /dev/ad12
>            ad8                    ONLINE       0     0     0
>            concat/back2           ONLINE       0     0     0
>            ad10                   ONLINE       0     0     0
>            concat/ad4ex           ONLINE       0     0     0
>            ad24                   ONLINE       0     0     0
>            concat/ad6ex           ONLINE       0     0     0
>
> errors: No known data errors
> [steven@universe:~]% sudo zpool detach universe ad26
> cannot detach ad26: no valid replicas
> [steven@universe:~]% sudo zpool offline -t universe ad26
> cannot offline ad26: no valid replicas
>

Hmm. Looking through the spa_vdev_detach() code in

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c

it appears to be failing on either line 3046 or 3072. If it's failing on 
3046, the bug would appear to be that it doesn't count the missing device 
as a child, and so refuses to let you detach it. In that case, a "hack" 
might be to bypass the check by changing line 3045 to:

         if (pvd->vdev_children == 0)

If the failure is on 3072, then somehow the original device is not being 
counted as holding a valid copy of the data, so the detach is refused. 
That check looks like it would be dangerous to bypass.
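
For context, the two checks being discussed look roughly like this (a 
paraphrase from memory of the OpenSolaris spa_vdev_detach() code, not 
verbatim FreeBSD source; treat the variable names and the 
vdev_dtl_required() call as assumptions about this particular import):

         /*
          * Paraphrased sketch of spa_vdev_detach(); EBUSY here is what
          * zpool(8) reports back as "no valid replicas".
          */

         /* Near lines 3045-3046: never detach an only child. */
         if (pvd->vdev_children <= 1)
                 return (spa_vdev_exit(spa, NULL, txg, EBUSY));

         /* Near line 3072: refuse if the device being detached may
          * hold the only valid copy of some data, per the per-vdev
          * dirty time logs. */
         if (vdev_dtl_required(vd))
                 return (spa_vdev_exit(spa, NULL, txg, EBUSY));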

Based on my experience with this failure, I'm betting the device counting 
is off and it's returning at line 3046. You might try inserting some 
debugging kernel printfs there, or using kdb to step through it, and see 
which check fires. If it is the child count, I think bypassing the check 
at 3045 might let you detach the nonexistent device. Of course, back up 
your data before attempting anything of the sort!!
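
If you go the printf route, something like the following (illustrative 
and untested; note that vdev_path can be NULL for a missing device, 
hence the guard) placed just above the check at line 3045 would show 
what the kernel thinks the child count is:

         /* Untested debugging aid: report the child count before the
          * check that returns EBUSY ("no valid replicas"). */
         printf("spa_vdev_detach: detaching %s: vdev_children=%llu\n",
             vd->vdev_path != NULL ? vd->vdev_path : "(unknown)",
             (unsigned long long)pvd->vdev_children);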

