zpool devices "stuck" (was zpool resilver restarting)

Sat Dec 27 11:59:59 PST 2008

On Fri, 26 Dec 2008, Wes Morgan wrote:

> On Fri, 26 Dec 2008, Wes Morgan wrote:
>
>> I just did a zpool replace on a new drive, and now it's resilvering.
>> Only, when it gets about 20mb resilvered it restarts. I can see all the 
>> drive activity simply halting for a period then resuming in gstat. I see 
>> some bugs in the opensolaris tracker about this, but no resolutions. It 
>> doesn't seem to be related to calling "zpool status" because I can watch 
>> gstat and see it restarting... Anyone seen this before, and hopefully have 
>> a workaround...?
>> 
>> The pool lost a drive on Wednesday and was running with a device missing, 
>> however due to the device numbering changing on the scsi bus, I had to 
>> export/import the pool to get it to come up, the same for after replacing 
>> it.
>
> Replying to myself with some more information. zpool history -l -i shows the 
> scrub loop happening:
>
> 2008-12-26.21:39:46 [internal pool scrub done txg:6463875] complete=0 [user 
> root on volatile]
> 2008-12-26.21:39:46 [internal pool scrub txg:6463875] func=1 mintxg=3 
> maxtxg=6463720 [user root on volatile]
> 2008-12-26.21:41:23 [internal pool scrub done txg:6463879] complete=0 [user 
> root on volatile]
> 2008-12-26.21:41:23 [internal pool scrub txg:6463879] func=1 mintxg=3 
> maxtxg=6463720 [user root on volatile]
> 2008-12-26.21:43:00 [internal pool scrub done txg:6463883] complete=0 [user 
> root on volatile]
> 2008-12-26.21:43:00 [internal pool scrub txg:6463883] func=1 mintxg=3 
> maxtxg=6463720 [user root on volatile]
> 2008-12-26.21:44:38 [internal pool scrub done txg:6463887] complete=0 [user 
> root on volatile]
> 2008-12-26.21:44:38 [internal pool scrub txg:6463887] func=1 mintxg=3 
> maxtxg=6463720 [user root on volatile]
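
To check the loop mechanically rather than by eye, here is a rough Python
sketch that scans those history entries for an incomplete "scrub done"
immediately followed by a new scrub start in the same txg. It is only an
illustration: the name restart_times is made up, the regular expressions
assume exactly the line format shown above (with the mail line wrapping
undone), and nothing here is part of ZFS itself.

```python
import re
from datetime import datetime

# Entries copied from the zpool history output above, with mail wrapping undone.
HISTORY = """\
2008-12-26.21:39:46 [internal pool scrub done txg:6463875] complete=0 [user root on volatile]
2008-12-26.21:39:46 [internal pool scrub txg:6463875] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
2008-12-26.21:41:23 [internal pool scrub done txg:6463879] complete=0 [user root on volatile]
2008-12-26.21:41:23 [internal pool scrub txg:6463879] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
2008-12-26.21:43:00 [internal pool scrub done txg:6463883] complete=0 [user root on volatile]
2008-12-26.21:43:00 [internal pool scrub txg:6463883] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
2008-12-26.21:44:38 [internal pool scrub done txg:6463887] complete=0 [user root on volatile]
2008-12-26.21:44:38 [internal pool scrub txg:6463887] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
"""

# Assumed line shapes, taken only from the sample above.
DONE = re.compile(r"^(\S+) \[internal pool scrub done txg:(\d+)\] complete=(\d+)")
START = re.compile(r"^(\S+) \[internal pool scrub txg:(\d+)\]")


def restart_times(text):
    """Return the timestamps at which an incomplete pass was restarted in the same txg."""
    restarts = []
    pending_txg = None  # txg of the last "scrub done" that reported complete=0
    for line in text.splitlines():
        done = DONE.match(line)
        if done:
            pending_txg = done.group(2) if done.group(3) == "0" else None
            continue
        start = START.match(line)
        if start and start.group(2) == pending_txg:
            restarts.append(datetime.strptime(start.group(1), "%Y-%m-%d.%H:%M:%S"))
        pending_txg = None
    return restarts


times = restart_times(HISTORY)
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(len(times), "restarts, seconds between them:", gaps)
```

Fed the entries above, every pass shows up as abandoned and restarted in the
same txg, with the restarts a little over a minute and a half apart.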

It seems that the resilver and the drive replacement were "fighting" each 
other somehow. Detaching the new drive allowed the resilver to complete, 
but now I'm stuck with two nonexistent devices trying to replace each 
other, and I can't replace a device that is already being replaced:

             replacing               UNAVAIL      0 36.4K     0  insufficient replicas
               17628927049345412941  FAULTED      0     0     0  was /dev/da4
               5474360425105728553   FAULTED      0     0     0  was /dev/da4

errors: No known data errors

So, how the heck do I cancel that replacement and restart it using 
/dev/da4?