zpool devices "stuck" (was zpool resilver restarting)

Sun Dec 28 22:38:54 PST 2008

On Sat, 27 Dec 2008, Wes Morgan wrote:

> On Fri, 26 Dec 2008, Wes Morgan wrote:
>
>> On Fri, 26 Dec 2008, Wes Morgan wrote:
>> 
>>> I just did a zpool replace on a new drive, and now it's resilvering.
>>> Only, when it gets about 20 MB resilvered it restarts. I can see all the 
>>> drive activity simply halting for a period then resuming in gstat. I see 
>>> some bugs in the opensolaris tracker about this, but no resolutions. It 
>>> doesn't seem to be related to calling "zpool status" because I can watch 
>>> gstat and see it restarting... Anyone seen this before, and hopefully have 
>>> a workaround...?
>>> 
>>> The pool lost a drive on Wednesday and was running with a device missing, 
>>> however due to the device numbering changing on the scsi bus, I had to 
>>> export/import the pool to get it to come up, the same for after replacing 
>>> it.
>> 
>> Replying to myself with some more information. zpool history -l -i shows 
>> the scrub loop happening:
>> 
>> 2008-12-26.21:39:46 [internal pool scrub done txg:6463875] complete=0 [user root on volatile]
>> 2008-12-26.21:39:46 [internal pool scrub txg:6463875] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
>> 2008-12-26.21:41:23 [internal pool scrub done txg:6463879] complete=0 [user root on volatile]
>> 2008-12-26.21:41:23 [internal pool scrub txg:6463879] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
>> 2008-12-26.21:43:00 [internal pool scrub done txg:6463883] complete=0 [user root on volatile]
>> 2008-12-26.21:43:00 [internal pool scrub txg:6463883] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
>> 2008-12-26.21:44:38 [internal pool scrub done txg:6463887] complete=0 [user root on volatile]
>> 2008-12-26.21:44:38 [internal pool scrub txg:6463887] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
>
>
> It seems that the resilver and drive replacement were "fighting" each other 
> somehow. Detaching the new drive allowed the resilver to complete, but now 
> I'm stuck with two nonexistent devices trying to replace each other, and I 
> can't replace a device that is being replaced:
>
>            replacing               UNAVAIL      0 36.4K     0  insufficient replicas
>              17628927049345412941  FAULTED      0     0     0  was /dev/da4
>              5474360425105728553   FAULTED      0     0     0  was /dev/da4
>
> errors: No known data errors
>
> So, how the heck do I cancel that replacement and restart it using /dev/da4?

Ok, dear sweet mercy, I think I've dug myself out of the huge hole. I 
found a bug in the opensolaris tracker that is basically the same as my 
issue:

http://bugs.opensolaris.org/view_bug.do?bug_id=6782540

So, I spent most of the weekend trying to figure out how to repair the 
damage. I ended up re-creating the actual zfs disk label for the 547xxx 
device and dumping that onto the drive. After some trouble with checksums, 
the system came back to life a few hours ago and I thought I was out of 
the woods when the resilver started up. However, I was not... I had simply 
got myself back into the resilver loop that I could not stop. Back to the 
drawing board...
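
For anyone trying to confirm the same loop on their own pool, here is a minimal Python sketch that scans "zpool history -l -i" output for scrub/resilver passes that ended with complete=0. It assumes each history entry sits on one line, as zpool prints it (not wrapped by a mail client), and that the line format matches the entries quoted above. The function names and the SAMPLE text are only illustrative; the sample lines are copied from the history output earlier in this thread.

```python
import re
from datetime import datetime

# Matches the internal "scrub done" events as they appear in the quoted
# `zpool history -l -i` output. The exact format is assumed from that sample.
DONE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}\.\d{2}:\d{2}:\d{2}) "
    r"\[internal pool scrub done txg:(\d+)\] complete=(\d+)"
)


def find_restarts(history_text):
    """Return (timestamp, txg) for every pass that ended with complete=0."""
    restarts = []
    for line in history_text.splitlines():
        m = DONE_RE.match(line.strip())
        if m and m.group(3) == "0":
            when = datetime.strptime(m.group(1), "%Y-%m-%d.%H:%M:%S")
            restarts.append((when, int(m.group(2))))
    return restarts


def restart_intervals(restarts):
    """Seconds between consecutive incomplete passes."""
    return [
        int((b[0] - a[0]).total_seconds())
        for a, b in zip(restarts, restarts[1:])
    ]


# Sample taken from the history output quoted in this thread.
SAMPLE = """\
2008-12-26.21:39:46 [internal pool scrub done txg:6463875] complete=0 [user root on volatile]
2008-12-26.21:39:46 [internal pool scrub txg:6463875] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
2008-12-26.21:41:23 [internal pool scrub done txg:6463879] complete=0 [user root on volatile]
2008-12-26.21:41:23 [internal pool scrub txg:6463879] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
2008-12-26.21:43:00 [internal pool scrub done txg:6463883] complete=0 [user root on volatile]
2008-12-26.21:43:00 [internal pool scrub txg:6463883] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
2008-12-26.21:44:38 [internal pool scrub done txg:6463887] complete=0 [user root on volatile]
2008-12-26.21:44:38 [internal pool scrub txg:6463887] func=1 mintxg=3 maxtxg=6463720 [user root on volatile]
"""

if __name__ == "__main__":
    found = find_restarts(SAMPLE)
    print(len(found), "incomplete passes")
    print("seconds between restarts:", restart_intervals(found))
```

A steady stream of complete=0 entries at near-constant intervals, with the same mintxg/maxtxg each time, is the pattern shown above. A single complete=0 entry on its own would not necessarily mean a loop.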

Using gvirstor, I created a 500 GB volume (with only 100 GB available to back 
it), dumped the label of the 176xxx device onto it, and did an export/import, 
after which the resilver started back up. Checking gstat showed that the real 
device was not being written to at all, so I realized that it was going to 
try to resilver the 176xxx device first before doing the replacement. Not 
good... After some more floundering, I discovered that I could "zpool 
detach" the virstor volume, leaving me with only real devices in the pool. 
Except now it did not want to do a complete and true resilver, only 
resilvering a tiny bit of data, about 20 MB or so. My wild guess is 
that it has something to do with txg IDs and how the resilver 
tries to only do the data that is "new". Since there is no way (that I 
know of) to force a resilver with zpool, I simply started scrubbing the 
array. This would probably have worked, but it was going to take far too 
long, and was simply throwing up millions of checksum errors on the new 
drive. So I cancelled the scrub and figured I could just offline the drive 
and replace it with itself... Nope, no dice, it was reported as "busy". 
However, after mucking around with the label some more, I was able to 
finally get the drive to replace itself and start resilvering. Hopefully 
it will finish successfully.
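
The stuck state quoted earlier (a "replacing" vdev whose children are both FAULTED) can also be checked for mechanically. The following is a rough Python sketch, not a supported tool: it assumes that "zpool status" indents child vdevs more deeply than their parent, as in the excerpt above, and that the first two columns are the device name and its state. The function name and SAMPLE text are illustrative only.

```python
def stuck_replacements(status_text):
    """Return the (name, state) children of every 'replacing' vdev whose
    children are all FAULTED or UNAVAIL."""
    lines = [l for l in status_text.splitlines() if l.strip()]
    stuck = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.split()[0].startswith("replacing"):
            indent = len(line) - len(line.lstrip())
            children = []
            j = i + 1
            # Children are the following lines indented deeper than the parent.
            while j < len(lines):
                child = lines[j]
                if len(child) - len(child.lstrip()) <= indent:
                    break
                fields = child.split()
                if len(fields) >= 2:
                    children.append((fields[0], fields[1]))
                j += 1
            if children and all(s in ("FAULTED", "UNAVAIL") for _, s in children):
                stuck.append(children)
            i = j
        else:
            i += 1
    return stuck


# Excerpt from the zpool status output quoted in this thread.
SAMPLE = """\
            replacing               UNAVAIL      0 36.4K     0  insufficient replicas
              17628927049345412941  FAULTED      0     0     0  was /dev/da4
              5474360425105728553   FAULTED      0     0     0  was /dev/da4

errors: No known data errors
"""

if __name__ == "__main__":
    for children in stuck_replacements(SAMPLE):
        print("stuck replacement:", children)
```

This only reports the condition; it does nothing to repair it.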

I'm still not sure what went wrong. Part of what happened seems to be 
related to scsi devices not being wired down like atapi devices, so 
successive reboots replaced "offline" devices with "faulted", and the pool 
kept trying to write to them, just generating more errors.

Do the folks on the opensolaris zfs-discuss take reports from FreeBSD 
users, or do they just toss it back at you? I did actually boot an 
opensolaris live cd at one point, but it couldn't match the vdevs with 
devices well enough to import the pool. I don't think it would have 
handled it properly anyway, given the bug I found in their database.

Hope no one ever has to deal with this themselves! Whew...