[SOLVED] Re: ZFS replacing drive issues

Sun Jan 11 10:24:33 UTC 2015

On 01/10/15 05:18, Henrik Hudson wrote:
> On Tue, 06 Jan 2015, Da Rock wrote:
>
>> On 05/01/2015 11:07, William A. Mahaffey III wrote:
>>> On 01/04/15 18:25, Da Rock wrote:
>>>> I haven't seen anything specifically on this when googling, but I'm
>>>> having a strange issue in replacing a degraded drive in ZFS.
>>>>
>>>> The drive has been REMOVED from ZFS pool, and so I ran 'zpool replace
>>>> <pool> <old device> <new device>'. This normally just works, and I
>>>> have checked that I have removed the correct drive via serial number.
>>>>
>>>> After resilvering, it still shows that it is in a degraded state, and
>>>> that the old and the new drive have been REMOVED.
>>>>
>>>> No matter what I do, I can't seem to get the zfs system online and in
>>>> a good state.
>>>>
>>>> I'm running a raidz1 on 9.1 and zfs is v28.
>>>>
>>>> Cheers
>>>> _______________________________________________
>>>> freebsd-questions at freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-questions
>>>> To unsubscribe, send any mail to
>>>> "freebsd-questions-unsubscribe at freebsd.org"
>>>>
>>> Someone posted a similar problem a few weeks ago; rebooting fixed it
>>> for them (as opposed to trying to get zfs to fix itself w/ management
>>> commands), might try that if feasible .... $0.02, no more, no less ....
>>>
>> Sorry, that didn't work unfortunately. I had to wait a bit until I could
>> do it between it trying to resilver and workload. It came online at
>> first, but then went back to removed when I checked again later.
>>
>> Any other diags I can do? I've already run smartctl on all the drives
>> (5hrs+) and they've come back clean. There's not much to go on in the
>> logs either. Do a small number of drives just naturally error when
>> placed in a raid or something?
> a) try a 'zpool clear' to perhaps force it to clear errors, but to
> be safe I'd still do "c" below.
>
> b) Did you physically remove the old drive and replace it and then
> run a zpool replace? Did the devices have the same device ID or did
> you use GPT ids?
>
> c) If it's a mirror, try just detaching the device with 'zpool detach
> pool device' and then re-attaching it via 'zpool attach'.
>
> henrik
>
Thanks for that info, I'll try it next time.
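
For the archives, here is a rough sketch of what suggestion (a) plus a 
plain replace might look like as commands. The pool name and device 
names (tank, ada3, ada4) are placeholders, not taken from this thread, 
and the run wrapper is only an illustration: it prints each command 
rather than executing it unless DRY_RUN=0 is set. Suggestion (c) is left 
out because it only applies to mirrors and this pool is raidz1.

```shell
#!/bin/sh
# Sketch only: pool and device names below are hypothetical placeholders.
POOL="${POOL:-tank}"          # assumed pool name
OLD_DEV="${OLD_DEV:-ada3}"    # assumed device that shows as REMOVED
NEW_DEV="${NEW_DEV:-ada4}"    # assumed replacement device

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

zfs_replace_steps() {
    run zpool status -v "$POOL"                      # which vdev is REMOVED/DEGRADED?
    run zpool clear "$POOL"                          # (a) clear the pool's error counters
    run zpool replace "$POOL" "$OLD_DEV" "$NEW_DEV"  # swap the new disk in for the old one
    run zpool status -v "$POOL"                      # check on the resilver afterwards
}

zfs_replace_steps
```

Reading the printed commands first and only then re-running with 
DRY_RUN=0 seems the safer order, since a replace against the wrong 
device is not something you want to find out about afterwards.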

Meanwhile, I had to spend more than a few hours (about 2 days actually - 
each test takes 5+ hours, and it had some hissy fits; some of which 
occurred at about 90% completion, the little #$%!) going through the 
drives and running tests using smartctl and the vendor's tools. Turns out 
I had a DOA, but with a twist: using smartctl the test would run on 
other drives, maybe up to 50%, and then stop and say the test failed. On 
the DOA it would pass.
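
For anyone wanting to repeat the smartctl part, it was roughly along 
these lines. This is a sketch, not a transcript: the device node 
/dev/ada0 is a placeholder, and the run wrapper is just an illustration 
that prints the commands unless DRY_RUN=0 is set. The long test runs on 
the drive itself, so the result has to be read from the self-test log 
once it has finished.

```shell
#!/bin/sh
# Sketch only: the device node is a hypothetical placeholder.
DISK="${DISK:-/dev/ada0}"

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

smart_long_test() {
    run smartctl -H "$DISK"           # overall SMART health verdict
    run smartctl -t long "$DISK"      # start the extended (long) self-test
    run smartctl -l selftest "$DISK"  # later: read the self-test log for the outcome
}

smart_long_test
```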

I then turned to the vendor tools, and ran through each drive (I had 8 
to test amongst my lot as I got more than a bit curious/suspicious about 
what was happening overall). I tried testing all in one machine and they 
all interfered with one another, so I needed to test individually and 
try and save the result (a tricky one given the ridiculous tools 
supplied; I know a good tradesman never blames his tools, but take 
Windows for eg... :) ). Once that was all sorted (24 hours' work 
later), I found 
the DOA drive for my raid would pass a simple test, go through maybe 50% 
of the longer test, and then come up with a failed test - but with 
absolutely no error code (one is expected).

So it was a bit of an odd duck. As a general rule I find the vendor 
rather good and support is second to none, but the drives aren't exactly 
top dollar either so I have no complaints - but this did send me into a 
bit of a spin. At least the experience has been enlightening :)

For reference, smartctl and such aren't taken seriously by vendors. They 
will accept it if SMART has been tripped (failed health test), but other 
than that you need to use their tools for diags. Maybe not news to some, 
but there's a lot of fluff out there that says otherwise.

Thanks again for the pointers guys!