drive selection for disk arrays

Fri Mar 27 13:05:49 UTC 2020

On Thu, 26 Mar 2020, Kevin P. Neal wrote:

> On Thu, Mar 26, 2020 at 04:37:58PM -0400, Daniel Feenberg wrote:
>>
>> The disturbing frequency of multiple drives going offline in quick
>> succession is, in my view, largely a result of defects being discovered in
>> quick succession, rather than occurring in quick succession. If a defect
>> occurs in a sector that is rarely visited it can remain hidden for a long
>> time. During a resilver that defect will be noticed and the drive failed
>> out. I do think that is an overly aggressive action by the resilvering
>> process: that may be the only bad sector, it may be possible to recover
>> all the data from the remaining drives (if the first failing drive can read
>> the appropriate sector), and that sector may not even be in an active file.
>
> I thought that got fixed? I thought that a drive wouldn't be failed out
> of a pool due to simply bad sectors. Possibly this is only prevented during
> a resilver.

Can anyone provide an authoritative reference? I would sleep better if I 
knew this was fixed. When I raised the issue a decade ago, I was told
"only an IT admin of low moral character would ask for resilvering to
continue in the presence of an error", or words to that effect. And that 
was before it took a month to copy out the good data to an alternate store 
and another month to copy it back.

Daniel Feenberg

>
> Am I remembering wrong?
>
> -- 
> Kevin P. Neal                                http://www.pobox.com/~kpn/
>
> "Nonbelievers found it difficult to defend their position in \
>    the presence of a working computer." -- a DEC Jensen paper
>