drive selection for disk arrays

Polytropon freebsd at
Fri Mar 27 09:51:35 UTC 2020

On Thu, 26 Mar 2020 16:37:58 -0400 (EDT), Daniel Feenberg wrote:
> The disturbing frequency of multiple drives going offline in quick
> succession is, in my view, largely a result of defects being discovered
> in quick succession, rather than occurring in quick succession. If a
> defect occurs in a sector that is rarely visited, it can remain hidden
> for a long time. During a resilver that defect will be noticed and the
> drive failed out. I do think that is an overly aggressive action by the
> resilvering process: that may be the only bad sector, it may be possible
> to recover all the data from the remaining drives (if the first failing
> drive can read the appropriate sector), and that sector may not even be
> in an active file.

I'd like to mention something in this context:

When a drive _reports_ bad sectors, at least in the past,
that was an indication that it already _had_ lots of them.
The drive's firmware remaps bad sectors to spare sectors,
so the OS sees "no error" so far. Once errors are reported
"upwards" (a "read error" or "write error" visible to the
OS), it is a sign that the disk has run out of spare
sectors and the firmware can no longer silently remap
_new_ bad sectors...
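
To make that concrete, here is a purely illustrative toy model
in Python (nothing to do with real firmware, just the behaviour
described above): the drive quietly remaps failing sectors into
a fixed spare pool, and only when that pool is exhausted do
errors become visible to the OS.

class Disk:
    def __init__(self, spare_sectors=16):
        self.spares_left = spare_sectors
        self.reallocated = 0    # what the firmware tracks internally

    def read(self, sector_is_bad):
        if not sector_is_bad:
            return "ok"
        if self.spares_left > 0:    # silent remap: the OS sees nothing
            self.spares_left -= 1
            self.reallocated += 1
            return "ok"
        return "read error"         # spares exhausted: error reaches the OS

d = Disk(spare_sectors=2)
print([d.read(True) for _ in range(4)])
# -> ['ok', 'ok', 'read error', 'read error']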

Is this still the case with modern drives?

How transparently can ZFS handle drive errors when the
drives only report the "end result" (i.e., when they can
no longer cope with bad sectors internally)? Do SMART
tools help here, for example by reading certain
firmware-provided values that indicate how many sectors
have _actually_ been marked bad, remapped internally, and
_not_ yet reported to the controller / disk I/O subsystem /
filesystem? That should be a good indicator of "will fail
soon", so the drive can be replaced before any data loss
or other problems occur.
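
Regarding the SMART part: smartmontools (sysutils/smartmontools
in ports) exposes exactly such counters, e.g. attribute 5
(Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198
(Offline_Uncorrectable). As a rough sketch in Python - assuming
smartctl is installed and the drive shows up as /dev/ada0,
adjust for your setup - something like this could watch them:

import subprocess

DEVICE = "/dev/ada0"              # assumption: adjust to your drive
WATCHED = {"5", "197", "198"}     # remap-related SMART attribute IDs

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True, check=False).stdout

for line in out.splitlines():
    fields = line.split()
    # attribute rows start with the numeric ID; RAW_VALUE is the last column
    if len(fields) >= 10 and fields[0] in WATCHED:
        attr_id, name, raw = fields[0], fields[1], fields[-1]
        print(f"{attr_id} {name}: {raw}")
        if raw.isdigit() and int(raw) > 0:
            print("   -> drive is already remapping; consider replacing it")

As far as I know, smartd(8) can also be told to send a warning
when such attributes change, so this doesn't have to be scripted
by hand.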

Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
