drive selection for disk arrays

Sat Mar 28 00:39:54 UTC 2020

On 2020-03-27 02:45, Polytropon wrote:

> When a drive _reports_ bad sectors, at least in the past
> it was an indication that it already _has_ lots of them.
> The drive's firmware will remap bad sectors to spare
> sectors, so "no error" so far. 

If a drive detects an error, my guess is that it will report the error
to the OS regardless of the outcome of the particular I/O operation
(data read, data written, data lost) or the internal action taken (block
marked bad, block remapped, etc.).  It is then up to the OS to decide
what to do next.  RAID and/or ZFS offer a means of shielding the
application from I/O and drive failures.

> When errors are being
> reported "upwards" ("read error" or "write error"
> visible to the OS), it's a sign that the disk has run
> out of spare sectors, and the firmware cannot silently
> remap _new_ bad sectors...
> 
> Is this still the case with modern drives?
> 
> How transparently can ZFS handle drive errors when the
> drives only report the "top results" (i. e., cannot cope
> with bad sectors internally anymore)? Do SMART tools help
> here, for example, by reading certain firmware-provided
> values that indicate how many sectors _actually_ have
> been marked as "bad sector", remapped internally, and
> _not_ reported to the controller / disk I/O subsystem /
> filesystem yet? This should be a good indicator of "will
> fail soon", so a replacement can be done while no data
> loss or other problems appear.

I have been using smartctl(8) occasionally for many years.  The "SMART 
Attributes Data Structure" report would seem to hold statistics that 
should be useful for predicting failures.

This is my SOHO server:

2020-03-27 17:20:00 toor@f3 ~
# freebsd-version ; uname -a
12.1-RELEASE-p2
FreeBSD f3.tracy.holgerdanske.com 12.1-RELEASE-p2 FreeBSD 12.1-RELEASE-p2 GENERIC  amd64

This is a data drive:

2020-03-27 17:20:05 toor@f3 ~
# geom disk list ada1
Geom name: ada1
Providers:
1. Name: ada1
    Mediasize: 3000592982016 (2.7T)
    Sectorsize: 512
    Mode: r1w1e3
    descr: SEAGATE ST33000650NS
    lunid: 5000c5004e7ce23f
    ident: <redacted>
    rotationrate: 7200
    fwsectors: 63
    fwheads: 16

2020-03-27 17:20:08 toor@f3 ~
# smartctl -x /dev/ada1 | grep -A 30 'SMART Attributes Data Structure'
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
   1 Raw_Read_Error_Rate     POSR--   078   066   044    -    78783152
   3 Spin_Up_Time            PO----   092   091   000    -    0
   4 Start_Stop_Count        -O--CK   100   100   020    -    20
   5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
   7 Seek_Error_Rate         POSR--   066   060   030    -    4532285
   9 Power_On_Hours          -O--CK   100   100   000    -    612
  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
  12 Power_Cycle_Count       -O--CK   100   100   020    -    20
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   051   046   045    -    49 (Min/Max 39/54)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
193 Load_Cycle_Count        -O--CK   100   100   000    -    20
194 Temperature_Celsius     -O---K   049   054   000    -    49 (0 21 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   033   031   000    -    78783152
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
                             ||||||_ K auto-keep
                             |||||__ C event count
                             ||||___ R error rate
                             |||____ S speed/performance
                             ||_____ O updated online
                             |______ P prefailure warning

The following attributes look like they may be related to drive failure, 
but I do not know the engineering definition of these attributes nor the 
engineering definition of the values reported:

Reallocated_Sector_Ct
Seek_Error_Rate
End-to-End_Error
Reported_Uncorrect
Hardware_ECC_Recovered
Offline_Uncorrectable
UDMA_CRC_Error_Count
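
Lacking the engineering definitions, here is a rough sketch of how the
table above could be screened mechanically.  It assumes the brief
`smartctl -x` column layout shown above (VALUE in field 4, THRESH in
field 6, RAW_VALUE in field 8).  The function name smart_flag is made
up for illustration.  It flags any attribute whose normalized VALUE is
at or below a nonzero THRESH, which is smartctl's own failure
criterion.  It also flags a nonzero raw count for a handful of
attributes that I am assuming are plain event counters.  That second
rule is my assumption, not a vendor definition.  Seek_Error_Rate and
Hardware_ECC_Recovered are deliberately left out of the raw check,
because their raw values above are large on a drive that otherwise
looks healthy.

```shell
#!/bin/sh
# smart_flag (hypothetical helper): reads a `smartctl -x` attribute table
# on stdin and prints one line per attribute that looks suspect.
smart_flag() {
    awk '
    # Attribute rows start with a numeric ID and carry at least 8 fields;
    # header, legend, and wrapped lines do not match and are skipped.
    $1 ~ /^[0-9]+$/ && NF >= 8 {
        name = $2; value = $4 + 0; thresh = $6 + 0; raw = $8

        # Rule 1: normalized value at or below a nonzero threshold.
        if (thresh > 0 && value <= thresh)
            printf "%s: value %d <= threshold %d\n", name, value, thresh

        # Rule 2 (assumption): these raw values are simple event counts,
        # so anything above zero deserves a look.
        if (name ~ /^(Reallocated_Sector_Ct|End-to-End_Error|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ \
            && raw ~ /^[0-9]+$/ && raw + 0 > 0)
            printf "%s: raw count %s\n", name, raw
    }'
}
```

It would be fed with something like `smartctl -x /dev/ada1 | smart_flag`.
For the table above it should print nothing, since no VALUE is at or
below its THRESH and all of the counted attributes have a raw value of 0.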

I do feel the need to implement automated SMART monitoring, but have 
yet to embark on that journey.
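
One possible first step, short of setting up smartd(8), would be a
periodic job that runs smartctl and looks at its exit status.  The
smartctl(8) manual page documents the exit status as a bit mask.  The
sketch below only decodes that mask into text.  The function name
smart_status_text and the message wording are mine, and I have not run
this against a failing drive.

```shell
#!/bin/sh
# smart_status_text (hypothetical helper): translate a smartctl exit
# status, which is a bit mask, into one line of text per bit that is set.
smart_status_text() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    bit=0
    # Messages are listed in bit order, bit 0 first, paraphrasing smartctl(8).
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed" \
        "SMART status: disk failing" \
        "prefail attribute at or below threshold" \
        "attribute was at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Test whether this bit is set in the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "$msg"
        fi
        bit=$((bit + 1))
    done
}
```

A cron job could run `smartctl -x /dev/ada1 > /dev/null`, pass `$?` to
the function, and mail the result whenever it is anything other than
"ok".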

David