smart(8) Call for Testing

Thu Mar 29 17:49:22 UTC 2018

On Thu, Mar 29, 2018 at 11:37 AM, Charles Sprickman via freebsd-fs <
freebsd-fs at freebsd.org> wrote:
>
> > But all my dead HDDs were replaced on self-test fail — it is what
> > allows me to replace them BEFORE data were lost.
>
> Yep, lots of folks claim the data is useless, but generally I see some
> signs of failure before the drive dies, and sometimes those signs are
> spotted because smartd is triggering regular self-tests.  And on SSDs,
> watching the MWI seems to work very well - these drives are much
> smarter (no pun intended) than spinny disks.

SMART lives in that area between "not reliably useful" and "sometimes
interesting". It's a kinda good enough system that kinda sorta signals
things, sometimes, if you are lucky.

We've found at $WORK that many of the metrics are suggestive and help us
monitor overall storage health, but only because we look at specific ones,
and look for trends and outliers from the rest of the herd. For that it can
be mildly useful. For example, we found that the %life used jumped suddenly
on some systems that had new firmware deployed and discovered an overly
aggressive writing bug in our control software (to be fair, it was in the
database back end rebalancing tables for each row insert due to bugs in it,
so a 100MB table wound up generating 100GB in writes). We've also used it
to identify certain machines with excessively high write amp which turned
out to be a different issue that was easily fixed. If you know what to look
for, and have a lot of experience with the drives, the SMART data can be
quite useful. So it's useful, but not without some experience and a very
large sample to use to find outliers.
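
To make the "outliers from the rest of the herd" idea concrete, here is a
rough sketch of one way to flag them. It is not the tooling described
above; the host names, the sample readings, the find_outliers() helper and
the 3.5 cutoff are all made up for illustration. It compares each machine's
reading of a single metric (say, %life used) against the fleet median,
using the median absolute deviation so the outliers themselves don't skew
the baseline.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the names whose metric is far from the fleet median.

    values: dict mapping a (hypothetical) host name to one SMART metric.
    threshold: cutoff on the modified z-score; 3.5 is a common rule of
    thumb, not a value taken from the original post.
    """
    readings = list(values.values())
    med = median(readings)
    # Median absolute deviation: robust to the outliers we are hunting.
    mad = median(abs(r - med) for r in readings)
    if mad == 0:
        # Fleet is uniform; flag anything that differs at all.
        return sorted(name for name, r in values.items() if r != med)
    return sorted(
        name
        for name, r in values.items()
        if 0.6745 * abs(r - med) / mad > threshold
    )


# Illustrative %life-used readings for six hypothetical cache hosts.
fleet = {
    "cache01": 12, "cache02": 11, "cache03": 13,
    "cache04": 12, "cache05": 47, "cache06": 10,
}
print(find_outliers(fleet))
```

This only works with a reasonably large sample of comparable drives, which
is the point made above: one drive's number on its own says very little.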

We don't bother to use it for drive failure. While scanning is nice, it's
too invasive to do on a regular basis. Sometimes we use it to force errors
on drives we already suspect of being bad, but usually we run the drive
until it fails then throw the data that was on it away (Work is Netflix
Open Connect caching servers, so we lose nothing if we dump the data since
it's just copies of copies). Once the drive fails (or becomes too
unreliable short of total failure), we fail it in place and just ignore it
from that point forward and suffer from reduced capacity. But failures are
driven by actual I/O errors, not by SMART data.
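
As a toy illustration of "fail in place" driven by I/O errors rather than
SMART data, something like the following captures the shape of the policy.
This is only a sketch under assumptions: the DrivePool class, the device
names and the max_errors limit are invented here, and the real criteria for
"too unreliable" are not stated above.

```python
from collections import Counter


class DrivePool:
    """Tracks which drives are still in service (hypothetical helper)."""

    def __init__(self, drives, max_errors=5):
        self.active = set(drives)
        self.failed = set()
        self.errors = Counter()
        # Assumed threshold; the original post gives no number.
        self.max_errors = max_errors

    def record_io_error(self, drive):
        """Count a real I/O error; retire the drive once it has too many."""
        if drive not in self.active:
            return  # already failed in place, ignore it from now on
        self.errors[drive] += 1
        if self.errors[drive] >= self.max_errors:
            # Fail in place: no replacement, just run with less capacity.
            self.active.discard(drive)
            self.failed.add(drive)


pool = DrivePool(["da0", "da1", "da2"], max_errors=2)
pool.record_io_error("da1")
pool.record_io_error("da1")
print(sorted(pool.active), sorted(pool.failed))
```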

Warner