smart(8) Call for Testing

Warner Losh imp at bsdimp.com
Sat Mar 31 15:13:26 UTC 2018


On Fri, Mar 30, 2018 at 11:16 PM, Michael Dexter <editor at callfortesting.org>
wrote:

> On 3/29/18 6:43 AM, Lev Serebryakov wrote:
>
>>    Monitoring of values and alerting is VERY important (the number of
>> reallocated sectors is the main indicator of spinning-HDD health, and
>> when it rises that must be known ASAP)
>>
>
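The reallocation-count check Lev describes can be sketched with a small parser over smartctl(8) output. This is a hypothetical illustration, not anything from the thread: the attribute-table layout it assumes is the standard `smartctl -A` format, where attribute 5 (Reallocated_Sector_Ct) carries the raw count in the last column.

```python
import subprocess

def reallocated_sectors(smartctl_output: str) -> int:
    """Parse `smartctl -A` text output and return the raw value of SMART
    attribute 5 (Reallocated_Sector_Ct), or -1 if the attribute is absent.

    Assumes the usual attribute-table row layout:
      ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric attribute ID.
        if len(fields) >= 10 and fields[0] == "5" \
                and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return -1

def check_disk(dev: str, threshold: int = 0) -> bool:
    """Return True if the reallocation count on `dev` exceeds `threshold`.
    A threshold of 0 alerts on the very first reallocated sector, which
    matches the "known ASAP" policy above."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    return reallocated_sectors(out) > threshold
```

A cron job or periodic(8) script could run `check_disk` per device and feed the result to whatever alerting sits higher in the stack.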
> Another metric that frequently came up during outreach was any sudden
> increase in disk latency, usually indicating that you have between one and
> 24 hours to replace the device. I am curious what people are doing now to
> determine such changes in latency and where they feel such monitoring
> should exist in the stack.
>

Netflix has a monitoring program that uses gstat to gather average latency
stats and send them to our centralized data store. It's really only by
looking at the long-term trend that you'll see the spike in retries that
manifests itself as higher latencies. One problem with gstat, though, is
that it includes software queueing time. For many purposes that's fine, but
when you're trying to determine whether a spike is due to extra load on the
device or to a hardware problem, it becomes bothersome. The CAM I/O
scheduler, when the dynamic scheduler is enabled, keeps all kinds of stats
about device latency, including a cumulative latency histogram. Those are
also useful things to look at.
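A collector of the kind described above could be built around gstat's batch mode. The sketch below is hypothetical, not Netflix's actual tooling; it assumes the default gstat column order (`L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy name`, as produced by something like `gstat -pb`) and a simple multiplicative threshold for spotting a latency spike against a long-term baseline.

```python
from typing import Dict, List, Tuple

def parse_gstat_batch(output: str) -> Dict[str, Tuple[float, float]]:
    """Parse one gstat batch-mode sample into {device: (ms/r, ms/w)}.

    Assumes the default column order:
      L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy name
    The dT: line and the column-header row are skipped because their
    first field is not a bare integer.
    """
    latencies: Dict[str, Tuple[float, float]] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        name = fields[9]
        latencies[name] = (float(fields[4]), float(fields[7]))
    return latencies

def spiked(baseline: Dict[str, Tuple[float, float]],
           sample: Dict[str, Tuple[float, float]],
           factor: float = 3.0) -> List[str]:
    """Return devices whose read or write latency in `sample` exceeds
    `factor` times the long-term baseline. Note this still includes
    software queueing time, per the caveat above."""
    hot = []
    for dev, (ms_r, ms_w) in sample.items():
        base_r, base_w = baseline.get(dev, (0.0, 0.0))
        if (base_r and ms_r > factor * base_r) or \
           (base_w and ms_w > factor * base_w):
            hot.append(dev)
    return hot
```

In practice each sample would be shipped to the central store and the baseline computed there, so the trend survives host reboots.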

At Netflix, though, we let the disk fail and then mark it as disabled. We
don't look for trends to predict possible failure, because we have a
fail-in-place model: data loss doesn't matter, since all the data on the
machine is replicated from a central source of truth and can easily be
replaced.


> As for SNMP and friends, I consider those way up the stack with tools like
> smart(8) simply providing a building block.


In many ways, our data collection system at work is an alternative to SNMP.

Warner


More information about the freebsd-fs mailing list