smartctl question

Fri Nov 9 21:23:39 UTC 2012

On Fri, Nov 9, 2012 at 3:47 AM, Lucas B. Cohen <lbc at bnrlabs.com> wrote:
> Hi,
>
> On 2012.11.09 12:18, H. Ingow wrote:
>>
>> Hi all,
>>
>> a single disk in a ZFS mirror kept failing, throwing errors like
>> kernel: (ada5:ata10:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84
>> (ICRC ABRT ) and the like.
>>
>> The pool itself continued working in a degraded state. smartctl showed
>> a very high 199 UDMA_CRC_Error_Count value, which to my knowledge may
>> indicate a broken cable. In this case a cable replacement did indeed
>> solve the problem; the pool resilvered and all is fine.
>>
>> Still, smartctl -a displays a 199 UDMA_CRC_Error_Count value (> 3900)
>> that I reckon is way too high.
>> So does this value still include the errors from the previous broken
>> cable?
>
> I'm pretty sure it does. I don't think SMART attributes can vary in
> value both up and down; they seem to me to be counters that can only
> get incremented.
>
>> In other words, when, if at all, is the cache smartmontools read from
>> flushed and values are to be taken as of the status after fixing a
>> hardware problem but not swapping the disk ?
> So, in my opinion no.

This is a limitation of S.M.A.R.T. All stats are stored by the drive, in
the drive, and the assumption is that all of the errors are caused by
problems in the drive (as they usually are). But when they come from a
cable problem, the drive never sees the problem as "gone", so the
counters never reset. Note that 199 is the attribute ID, not the count;
the count you reported is a bit over 3900. As long as you remember that
you had a cable problem with that drive and what the count was after
the fix, you can discount it, or recognize a new problem down the road
if it starts increasing again. I'd put the number on a label stuck to
the drive as a lasting reminder that the count is "off by about 3900".
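
The baseline idea above can also be kept in a small script rather than
on a label. What follows is only a rough sketch in Python, assuming the
usual ten-column attribute table that smartctl -A prints. The helper
names (raw_value, new_errors), the sample text and the exact figure 3912
are made up for illustration; the thread itself only says the count was
above 3900.

```python
import re
from typing import Optional

# Hypothetical excerpt of `smartctl -A` output; the numbers are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4123
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3912
"""


def raw_value(smartctl_output: str, attr_id: int) -> Optional[int]:
    """Return the RAW_VALUE of one attribute, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have ten columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == str(attr_id):
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def new_errors(current: int, baseline: int) -> int:
    """Errors accumulated since the baseline was written down."""
    return current - baseline


# Assumed baseline: the raw count noted right after the cable was replaced.
BASELINE = 3912

current = raw_value(SAMPLE, 199)
if current is not None:
    print("CRC errors since cable swap:", new_errors(current, BASELINE))
```

In real use the text would come from running smartctl -A against the
device rather than from SAMPLE. Any result above zero means the counter
has moved since the cable swap and the link deserves another look.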

By the way, I believe that some stats do go up and down, but counters
do not. As in SNMP, counters are never supposed to be reset or
resettable.
-- 
R. Kevin Oberman, Network Engineer
E-mail: kob6558 at gmail.com