geom(4)/gmirror(4) automatic device DEGRADED status demotion (WAS:Re: gmirror HD failure detection)

Wed Feb 14 20:56:05 UTC 2007

On Wed, 14 Feb 2007, Brian A. Seklecki wrote:

> All:
>
> For a while our strategy was to use NRPE2 plus a custom Nagios check
> (check_raid_fbsdgmirror -- ugly-as-hell Perl, but I can make it
> available to the public).
>
> However, this morning a drive in a Dell PE1850 (one without a PERC4
> controller) started erroring.  It has just a regular old (bad) mpt(4)
> controller.
>
> The problem is that gmirror(4) never marked the drive as failed.
>
> I'd have to tear through the code to find where the logic is for automatic 
> demotion of a failed mirror.
>
> Either way, the original thinking behind the Nagios plugin check was that
> gmirror(4) would have some threshold of failed read/write attempts on a
> provider disk, beyond which it would flag the provider as "DEGRADED".
>
> It's entirely possible that we never had a chance to test it.
>
> Now I have to go back and re-visit all of that.
>
> ~BAS
>
> On Wed, 20 Sep 2006, Alex Zbyslaw wrote:
>
>> Robin Becker wrote:
>> 
>>> After using Dru Lavigne's excellent article http://tinyurl.com/da66a about 
>>> Raid-1 I have a full Raid-1 mirror on a new rack server. I'm wondering if 
>>> anyone can tell me how best to monitor the hardware status to detect 
>>> imminent failure of one of the disks? Do I use something like smartctl in 
>>> a cron or what?
>> 
>> Assuming that the disks support SMART then just read the man page for 
>> smartd. No need for cron.  You can also schedule "short" and "long" tests 
>> to run in off hours.  smartmontools is easy to uninstall if it doesn't work 
>> for you. However, this will tell you that a disk is failing (or failed) 
>> which is not quite the same as array status.  An array (theoretically) 
>> might be sub-optimal for non-SMART reasons.  Someone familiar with gmirror 
>> will have to answer that bit... but gmirror status -s looks from the man
>> page like it might be interesting, and *that* could be run from cron and
>> parsed to weed out "status OK" results.
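
For the "run gmirror status -s from cron and parse it" idea quoted above, here is a minimal sketch in Python. It is not the check_raid_fbsdgmirror plugin mentioned earlier. It assumes that each line of `gmirror status -s` output is whitespace-separated as mirror name, status, component, and that a healthy mirror reports COMPLETE; please verify both against your own system before relying on it. The function names and the sample text are made up for illustration.

```python
# Sketch only: the column layout of `gmirror status -s` output
# (name, status, component) is an assumption, not verified here.

SAMPLE = (  # hypothetical text, not captured from a real system
    "mirror/gm0  DEGRADED  ad4\n"
    "mirror/gm1  COMPLETE  ad6\n"
)


def parse_gmirror_status(text):
    """Return (name, status, component) for every line whose status
    is anything other than COMPLETE."""
    problems = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or unrecognised line, skip it
        name, status = fields[0], fields[1]
        component = " ".join(fields[2:])
        if status != "COMPLETE":
            problems.append((name, status, component))
    return problems


def nagios_exit_code(problems):
    """Map the result to Nagios plugin exit codes: 0 = OK, 2 = CRITICAL."""
    return 2 if problems else 0


# Demonstration on the hypothetical sample text.
for name, status, component in parse_gmirror_status(SAMPLE):
    print(f"{name} is {status} (component: {component})")
```

On a real host the text would come from something like `subprocess.run(["gmirror", "status", "-s"], capture_output=True, text=True).stdout`, and the script would exit with `nagios_exit_code(...)`. Note that this only reports what gmirror itself believes, so it does not help in the case described above where gmirror(4) never marks the failing drive.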