geom(4)/gmirror(4) automatic device DEGRADED status demotion
(WAS:Re: gmirror HD failure detection)
Brian A. Seklecki
lavalamp at spiritual-machines.org
Wed Feb 14 20:56:05 UTC 2007
On Wed, 14 Feb 2007, Brian A. Seklecki wrote:
> All:
>
> For a while our strategy was to use NRPE2 plus a custom Nagios check
> (check_raid_fbsdgmirror -- ugly-as-hell Perl, but I can make it
> available to the public).
>
> However, this morning a drive in a Dell PE1850 (one without a PERC4
> controller) started erroring. It has just a regular old (bad) mpt(4)
> controller.
>
> The problem is that gmirror(4) never marked the drive as failed.
>
> I'd have to tear through the code to find where the logic is for automatic
> demotion of a failed mirror.
>
> Either way, the original thinking behind the Nagios plugin check was that
> gmirror(4) would have some threshold of failed read/write attempts on a
> provider disk, beyond which it would flag the provider as "DEGRADED".
>
> It's entirely possible that we never had a chance to test it.
>
> Now I have to go back and re-visit all of that.
>
> ~BAS
>
> On Wed, 20 Sep 2006, Alex Zbyslaw wrote:
>
>> Robin Becker wrote:
>>
>>> After using Dru Lavigne's excellent article http://tinyurl.com/da66a about
>>> Raid-1 I have a full Raid-1 mirror on a new rack server. I'm wondering if
>>> anyone can tell me how best to monitor the hardware status to detect
>>> imminent failure of one of the disks? Do I use something like smartctl in
>>> a cron or what?
>>
>> Assuming that the disks support SMART then just read the man page for
>> smartd. No need for cron. You can also schedule "short" and "long" tests
>> to run in off hours. smartmontools is easy to uninstall if it doesn't work
>> for you. However, this will tell you that a disk is failing (or failed)
>> which is not quite the same as array status. An array (theoretically)
>> might be sub-optimal for non-SMART reasons. Someone familiar with gmirror
>> will have to answer that bit... but gmirror status -s looks from the man
>> page like it might be interesting and *that* could be run from cron and
>> parsed to weed out "status OK" results.
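
The cron/Nagios idea quoted above (run "gmirror status -s" and weed out the
healthy results) could be sketched roughly as follows. This is not the
check_raid_fbsdgmirror plugin mentioned earlier, only an illustration. It
assumes that each line of "gmirror status -s" is whitespace-separated with
the mirror name first, the mirror status second, and the component after
that, and that a healthy mirror reports COMPLETE. The function names and
messages are made up for the example; the exit codes follow the usual Nagios
plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Assumption: the status gmirror reports for a fully synchronized mirror.
HEALTHY = "COMPLETE"


def read_status():
    """Return the raw output of `gmirror status -s` (needs a FreeBSD host)."""
    result = subprocess.run(
        ["gmirror", "status", "-s"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_status(text):
    """Split each line into (mirror name, mirror status, component text)."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # blank or unexpected line; skip it
        component = fields[2].strip() if len(fields) > 2 else ""
        rows.append((fields[0], fields[1], component))
    return rows


def check(text):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    rows = parse_status(text)
    if not rows:
        return 3, "UNKNOWN: no mirrors found"
    # Collapse per-component lines into one entry per unhealthy mirror.
    bad = sorted({(name, status) for name, status, _ in rows if status != HEALTHY})
    if bad:
        detail = ", ".join(f"{name} is {status}" for name, status in bad)
        return 2, "CRITICAL: " + detail
    return 0, "OK: all mirrors " + HEALTHY
```

From cron or NRPE one would call check(read_status()), print the message and
exit with the returned code. Note that a check like this only reports what
gmirror itself believes: if gmirror(4) never demotes the erroring drive, as
in the mpt(4) case described above, it will keep returning OK.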