ZFS - hot spares : automatic or not?
cforgeron at acsi.ca
Thu Jan 13 00:32:45 UTC 2011
Interesting, I was just testing Solaris 11 Express's ability to handle a pulled drive today. It handles it quite well. However, my Areca 1880 driver (arcmsr0) crashes when you reinsert the drive... but that's another topic, and an issue for Areca tech support.
...back to the point:
Solaris runs a separate daemon, the Fault Management Daemon (fmd), that handles this logic - meaning the behaviour really isn't inside the ZFS code at all, and FreeBSD would need something similar, hopefully less kludgy than a user script.
I wonder if anyone has been eyeing the FMA code under the CDDL with a thought for porting it - it looks to be a really neat bit of code. I'm still quite new to it, having only been working with Solaris for the last few months.
Here are two links with a bit of info on the Solaris daemon:
Here's my log of the event in Solaris 11 Express:
Jan 12 21:28:47 solaris fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Jan 12 21:28:47 solaris EVENT-TIME: Wed Jan 12 21:28:47 UTC 2011
Jan 12 21:28:47 solaris PLATFORM: PowerEdge-T710, CSN: 39SLQN1, HOSTNAME: solaris
Jan 12 21:28:47 solaris SOURCE: zfs-diagnosis, REV: 1.0
Jan 12 21:28:47 solaris EVENT-ID: ccfa7a23-838b-ebc8-decf-c2607afb390d
Jan 12 21:28:47 solaris DESC: The number of I/O errors associated with a ZFS device exceeded
Jan 12 21:28:47 solaris acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Jan 12 21:28:47 solaris AUTO-RESPONSE: The device has been offlined and marked as faulted. An attempt
Jan 12 21:28:47 solaris will be made to activate a hot spare if available.
Jan 12 21:28:47 solaris IMPACT: Fault tolerance of the pool may be compromised.
Jan 12 21:28:47 solaris REC-ACTION: Run 'zpool status -x' and replace the bad device.
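On FreeBSD, the natural hook point for this would be devd(8). Something like the following devd.conf rule might do it - note the match strings, device pattern, and script path here are my guesses, not tested config, and the right event class depends on the controller driver:

```
# Hypothetical /etc/devd.conf rule: run a helper script when a da(4)
# device node disappears. Verify the actual event strings with `devd -d`.
notify 10 {
        match "system"          "DEVFS";
        match "subsystem"       "CDEV";
        match "type"            "DESTROY";
        match "cdev"            "da[0-9]+";
        action "/usr/local/sbin/zfs-spare-swap.sh $cdev";
};
```

The helper script would then be responsible for deciding whether the vanished device belongs to a pool and swapping in a spare.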
From: owner-freebsd-stable at freebsd.org [mailto:owner-freebsd-stable at freebsd.org] On Behalf Of John Hawkes-Reed
Sent: Tuesday, January 11, 2011 12:11 PM
To: Dan Langille
Subject: Re: ZFS - hot spares : automatic or not?
On 11/01/2011 03:38, Dan Langille wrote:
> On 1/4/2011 11:52 AM, John Hawkes-Reed wrote:
>> On 04/01/2011 03:08, Dan Langille wrote:
>>> Hello folks,
>>> I'm trying to discover if ZFS under FreeBSD will automatically pull in a
>>> hot spare if one is required.
>>> This raised the issue back in March 2010, and refers to a PR opened in
>>> May 2009
>>> * http://lists.freebsd.org/pipermail/freebsd-fs/2010-March/007943.html
>>> * http://www.freebsd.org/cgi/query-pr.cgi?pr=134491
>>> In turn, the PR refers to this March 2010 post referring to using devd
>>> to accomplish this task.
>>> Does the above represent the current state?
>>> I ask because I just ordered two more HDD to use as spares. Whether they
>>> sit on the shelf or in the box is open to discussion.
>> As far as our testing could discover, it's not automatic.
>> I wrote some Ugly Perl that's called by devd when it spots a drive-fail
>> event, which seemed to DTRT when simulating a failure by pulling a drive.
> Without such a script, what is the value in creating hot spares?
We went through that loop in the office.
We're used to the way the NetApps work here, where often one's first
notice of a failed disk is a visit from the courier with a replacement.
(I'm only half joking)
In the end, writing enough Perl to swap in the spare disk made much more
sense than paging the relevant admin on disk-fail and expecting them to
be able to type straight at 4AM.
Our thinking is that having a hot spare allows us to do the physical
disk-swap in office hours, rather than (for instance) running in a
degraded state over a long weekend.
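For reference, designating the spares in the first place is a one-liner (the pool and device names here are examples, not our actual layout):

```
# Add two disks to the pool as hot spares:
zpool add tank spare da6 da7

# They then appear under a "spares" section of the output, marked AVAIL:
zpool status tank
```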
If it's of interest, I'll see if I can share the code.
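The core logic amounts to: find an AVAIL spare, then `zpool replace` the dead disk with it. A minimal sh sketch of that idea (not the actual Perl; the pool name "tank" and the `zpool status` parsing are illustrative assumptions):

```shell
#!/bin/sh
# Minimal spare-swap sketch. devd passes the failed device name as $1.
# Pool name "tank" is an assumption - adjust for your setup.

# Pick the first AVAIL spare from `zpool status` output read on stdin.
first_avail_spare() {
    awk '/^[[:space:]]*spares/ { in_spares = 1; next }
         in_spares && $2 == "AVAIL" { print $1; exit }'
}

POOL="tank"
FAILED="$1"

if [ -n "$FAILED" ]; then
    SPARE=$(zpool status "$POOL" | first_avail_spare)
    if [ -n "$SPARE" ]; then
        # Swap the spare in for the failed device; resilver starts on its own.
        zpool replace "$POOL" "$FAILED" "$SPARE"
    fi
fi
```

A real-world version also needs to debounce transient errors and notify the admin, which is where the Ugly Perl earns its name.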
freebsd-stable at freebsd.org mailing list
To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"