ZFS - hot spares : automatic or not?

Thu Jan 13 01:17:40 UTC 2011

On 01/12/11 19:32, Chris Forgeron wrote:
> Interesting, I was just testing Solaris 11 Express's ability to handle a pulled drive today. It handles it quite well. However, my Areca 1880 driver (arcmsr0) crashes when you reinsert the drive, but that's another topic, and an issue for Areca tech support.
>
> Back to the point:
>
> Solaris runs a separate process called the Fault Management Daemon (fmd) that appears to handle this logic. This means the hot-spare handling isn't really inside the ZFS code, and FreeBSD would need something similar, hopefully less kludgy than a user script.
>
> I wonder if anyone has been eyeing the fma code in the cddl with a thought for porting it. It looks to be a really neat bit of code. I'm still quite new to it, having only been working with Solaris for the last few months.
>
> Here are two links with a bit of info on the Solaris daemon:
>
> http://www.princeton.edu/~unix/Solaris/troubleshoot/fm.html
> http://hub.opensolaris.org/bin/view/Community+Group+fm/
>
>
> Here's my log of the event in Solaris 11 Express:
>
> Jan 12 21:28:47 solaris fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
> Jan 12 21:28:47 solaris EVENT-TIME: Wed Jan 12 21:28:47 UTC 2011
> Jan 12 21:28:47 solaris PLATFORM: PowerEdge-T710, CSN: 39SLQN1, HOSTNAME: solaris
> Jan 12 21:28:47 solaris SOURCE: zfs-diagnosis, REV: 1.0
> Jan 12 21:28:47 solaris EVENT-ID: ccfa7a23-838b-ebc8-decf-c2607afb390d
> Jan 12 21:28:47 solaris DESC: The number of I/O errors associated with a ZFS device exceeded
> Jan 12 21:28:47 solaris              acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
> Jan 12 21:28:47 solaris AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
> Jan 12 21:28:47 solaris              will be made to activate a hot spare if available.
> Jan 12 21:28:47 solaris IMPACT: Fault tolerance of the pool may be compromised.
> Jan 12 21:28:47 solaris REC-ACTION: Run 'zpool status -x' and replace the bad device.

After a cursory glance at their fault-management infrastructure, I
noticed that it also handles other kinds of faults, such as CPU and
memory problems, which might make a port painful or impractical. Would
the people with custom hot-spare scripts, or nothing automated at all,
be content if the sysutils/geomWatch program grew support for hot spares
in a future version? I became somewhat familiar with the userland ZFS
API when I added ZFS support to it.
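
For anyone wondering what the automation boils down to, here is a rough
Python sketch of the logic a user script would need. It is not geomWatch
code and it does not use the userland ZFS API; it only parses the text of
"zpool status" and proposes "zpool replace <pool> <failed> <spare>"
commands. The sample output, the pool name "tank" and the device names
da0..da2 are made up for illustration, and the function name and the set
of states treated as failed are my own assumptions. The "zpool status"
text layout is not a stable interface, so treat this as a sketch only.

```python
# Hypothetical sketch: pair failed leaf devices with available hot spares
# by parsing "zpool status" text. Names and sample data are illustrative.

# Device states treated as "failed" here (an assumption of this sketch).
BAD_STATES = {"FAULTED", "UNAVAIL", "REMOVED"}
# Rows with these name prefixes are vdev groupings, not leaf devices.
GROUP_PREFIXES = ("mirror", "raidz", "spare", "replacing")

# Made-up sample resembling "zpool status" output for a degraded mirror.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            da0     ONLINE       0     0     0
            da1     FAULTED      3    12     0  too many errors
        spares
          da2       AVAIL

errors: No known data errors
"""


def plan_spare_replacements(status_text):
    """Return a list of 'zpool replace' argument lists, one per failed
    device that can be paired with an available spare in the same pool."""
    pool = None
    in_config = False
    in_spares = False
    failed = {}   # pool name -> failed leaf devices, in listing order
    spares = {}   # pool name -> spares whose state is AVAIL

    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            pool = stripped.split(":", 1)[1].strip()
            in_config = in_spares = False
            failed.setdefault(pool, [])
            spares.setdefault(pool, [])
            continue
        if stripped.startswith("config:"):
            in_config = True
            continue
        if stripped.startswith("errors:"):
            in_config = in_spares = False
            continue
        if not in_config or not stripped or pool is None:
            continue

        fields = stripped.split()
        name = fields[0]
        if name == "NAME":
            continue                      # column header row
        if len(fields) == 1:
            # Section headings such as "spares", "logs" or "cache".
            in_spares = (name == "spares")
            continue

        state = fields[1]
        if in_spares:
            if state == "AVAIL":
                spares[pool].append(name)
        elif (state in BAD_STATES and name != pool
              and not name.startswith(GROUP_PREFIXES)):
            failed[pool].append(name)

    commands = []
    for p, devices in failed.items():
        # zip() stops when either failed devices or spares run out.
        for dev, spare in zip(devices, spares[p]):
            commands.append(["zpool", "replace", p, dev, spare])
    return commands


if __name__ == "__main__":
    # Only print the proposed commands; running them is left to the admin.
    for cmd in plan_spare_replacements(SAMPLE_STATUS):
        print(" ".join(cmd))
```

The sketch only prints what it would do. A real tool would also have to
cope with spares already in use, resilvers in progress and devices that
come back on their own, which is the kind of policy fmd carries on
Solaris.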

-Boris