HAST + ZFS: no action on drive failure

Timothy Smith tts at personalmis.com
Fri Jul 1 03:33:53 UTC 2011


First posting here, hopefully I'm doing it right =)

I also posted this to the FreeBSD forum, but I know some HAST folks monitor
this list regularly and not so much there, so...

Basically, I'm testing failure scenarios with HAST/ZFS. I have two nodes and
have scripted up a bunch of checks and failover actions between them. It's
looking good so far, though more complex than I expected. It would be cool
to post it somewhere to get some pointers/critiques, but that's another
thing.
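For context, here's a stripped-down sketch of the kind of check/failover
logic I mean (the resource, pool, and peer names are just placeholders for
my setup, and the hastctl output parsing is approximate):

  #!/bin/sh
  # Rough failover check: if the peer is unreachable and we are still
  # secondary, promote the local HAST resource and import the pool.
  RES="ada6"    # HAST resource name (placeholder)
  POOL="tank"   # zpool name (placeholder)
  PEER="nas2"   # other node (placeholder)

  if ! ping -c 1 -t 2 "$PEER" > /dev/null 2>&1; then
      # Parsing hastctl status output is approximate and may need adjusting.
      if hastctl status "$RES" | grep -q "role: secondary"; then
          hastctl role primary "$RES"
          zpool import -f "$POOL"
      fi
  fi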

Anyway, now I'm just seeing what happens when a drive fails on the primary
node. Oddly/sadly, NOTHING!

HAST just keeps on ticking and doesn't change the state of the failed
drive, so the zpool has no clue the drive is offline. The
/dev/hast/<resource> device remains. hastd does log some errors to the
system log, like the ones below, but nothing more.

messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Unable to
flush activemap to disk: Device not configured.
messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Local request
failed (Device not configured): WRITE(4736512, 512).

So, I guess the question is, "Do I have to script a cron job to check for
these kinds of errors and then change the HAST resource to 'init' or
something to handle this?" Or is there some hastd configuration setting I'm
missing? What's the SOP for this?
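If scripting it is the answer, I was picturing something along these lines
(a rough sketch only; the resource name and the log pattern are guesses at
what would actually need matching):

  #!/bin/sh
  # Periodic check (e.g. from cron): if hastd has logged local I/O errors
  # for the resource, drop it to init so /dev/hast/<res> goes away and
  # ZFS can finally notice. Resource name and pattern are placeholders.
  RES="ada6"

  if tail -n 200 /var/log/messages | \
      grep -q "hastd.*\[$RES\] (primary) Local request failed"; then
      hastctl role init "$RES"
  fi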

On a related note: when the zpool in FreeBSD does finally notice that the
drive is missing (because I have manually changed the HAST resource to
INIT, so /dev/hast/<res> is gone), my raidz2 pool's hot spare doesn't
engage, even with "autoreplace=on". The zpool status of the degraded pool
seems to indicate that I should replace the failed drive manually. If that's
the case, it's not really a "hot spare". Does this mean the "FMA Agent"
referred to in the ZFS manual is not implemented in FreeBSD?
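Right now the only way I can get the spare in seems to be doing it by hand,
something like this (pool and device names are placeholders from my setup):

  # Manually attach the configured hot spare in place of the failed provider.
  zpool replace tank hast/ada6 da10
  zpool status tank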

thanks!

