iSCSI/ZFS strangeness

Fri Oct 30 13:07:16 UTC 2015

> On Oct 29, 2015, at 1:17 PM, Jan Bramkamp <crest at rlwinm.de> wrote:
> 
>> On 29/10/15 02:57, Michael W. Lucas wrote:
>> The initiators can both access the iSCSI-based pool--not
>> simultaneously, of course. But CARP, devd, and some shell scripting
>> should get me a highly available pool that can withstand the demise of
>> any one iSCSI server and any one initiator.
>> 
>> The hope is that the pool would continue to work even if an iSCSI host
>> shuts down. When the downed iSCSI host returns, the initiators should
>> log back in and the pool auto-resilver.
> 
> I would recommend against using CARP for this because CARP is prone to split-brain situations, and in this case they could destroy your whole storage pool. If the current head node fails, the replacement has to `zpool import -f` the pool, and in a split-brain situation both head nodes would continue writing to the iSCSI targets.
> 
> I would move the leader election to an external service like Consul, etcd or ZooKeeper. This is one case where the added complexity is worth it. If you can't run an external service for this (e.g. because it would exceed the scope of the chapter you're writing), please simplify the setup with more reliable hardware, good monitoring and manual failover for maintenance. CARP isn't designed to implement reliable (enough) master election for your storage cluster.
> 
> Adding iSCSI to your storage stack adds complexity and overhead. For setups which still fit inside a single rack, SAS (with geom_multipath) is normally faster and cheaper. On the other hand, you can't spread out SAS storage far enough to implement disaster tolerance should you really need it and it certainly is an setup.

I'll impart some wisdom here.

1) HA with two nodes is impossible to do right.  You need a third system to achieve quorum.

2) You can do SAS over optical these days. Perfect for having mirrored JBODs in different fire suppression zones of a datacenter.

3) I've seen a LOT of "cobbled together with shell script" HA rigs.  Most get disabled eventually, once people realize that they go split-brain in the edge cases and destroy the storage.  What we did was go passive/passive and then treat each such case as "how could we have avoided going passive/passive?". It took two years.

4) Leverage mav@'s ALUA support.  For block access this will make your life much easier.

5) Give me a call. I type slow and tend to leave things out, but would happily do one or more brain dump sessions.
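
To make points 1 and 3 concrete, here is a minimal sketch of the decision rule, not a description of any shipping HA implementation. It assumes "quorum" means a strict majority of all configured voters, and it reads "passive/passive" as "when in doubt, neither head imports the pool". The names has_quorum and decide_role, and the elected_leader flag (standing in for whatever an external election service would report), are hypothetical.

```python
def has_quorum(votes_seen: int, total_voters: int) -> bool:
    """Return True only if this node can see a strict majority of all voters.

    votes_seen counts this node itself plus every voter it can still reach.
    """
    return votes_seen > total_voters // 2


def decide_role(votes_seen: int, total_voters: int, elected_leader: bool) -> str:
    """Decide whether this head may import the pool ("active") or must not ("passive").

    Assumption: a node without quorum always goes passive, even if it was the
    leader a moment ago, so that a partition can never produce two writers.
    """
    if not has_quorum(votes_seen, total_voters):
        return "passive"
    return "active" if elected_leader else "passive"


if __name__ == "__main__":
    # Two voters, link between them lost: each sees only itself (1 of 2).
    # Neither has a majority, so both go passive. The same happens when the
    # peer has really died, which is why two nodes cannot fail over safely.
    print("2 voters, isolated node:", decide_role(1, 2, elected_leader=True))

    # Three voters, partition into 2 + 1: only the majority side may be active.
    print("3 voters, majority side:", decide_role(2, 3, elected_leader=True))
    print("3 voters, minority side:", decide_role(1, 3, elected_leader=True))
```

With two voters an isolated node cannot tell a dead peer from a dead link, so the only safe answer is passive on both sides. A third voter is what lets exactly one side of a partition keep a majority.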