HAST + ZFS + NFS + CARP

Ben RUBSON ben.rubson at gmail.com
Fri Jul 1 09:19:13 UTC 2016


> On 01 Jul 2016, at 10:47, Julien Cigar <julien at perdition.city> wrote:
> 
> On Thu, Jun 30, 2016 at 11:35:49PM +0200, Ben RUBSON wrote:
>> 
>>> On 30 Jun 2016, at 18:35, Julien Cigar <julien at perdition.city> wrote:
>>> 
>>> On Thu, Jun 30, 2016 at 05:42:04PM +0200, Ben RUBSON wrote:
>>>> 
>>>> 
>>>>> On 30 Jun 2016, at 17:37, Julien Cigar <julien at perdition.city> wrote:
>>>>> 
>>>>>> On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>>>>>> 
>>>>>>> On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter <jg at internetx.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> Am 30.06.2016 um 16:45 schrieb Julien Cigar:
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> I'm still in the process of setting up a redundant low-cost storage
>>>>>>>> for our (small, ~30 people) team here.
>>>>>>>> 
>>>>>>>> I have read quite a lot of articles/documentation/etc. and I plan to
>>>>>>>> use HAST with ZFS for the storage, CARP for the failover and the
>>>>>>>> "good old NFS" to mount the shares on the clients.
>>>>>>>> 
>>>>>>>> The hardware is 2xHP Proliant DL20 boxes with 2 dedicated disks for the
>>>>>>>> shared storage.
>>>>>>>> 
>>>>>>>> Assuming the following configuration:
>>>>>>>> - MASTER is the active node and BACKUP is the standby node.
>>>>>>>> - two disks in each machine: ada0 and ada1.
>>>>>>>> - two interfaces in each machine: em0 and em1
>>>>>>>> - em0 is the primary interface (with CARP setup)
>>>>>>>> - em1 is dedicated to the HAST traffic (crossover cable)
>>>>>>>> - FreeBSD is properly installed in each machine.
>>>>>>>> - a HAST resource "disk0" for ada0p2.
>>>>>>>> - a HAST resource "disk1" for ada1p2.
>>>>>>>> - a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is created
>>>>>>>> on MASTER
>>>>>>>> 
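Just for reference, a minimal sketch of the configuration described above could
look like the following; the hostnames (master/backup), the CARP vhid/password,
the shared IP and the 172.16.0.x addresses on em1 are my assumptions, not values
from this thread:

# /etc/rc.conf on MASTER (CARP shared IP on em0; BACKUP would add "advskew 100")
hastd_enable="YES"
ifconfig_em0_alias0="inet vhid 1 pass carppass alias 192.168.1.10/32"

# /etc/hast.conf, identical on both nodes
resource disk0 {
        on master {
                local /dev/ada0p2
                remote 172.16.0.2
        }
        on backup {
                local /dev/ada0p2
                remote 172.16.0.1
        }
}
resource disk1 {
        on master {
                local /dev/ada1p2
                remote 172.16.0.2
        }
        on backup {
                local /dev/ada1p2
                remote 172.16.0.1
        }
}
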
>>>>>>>> A couple of questions I am still wondering:
>>>>>>>> - If a disk dies on the MASTER I guess that zpool will not see it and
>>>>>>>> will transparently use the one on BACKUP through the HAST resource...
>>>>>>> 
>>>>>>> that's right, as long as writes on $anything have been successful HAST
>>>>>>> is happy and won't start whining
>>>>>>> 
>>>>>>>> is it a problem? 
>>>>>>> 
>>>>>>> imho yes, at least from a management point of view
>>>>>>> 
>>>>>>>> could this lead to some corruption?
>>>>>>> 
>>>>>>> probably, I never heard of anyone who has used that in production for
>>>>>>> a long time
>>>>>>> 
>>>>>>>> At this stage common sense would be to replace the disk quickly, but
>>>>>>>> imagine the worst case scenario where ada1 on MASTER dies: zpool will
>>>>>>>> not see it and will transparently use the one from the BACKUP node
>>>>>>>> (through the "disk1" HAST resource); later ada0 on MASTER dies, zpool
>>>>>>>> will not see it and will transparently use the one from the BACKUP
>>>>>>>> node (through the "disk0" HAST resource). At this point the two disks
>>>>>>>> on MASTER are broken but the pool is still considered healthy... What
>>>>>>>> if after that we unplug the em0 network cable on BACKUP? Storage is
>>>>>>>> down...
>>>>>>>> - Under heavy I/O the MASTER box suddenly dies (for some reason);
>>>>>>>> thanks to CARP the BACKUP node will switch from standby -> active and
>>>>>>>> execute the failover script which does a "hastctl role primary" for
>>>>>>>> the resources and a zpool import. I wondered if there are any
>>>>>>>> situations where the pool couldn't be imported (= data corruption)?
>>>>>>>> For example, what if the pool hasn't been exported on the MASTER
>>>>>>>> before it dies?
>>>>>>>> - Is it a problem if the NFS daemons are started at boot on the standby
>>>>>>>> node, or should they only be started in the failover script? What
>>>>>>>> about stale files and active connections on the clients?
>>>>>>> 
>>>>>>> sometimes stale mounts recover, sometimes not, and sometimes clients
>>>>>>> even need reboots
>>>>>>> 
>>>>>>>> - A catastrophic power failure occurs and MASTER and BACKUP are
>>>>>>>> suddenly powered down. Later the power returns; is it possible that
>>>>>>>> some problem occurs (split-brain scenario?) regarding the order in which the
>>>>>>> 
>>>>>>> sure, you need an exact procedure to recover
>>>>>>> 
>>>>>>>> two machines boot up?
>>>>>>> 
>>>>>>> best practice should be to keep everything down after boot
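
For the record, the usual HAST split-brain recovery procedure is basically to
pick the node whose copy will be discarded and re-initialize its resources; a
sketch, using the resource names above:

hastctl role init disk0
hastctl create disk0
hastctl role secondary disk0
# same for disk1, then let HAST resynchronize from the surviving primary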
>>>>>>> 
>>>>>>>> - Other things I have not thought of?
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> Julien
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> imho:
>>>>>>> 
>>>>>>> leave HAST where it is and go for ZFS replication. It will save your
>>>>>>> butt sooner or later if you avoid this fragile combination.
>>>>>> 
>>>>>> I was also replying, and finishing with this:
>>>>>> why don't you set your slave up as an iSCSI target and simply do ZFS mirroring?
>>>>> 
>>>>> Yes, that's another option, so a zpool with two mirrors (local +
>>>>> exported iSCSI)?
>>>> 
>>>> Yes, you would then have a real-time replication solution (like HAST), whereas ZFS send/receive is not one.
>>>> It depends on what you need :)
>>> 
>>> A real-time replication solution is more what I'm after, in fact... :)
>>> Do you have any resource which summarizes all the pros and cons of HAST
>>> vs iSCSI? I have found a lot of articles on ZFS + HAST but not that much
>>> on ZFS + iSCSI...
>> 
>> # No resources, but some ideas:
>> 
>> - ZFS likes to see all the details of its underlying disks, which is possible with local disks (of course) and with iSCSI disks, but not with HAST.
>> - The iSCSI solution is simpler: you only have ZFS to manage, and the replication is done by ZFS itself, not by an additional stack (a minimal target/initiator sketch follows below).
>> - HAST does not seem to be really maintained (I may be wrong), at least compared to DRBD, which HAST seems to be inspired by.
>> - You do not have to cross your fingers when you promote your slave to master ("will ZFS be happy with my HAST-replicated disks?"): ZFS mirrored the data itself, you only have to import [-f].
>> 
>> - (Auto)reconnection of iSCSI may not be as simple as with HAST; iSCSI could require more administration after a disconnection. But this can easily be done by a script.
>> 
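A minimal sketch of such an iSCSI export, to give an idea; the IQNs, the
172.16.0.x addresses and the choice of exporting raw partitions rather than
zvols are my assumptions:

# /etc/ctl.conf on the SLAVE (with ctld_enable="YES" in rc.conf)
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 172.16.0.2
}
target iqn.2016-07.local.storage:disk0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada0p2
        }
}
target iqn.2016-07.local.storage:disk1 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada1p2
        }
}

# on the MASTER (with iscsid_enable="YES" in rc.conf)
iscsictl -A -p 172.16.0.2 -t iqn.2016-07.local.storage:disk0
iscsictl -A -p 172.16.0.2 -t iqn.2016-07.local.storage:disk1
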
>> # Some advice based on my findings (I'm finishing my tests of such a solution):
>> 
>> Write performance will suffer from network latency, but as long as your 2 nodes are in the same room, that should be OK.
>> If you are over a long-distance link, you may add several ms to each write IO, which, depending on the use case, may be a problem; ZFS may also become unresponsive.
>> Max throughput is also more difficult to achieve over a high-latency link.
>> 
>> You will have to choose network cards depending on the number of disks and their throughput.
>> For example, if you need to resilver a SATA disk (180MB/s), then a simple 1Gb interface (~120MB/s) will be a serious bottleneck.
>> Think about scrub too.
>> 
>> You will probably have to perform some network tuning (TCP window size, jumbo frames...) to reach your max bandwidth.
>> Trying to saturate the network link with (for example) iPerf before dealing with iSCSI seems to be a good idea.
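
For example (assuming em1 carries the replication traffic and supports jumbo frames):

# jumbo frames on the replication link, on both nodes
ifconfig em1 mtu 9000

# raw TCP throughput, before putting iSCSI on top
iperf -s                         # on the slave
iperf -c 172.16.0.2 -t 30 -P 4   # on the master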
>> 
>> Here are some interesting sysctls so that ZFS will not hang (too long) in case of an unreachable iSCSI disk:
>> kern.iscsi.ping_timeout=5
>> kern.iscsi.iscsid_timeout=5
>> kern.iscsi.login_timeout=5
>> kern.iscsi.fail_on_disconnection=1
>> (adjust the 5 seconds depending on your needs / on your network quality).
>> 
>> Take care when you (auto)replace disks: you may replace an iSCSI disk with a local disk, which of course would work but would be wrong in terms of master/slave redundancy.
>> Use nice labels on your disks so that, if you have a lot of disks in your pool, you quickly know which one is local and which one is remote.
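
One way to do this (the label names are just an illustration) is glabel(8),
whose metadata lives on the provider itself, so the label should also show up
on the master once the iSCSI disk is attached:

# on the MASTER
glabel label local-disk0 /dev/ada0p2
glabel label local-disk1 /dev/ada1p2
# on the SLAVE, before exporting the partitions through iSCSI
glabel label remote-disk0 /dev/ada0p2
glabel label remote-disk1 /dev/ada1p2
# then build the pool from /dev/label/local-* and /dev/label/remote-*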
>> 
>> # send/receive pro(s):
>> 
>> In terms of data safety, one of the benefits of ZFS send/receive is that you have a totally different target pool, which can be interesting if you ever have a disaster with your primary pool.
>> As a 3rd-node solution? On another site? (send/receive does not suffer from latency the way iSCSI would)
> 
> Thank you very much for all the advice, it is much appreciated! 
> 
> I'll definitely go with iSCSI (with which I don't have that much 
> experience) over HAST.
> 
> Maybe a stupid question, but assuming that on the MASTER ada{0,1} are the
> local disks and da{0,1} are the iSCSI disks exported from the SLAVE, would
> you go with:
> 
> $> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 mirror /dev/da0
> /dev/da1

No, if you lose the connection with the slave node, your pool will go offline!

> or rather:
> 
> $> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1
> /dev/da1

Yes, each master disk is mirrored with a slave disk.
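
So the resulting layout would be something like (illustrative, with the device
names from your example):

        storage
          mirror-0
            ada0s1   <- local disk on MASTER
            da0      <- iSCSI disk exported by the SLAVE
          mirror-1
            ada1s1   <- local disk on MASTER
            da1      <- iSCSI disk exported by the SLAVE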

> I guess the former is better, but it's just to be sure... (or maybe it's
> better to iSCSI export a ZVOL from the SLAVE?)
> 
> Correct me if I'm wrong, but from a safety point of view this setup is
> also the safest, as you'll get the equivalent of HAST's "fullsync" mode
> (but it's also the slowest), so I can be 99.99% confident that the
> pool on the SLAVE will never be corrupted, even in the case where the
> MASTER suddenly dies (power outage, etc.), and that a zpool import -f
> storage will always work?

The pool on the slave is the same as the pool on the master, as it uses the same disks :)
Only the physical host will change.
So yes, you can be confident.
There is still the case where any ZFS pool could be totally damaged (due to a bug for example).
It "should" not happen, but you never know :)
This is why I was talking about a third node / second pool made from a delayed send/receive.
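
A rough sketch of such a delayed replication, run from cron for example (pool,
dataset and host names are hypothetical):

# first run: full replication to the third node
zfs snapshot -r storage@repl-1
zfs send -R storage@repl-1 | ssh third-node zfs receive -uF backup/storage

# subsequent runs: incremental
zfs snapshot -r storage@repl-2
zfs send -R -i storage@repl-1 storage@repl-2 | ssh third-node zfs receive -uF backup/storage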

> One last thing: this "storage" pool will be exported through NFS to the
> clients, and when a failover occurs they should, in theory, not notice
> it. I know that it's pretty hypothetical, but I wondered if pfsync could
> play a role in this area (active connections)?

There will certainly be some small timeouts due to the failover delay.
You should run some tests to analyze NFS behaviour depending on the failover delay.
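
As an illustration only (the script name, vhid, pool name and the NFS services
to restart are assumptions): the CARP state change can be caught with devd,
which then runs the failover logic:

# /etc/devd.conf snippet on both nodes: react to vhid 1 on em0 becoming MASTER
notify 30 {
        match "system"    "CARP";
        match "subsystem" "1@em0";
        match "type"      "MASTER";
        action "/usr/local/sbin/failover_master.sh";
};

# failover_master.sh (sketch)
#!/bin/sh
# the previous master probably died without exporting, hence -f
zpool import -f storage
# start serving NFS from this node
service mountd onestart
service nfsd onestart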

Good question regarding pfsync, I'm not so familiar with it :)



Of course, do a thorough POC before putting this into production.
Don't forget to test scrub, resilver, power failure, network failure...

And perhaps others will have additional comments / ideas / reservations on this topic?



> Thanks!
> Julien
> 
>> 
>>>>>> ZFS would then know as soon as a disk is failing.
>>>>>> And if the master fails, you only have to import (-f certainly, in case of a master power failure) on the slave.
>>>>>> 
>>>>>> Ben
>> _______________________________________________
>> freebsd-fs at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
> 
> -- 
> Julien Cigar
> Belgian Biodiversity Platform (http://www.biodiversity.be)
> PGP fingerprint: EEF9 F697 4B68 D275 7B11  6A25 B2BB 3710 A204 23C0
> No trees were killed in the creation of this message.
> However, many electrons were terribly inconvenienced.


