HAST + ZFS + NFS + CARP

krad kraduk at gmail.com
Thu Aug 18 10:38:28 UTC 2016


"new day, new things learned :)" job done for today then, it must be beer o
clock?

On 18 August 2016 at 09:02, InterNetX - Juergen Gotteswinter
<juergen.gotteswinter at internetx.com> wrote:

> new day, new things learned :)
>
> thanks!
>
> but as said, zrep does its own locking via zfs properties, so even this
> is fine:
>
>         while true; do zrep sync all; done
>
>
> see
>
> http://www.bolthole.com/solaris/zrep/
>
> the properties look like this
>
> tank/vmail  redundant_metadata    all                    default
> tank/vmail  zrep:savecount        5                      local
> tank/vmail  zrep:lock-time        20160620101703         local
> tank/vmail  zrep:master           yes                    local
> tank/vmail  zrep:src-fs           tank/vmail             local
> tank/vmail  zrep:dest-host        stor1                local
> tank/vmail  zrep:src-host         stor2                local
> tank/vmail  zrep:dest-fs          tank/vmail             local
> tank/vmail  zrep:lock-pid         10887                  local
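>
> (for reference, a listing like that can be pulled with plain zfs get,
> something along the lines of
>
>         zfs get all tank/vmail | grep zrep
>
> assuming the stock output format)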
>
>
> it also takes care of the replication partner: the replicated datasets
> are read-only until you tell zrep "go go go, become master"
>
> Simple usage summary:
> zrep (init|-i) ZFS/fs remotehost remoteZFSpool/fs
> zrep (sync|-S) [-q seconds] ZFS/fs
> zrep (sync|-S) [-q seconds] all
> zrep (sync|-S) ZFS/fs at snapshot    -- temporary retroactive sync
> zrep (status|-s) [-v] [(-a|ZFS/fs)]
> zrep refresh ZFS/fs               -- pull version of sync
> zrep (list|-l) [-Lv]
> zrep (expire|-e) [-L] (ZFS/fs ...)|(all)|()
> zrep (changeconfig|-C) [-f] ZFS/fs remotehost remoteZFSpool/fs
> zrep (changeconfig|-C) [-f] [-d] ZFS/fs srchost srcZFSpool/fs
> zrep failover [-L] ZFS/fs
> zrep takeover [-L] ZFS/fs
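>
> a hypothetical round trip, reusing the names from the property listing above:
>
>         zrep init tank/vmail stor1 tank/vmail    # one-time setup on the master
>         zrep sync tank/vmail                     # snapshot + incremental send
>         zrep status -v tank/vmail                # show when the last sync ran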
>
>
> zrep failover pool/ds -> the master sets the pool read-only, connects to the
> slave, and sets the pool on the slave rw
>
> should be easy to combine with carp/devd, but this is the land of voodoo
> automagic again, which I don't trust that much.
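>
> if one did wire it up, the devd side might look roughly like this (untested
> sketch; vhid, interface and the zrep path are placeholders):
>
>         # react to this box becoming CARP MASTER for vhid 1 on igb0
>         notify 0 {
>                 match "system"          "CARP";
>                 match "subsystem"       "1@igb0";
>                 match "type"            "MASTER";
>                 action "/usr/local/bin/zrep takeover tank/vmail";
>         };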
>
>
> On 18.08.2016 at 09:40, Ben RUBSON wrote:
> > Yep, this is better:
> >
> > if mkdir <lockdir>
> > then
> >       do_your_job
> >       rm -rf <lockdir>
> > fi
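> >
> > or, as a sketch along the same lines, with a trap so the lock is released
> > even if do_your_job fails:
> >
> > if mkdir <lockdir>
> > then
> >       trap 'rmdir <lockdir>' EXIT
> >       do_your_job
> > fi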
> >
> >
> >
> >> On 18 Aug 2016, at 09:38, InterNetX - Juergen Gotteswinter
> >> <juergen.gotteswinter at internetx.com> wrote:
> >>
> >> uhm, I didn't really investigate whether it is or not. add a "sync" after
> >> that? or replace it?
> >>
> >> but anyway, thanks for the hint. will dig into this!
> >>
> >> On 18.08.2016 at 09:36, krad wrote:
> >>> I didn't think touch was atomic, mkdir is though
> >>>
> >>> On 18 August 2016 at 08:32, InterNetX - Juergen Gotteswinter
> >>> <juergen.gotteswinter at internetx.com> wrote:
> >>>
> >>>
> >>>
> >>>    On 17.08.2016 at 20:03, Linda Kateley wrote:
> >>>> I just do consulting so I don't always get to see the end of the
> >>>> project. Although we are starting to do more ongoing support so we can
> >>>> see the progress..
> >>>>
> >>>> I have worked with some of the guys from high-availability.com for maybe
> >>>> 20 years. RSF-1 is the cluster that is bundled with Nexenta. It works
> >>>> beautifully with omni/illumos. The one customer I have running it in
> >>>> prod is an ISP in South America running OpenStack and ZFS on FreeBSD as
> >>>> iSCSI. Big boxes, 90+ drives per frame. If someone would like to try it,
> >>>> I have some contacts there. Ping me offlist.
> >>>
> >>>    no offense, but it sounds a bit like marketing.
> >>>
> >>>    here: we have been running a Nexenta HA setup for several years, with
> >>>    one catastrophic failure due to split brain
> >>>
> >>>>
> >>>> You do risk losing data if you batch zfs send. It is very hard to run
> >>>> that in real time.
> >>>
> >>>    depends on how much data changes aka delta size
> >>>
> >>>
> >>>> You have to take the snap, then send the snap. Most
> >>>> people run it in cron; even if it's not in cron, you would want one to
> >>>> finish before you start the next.
> >>>
> >>>    that's the reason why lock files were invented; tools like zrep handle
> >>>    that themselves via additional zfs properties
> >>>
> >>>    or, if one does not trust a single layer
> >>>
> >>>    -- snip --
> >>>    #!/bin/sh
> >>>    # skip this run if a previous replication is still flagged as running
> >>>    if [ ! -f /var/run/replic ] ; then
> >>>            touch /var/run/replic
> >>>            /blah/path/zrep sync all >> /var/log/zfsrepli.log
> >>>            rm -f /var/run/replic
> >>>    fi
> >>>    -- snip --
> >>>
> >>>    something like this, simple
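> >>>
> >>>    or lean on lockf(1) from base instead of a hand-rolled lock file,
> >>>    along these lines (same hypothetical paths as above):
> >>>
> >>>    -- snip --
> >>>    #!/bin/sh
> >>>    # -t 0: give up immediately if a previous run still holds the lock
> >>>    lockf -t 0 /var/run/replic.lock \
> >>>            /blah/path/zrep sync all >> /var/log/zfsrepli.log 2>&1
> >>>    -- snip --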
> >>>
> >>>> If you lose the sending host before
> >>>> the receive is complete, you won't have a full copy.
> >>>
> >>>    if rsf fails and you end up in split brain, you lose way more. been
> >>>    there, seen that.
> >>>
> >>>> With zfs though you
> >>>> will probably still have the data on the sending host, however long it
> >>>> takes to bring it back up. RSF-1 runs in the zfs stack and sends the
> >>>> writes to the second system. It's kind of pricey, but actually much less
> >>>> expensive than commercial alternatives.
> >>>>
> >>>> Anytime you run anything sync it adds latency but makes things safer..
> >>>
> >>>    not surprising, it all depends on the use case
> >>>
> >>>> There is also a cool tool I like, called Zerto, for VMware, that sits in
> >>>> the hypervisor and sends a sync copy of a write locally and then an
> >>>> async one remotely. It's pretty cool. Although I haven't run it myself, I
> >>>> have a bunch of customers running it. I believe it works with Proxmox too.
> >>>>
> >>>> Most people I run into (these days) don't mind losing 5 or even 30
> >>>> minutes of data. Small shops.
> >>>
> >>>    you talk about minutes; what delta size are we talking about here? why
> >>>    not use zrep in a loop, for example
> >>>
> >>>> They usually have a copy somewhere else.
> >>>> Or the cost of 5-30 minutes isn't that great. I used to work as a
> >>>> datacenter architect for Sun/Oracle with only Fortune 500. There, losing
> >>>> 1 sec could put large companies out of business. I worked with banks and
> >>>> exchanges.
> >>>
> >>>    again, use case. I bet 99% on this list are not operating Fortune 500
> >>>    bank filers
> >>>
> >>>    They couldn't ever lose a single transaction. Most people
> >>>> nowadays do the replication/availability in the application though and
> >>>> don't care about underlying hardware, especially disk.
> >>>>
> >>>>
> >>>> On 8/17/16 11:55 AM, Chris Watson wrote:
> >>>>> Of course, if you are willing to accept some amount of data loss that
> >>>>> opens up a lot more options. :)
> >>>>>
> >>>>> Some may find that acceptable though. Like turning off fsync with
> >>>>> PostgreSQL to get much higher throughput. As long as you are
> >>>>> made *very* aware of the risks.
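> >>>>>
> >>>>> The knobs in question (a sketch, not a recommendation):
> >>>>>
> >>>>> fsync = off                # postgresql.conf: fast, but a crash can corrupt data
> >>>>> synchronous_commit = off   # milder trade-off: only recent commits are at risk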
> >>>>>
> >>>>> It's good to have input in this thread from one with more experience
> >>>>> with RSF-1 than the rest of us. You confirm what others have said
> >>>>> about RSF-1, that it's stable and works well. What were you deploying
> >>>>> it on?
> >>>>>
> >>>>> Chris
> >>>>>
> >>>>> Sent from my iPhone 5
> >>>>>
> >>>>> On Aug 17, 2016, at 11:18 AM, Linda Kateley <lkateley at kateley.com> wrote:
> >>>>>
> >>>>>> The question I always ask, as an architect, is "can you lose 1 minute's
> >>>>>> worth of data?" If you can, then batched replication is perfect. If
> >>>>>> you can't, then HA. Every place I have positioned it, rsf-1 has
> >>>>>> worked extremely well. If I remember right, it works at the DMU. I
> >>>>>> would suggest trying it. They have been trying to have a full FreeBSD
> >>>>>> solution; I have several customers running it well.
> >>>>>>
> >>>>>> linda
> >>>>>>
> >>>>>>
> >>>>>> On 8/17/16 4:52 AM, Julien Cigar wrote:
> >>>>>>> On Wed, Aug 17, 2016 at 11:05:46AM +0200, InterNetX - Juergen
> >>>>>>> Gotteswinter wrote:
> >>>>>>>>
> >>>>>>>> On 17.08.2016 at 10:54, Julien Cigar wrote:
> >>>>>>>>> On Wed, Aug 17, 2016 at 09:25:30AM +0200, InterNetX - Juergen
> >>>>>>>>> Gotteswinter wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 11.08.2016 at 11:24, Borja Marcos wrote:
> >>>>>>>>>>>> On 11 Aug 2016, at 11:10, Julien Cigar <julien at perdition.city> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> As I said in a previous post I tested the zfs send/receive
> >>>>>>>>>>>> approach (with
> >>>>>>>>>>>> zrep) and it works (more or less) perfectly.. so I concur with
> >>>>>>>>>>>> all that you
> >>>>>>>>>>>> said, especially about off-site replication and synchronous
> >>>>>>>>>>>> replication.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at the
> >>>>>>>>>>>> moment,
> >>>>>>>>>>>> I'm in the early tests, haven't done any heavy writes yet, but
> >>>>>>>>>>>> ATM it
> >>>>>>>>>>>> works as expected, I haven't managed to corrupt the zpool.
> >>>>>>>>>>> I must be too old school, but I don’t quite like the idea of
> >>>>>>>>>>> using an essentially unreliable transport
> >>>>>>>>>>> (Ethernet) for low-level filesystem operations.
> >>>>>>>>>>>
> >>>>>>>>>>> In case something went wrong, that approach could risk
> >>>>>>>>>>> corrupting a pool. Although, frankly,
> >>>>>>>>>>> ZFS is extremely resilient. One of mine even survived a SAS HBA
> >>>>>>>>>>> problem that caused some
> >>>>>>>>>>> silent corruption.
> >>>>>>>>>> try dual split import :D I mean, zpool import -f on 2 machines
> >>>>>>>>>> hooked up
> >>>>>>>>>> to the same disk chassis.
> >>>>>>>>> Yes this is the first thing on the list to avoid .. :)
> >>>>>>>>>
> >>>>>>>>> I'm still busy testing the whole setup here, including the
> >>>>>>>>> MASTER -> BACKUP failover script (CARP), but I think you can prevent
> >>>>>>>>> that thanks to:
> >>>>>>>>>
> >>>>>>>>> - As long as ctld is running on the BACKUP the disks are locked
> >>>>>>>>> and you can't import the pool (even with -f), for ex (filer2 is the
> >>>>>>>>> BACKUP):
> >>>>>>>>>
> >>>>>>>>> https://gist.github.com/silenius/f9536e081d473ba4fddd50f59c56b58f
> >>>>>>>>>
> >>>>>>>>> - The shared pool should not be mounted at boot, and you should ensure
> >>>>>>>>> that the failover script is not executed during boot time too: this is
> >>>>>>>>> to handle the case wherein both machines turn off and/or re-ignite at
> >>>>>>>>> the same time. Indeed, the CARP interface can "flip" its status if both
> >>>>>>>>> machines are powered on at the same time, for ex:
> >>>>>>>>>
> >>>>>>>>> https://gist.github.com/silenius/344c3e998a1889f988fdfc3ceba57aaf
> >>>>>>>>>
> >>>>>>>>> and you will have a split-brain scenario
> >>>>>>>>>
> >>>>>>>>> - Sometimes you'll need to reboot the MASTER for some $reasons
> >>>>>>>>> (freebsd-update, etc) and the MASTER -> BACKUP switch should not
> >>>>>>>>> happen, this can be handled with a trigger file or something like
> >>>>>>>>> that
> >>>>>>>>>
> >>>>>>>>> - I still have to check if the order is OK, but I think that as long
> >>>>>>>>> as you shut down the replication interface and adapt the advskew
> >>>>>>>>> (including the config file) of the CARP interface before the
> >>>>>>>>> zpool import -f in the failover script, you can be relatively confident
> >>>>>>>>> that nothing will be written on the iSCSI targets (see the rough
> >>>>>>>>> sketch after this list)
> >>>>>>>>>
> >>>>>>>>> - A zpool scrub should be run at regular intervals
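> >>>>>>>>>
> >>>>>>>>> Roughly the ordering I mean (interface and pool names are placeholders,
> >>>>>>>>> the real script is in the gist below):
> >>>>>>>>>
> >>>>>>>>> # shut down the replication interface first
> >>>>>>>>> ifconfig igb1 down
> >>>>>>>>> # stay CARP MASTER: lower the advskew on the vhid
> >>>>>>>>> # (and persist the change in the config file as well)
> >>>>>>>>> ifconfig igb0 vhid 1 advskew 0
> >>>>>>>>> # only then force-import the shared pool
> >>>>>>>>> zpool import -f tank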
> >>>>>>>>>
> >>>>>>>>> This is my MASTER -> BACKUP CARP script ATM
> >>>>>>>>>
> >>>>>>>>> https://gist.github.com/silenius/7f6ee8030eb6b923affb655a259bfef7
> >>>>>>>>>
> >>>>>>>>> Julien
> >>>>>>>>>
> >>>>>>>> 100€ question without looking at that script in detail: yes, from a
> >>>>>>>> first
> >>>>>>>> view it's super simple, but why are solutions like rsf-1 so much more
> >>>>>>>> powerful / feature-rich? There's a reason for that, which is that they
> >>>>>>>> try to
> >>>>>>>> cover every possible situation (which makes more than sense for this).
> >>>>>>> I've never used "rsf-1" so I can't say much more about it, but
> >>>    I have
> >>>>>>> no doubts about it's ability to handle "complex situations", where
> >>>>>>> multiple nodes / networks are involved.
> >>>>>>>
> >>>>>>>> That script works for sure, within very limited cases imho
> >>>>>>>>
> >>>>>>>>>> kaboom, really ugly kaboom. that's what is very likely to happen
> >>>>>>>>>> sooner
> >>>>>>>>>> or later, especially when it comes to homegrown automation solutions.
> >>>>>>>>>> even the commercial offerings, where much more time/work goes into
> >>>>>>>>>> such
> >>>>>>>>>> solutions, fail on a regular basis
> >>>>>>>>>>
> >>>>>>>>>>> The advantage of ZFS send/receive of datasets is, however, that
> >>>>>>>>>>> you can consider it
> >>>>>>>>>>> essentially atomic. A transport corruption should not cause
> >>>>>>>>>>> trouble (apart from a failed
> >>>>>>>>>>> "zfs receive") and with snapshot retention you can even roll
> >>>>>>>>>>> back. You can’t roll back
> >>>>>>>>>>> zpool replications :)
> >>>>>>>>>>>
> >>>>>>>>>>> ZFS receive does a lot of sanity checks as well. As long as your
> >>>>>>>>>>> zfs receive doesn’t involve a rollback
> >>>>>>>>>>> to the latest snapshot, it won’t destroy anything by mistake.
> >>>>>>>>>>> Just make sure that your replica datasets
> >>>>>>>>>>> aren’t mounted and zfs receive won’t complain.
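> >>>>>>>>>>>
> >>>>>>>>>>> A minimal sketch of that pattern (names are made up; this is roughly
> >>>>>>>>>>> what zrep automates):
> >>>>>>>>>>>
> >>>>>>>>>>> zfs snapshot tank/vmail@repl_new
> >>>>>>>>>>> zfs send -i @repl_prev tank/vmail@repl_new | \
> >>>>>>>>>>>     ssh stor1 zfs receive -u tank/vmail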
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Borja.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>


More information about the freebsd-fs mailing list