zpool export/import on failover - The pool metadata is corrupted
mxb
mxb at alumni.chalmers.se
Thu Jun 27 11:26:17 UTC 2013
This solution is built on top of CARP.
One of the nodes is the preferred master (by advskew).
The trigger chain is CARP -> devd -> failover_script.sh (zpool import/export).
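
For reference, the CARP side is just the stock rc.conf setup, roughly like this (the vhid, password and address below are placeholders, not the real config):

# on the preferred master; the backup node uses the same vhid/pass
# with a higher advskew, e.g. advskew 100
cloned_interfaces="carp0"
ifconfig_carp0="vhid 1 advskew 0 pass carppass 192.168.10.10/24"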
On 27 jun 2013, at 11:43, Marcelo Araujo <araujobsdport at gmail.com> wrote:
> For this failover solution, did you create a heartbeat or something like that? How do you avoid split-brain?
>
> Best Regards.
>
>
> 2013/6/27 mxb <mxb at alumni.chalmers.se>
>
> A note for the archives.
>
> I have so far not experienced any problems with either the local (per head unit) or the external (on the disk enclosure) caches while importing
> and exporting my pool. The disks I use on both nodes are identical - same manufacturer, size, and model.
>
> da1,da2 - local
> da32,da33 - external
>
> Export/import is done WITHOUT removing/adding local disks.
>
> root at nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         jbod          ONLINE       0     0     0
>           raidz3-0    ONLINE       0     0     0
>             da10      ONLINE       0     0     0
>             da11      ONLINE       0     0     0
>             da12      ONLINE       0     0     0
>             da13      ONLINE       0     0     0
>             da14      ONLINE       0     0     0
>             da15      ONLINE       0     0     0
>             da16      ONLINE       0     0     0
>             da17      ONLINE       0     0     0
>             da18      ONLINE       0     0     0
>             da19      ONLINE       0     0     0
>         logs
>           mirror-1    ONLINE       0     0     0
>             da32s1    ONLINE       0     0     0
>             da33s1    ONLINE       0     0     0
>         cache
>           da32s2      ONLINE       0     0     0
>           da33s2      ONLINE       0     0     0
>           da1         ONLINE       0     0     0
>           da2         ONLINE       0     0     0
>
> On 25 jun 2013, at 21:22, mxb <mxb at alumni.chalmers.se> wrote:
>
> >
> > I think I've found the root of this issue.
> > It looks like "wiring down" the disks the same way on both nodes (as suggested) fixes it.
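> > For the archives: "wiring down" means pinning the CAM unit numbers in
> > /boot/device.hints so a given daN always maps to the same bus/target/lun
> > on both nodes. Roughly like this (the scbus and target numbers below are
> > placeholders, not my actual values):
> >
> > hint.scbus.2.at="mps0"
> > hint.da.10.at="scbus2"
> > hint.da.10.target="10"
> > hint.da.10.unit="0"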
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb <mxb at alumni.chalmers.se> wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful import/export from one node to another, I eventually got 'metadata corruption'.
> >> I previously had no problem with import/export while e.g. rebooting the master node (nfs1), but not THIS time.
> >> Metadata got corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root at nfs1 ~]# zpool import
> >>    pool: jbod
> >>      id: 7663925948774378610
> >>   state: FAULTED
> >>  status: The pool metadata is corrupted.
> >>  action: The pool cannot be imported due to damaged devices or data.
> >>     see: http://illumos.org/msg/ZFS-8000-72
> >>  config:
> >>
> >>         jbod          FAULTED  corrupted data
> >>           raidz3-0    ONLINE
> >>             da3       ONLINE
> >>             da4       ONLINE
> >>             da5       ONLINE
> >>             da6       ONLINE
> >>             da7       ONLINE
> >>             da8       ONLINE
> >>             da9       ONLINE
> >>             da10      ONLINE
> >>             da11      ONLINE
> >>             da12      ONLINE
> >>         cache
> >>           da13s2
> >>           da14s2
> >>         logs
> >>           mirror-1    ONLINE
> >>             da13s1    ONLINE
> >>             da14s1    ONLINE
> >> [root at nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>         Destroy and re-create the pool from
> >>         a backup source.
> >> [root at nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb <mxb at alumni.chalmers.se> wrote:
> >>
> >>>
> >>> Thanks to everyone who replied.
> >>> Removing the local L2ARC cache disks (da1, da2) indeed turned out to be the cure for my problem.
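> >>> For the record, the removal was simply:
> >>>
> >>> zpool remove jbod da1 da2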
> >>>
> >>> Next is to test add/remove around import/export, as Jeremy suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc at koitsu.org> wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle a lot of cases, but shifting the spotlight from zpool import/export to the script itself is not that
> >>>>> clever, as the script works most of the time.
> >>>>>
> >>>>> The question is WHY ZFS corrupts metadata when it should not. Sometimes.
> >>>>> I've also seen the pool go stale when importing/exporting it manually.
> >>>>>
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc at koitsu.org> wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, plus the script it triggers):
> >>>>>>>
> >>>>>>> root at nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>         match "system"          "IFNET";
> >>>>>>>         match "subsystem"       "carp0";
> >>>>>>>         match "type"            "LINK_UP";
> >>>>>>>         action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>>         match "system"          "IFNET";
> >>>>>>>         match "subsystem"       "carp0";
> >>>>>>>         match "type"            "LINK_DOWN";
> >>>>>>>         action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root at nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>     active)
> >>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>         sleep 10
> >>>>>>>         /sbin/zpool import -f jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     backup)
> >>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>         /sbin/zpool export jbod
> >>>>>>>         /etc/rc.d/mountd restart
> >>>>>>>         /etc/rc.d/nfsd restart
> >>>>>>>         ;;
> >>>>>>>     *)
> >>>>>>>         exit 0
> >>>>>>>         ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. Those machines are supposed to go into production.
> >>>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone. Hasty hasty... :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and export) always
> >>>>>> work/succeed; there is no exit code ($?) checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr go to
> >>>>>> when running a program/script under devd(8)? Does it effectively go
> >>>>>> to the bit bucket (/dev/null)? If so, you'd never know if the import or
> >>>>>> export actually succeeded or not (the export sounds more likely to be
> >>>>>> the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would fail
> >>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
> >>>>>> already blindly assuming everything will be fantastic. Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
> >>>>>> (/bin/sh) or not. If they don't, you won't be able to use things like
> >>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'. You
> >>>>>> would then need to implement the equivalent of logging within your
> >>>>>> zfs_switch.sh script.
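> >>>>>>
> >>>>>> A minimal sketch of that (the log path is just an example) would be to
> >>>>>> put something like this near the top of zfs_switch.sh:
> >>>>>>
> >>>>>> # capture everything the script prints from here on
> >>>>>> exec >> /var/log/zfs_switch.log 2>&1
> >>>>>>
> >>>>>> or route individual commands through syslog with logger(1):
> >>>>>>
> >>>>>> /sbin/zpool export jbod 2>&1 | logger -t zfs_switch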
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export). However there are risks involved -- userland
> >>>>>> applications which have an fd/fh open on a file which is stored on a
> >>>>>> filesystem that has now completely disappeared can sometimes crash
> >>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
> >>>>>> they're designed.
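> >>>>>>
> >>>>>> Before reaching for -f, it may be worth checking what still has files
> >>>>>> open under the pool, e.g. with fstat(1) (assuming the pool is mounted
> >>>>>> at /jbod):
> >>>>>>
> >>>>>> fstat -f /jbod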
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
> >>>>>> HA (high availability) and load balancing is not always possible.
> >>>>>> Real/true HA (especially with SANs) is often done very differently (now
> >>>>>> you know why it's often proprietary. :-) )
> >>>>
> >>>> Add error checking to your script. That's my first and foremost
> >>>> recommendation. It's not hard to do, really. :-)
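> >>>>
> >>>> An untested sketch of what I mean, for the backup branch (the active
> >>>> branch would get the same treatment):
> >>>>
> >>>> backup)
> >>>>         /sbin/zpool export jbod
> >>>>         if [ $? -ne 0 ]; then
> >>>>                 # do NOT carry on as if the failover worked
> >>>>                 echo "zpool export jbod FAILED" | mail -s ''$DATE': '$HOSTNAME' export FAILED' root
> >>>>                 exit 1
> >>>>         fi
> >>>>         /etc/rc.d/mountd restart
> >>>>         /etc/rc.d/nfsd restart
> >>>>         ;;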
> >>>>
> >>>> After you do that and still experience the issue (e.g. you see no actual
> >>>> errors/issues during the export/import phases), I recommend removing
> >>>> the "cache" devices which are "independent" on each system from the pool
> >>>> entirely. Quoting you (for readers, since I snipped it from my previous
> >>>> reply):
> >>>>
> >>>>>>> Note, that ZIL(mirrored) resides on external enclosure. Only L2ARC
> >>>>>>> is both local and external - da1,da2, da13s2, da14s2
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical systems)
> >>>> have actual disks which are not part of the "external enclosure". If
> >>>> that's the case -- those disks are always going to vary in their
> >>>> contents and metadata. Those are never going to be 100% identical all
> >>>> the time (is this not obvious?). I'm surprised your stuff has worked at
> >>>> all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> >>>> things, all the way down to the L2ARC. That's my understanding of it at
> >>>> least, meaning there must always be "some" kind of metadata that has to
> >>>> be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>> zpool remove jbod daX daY ...
> >>>> zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>> zpool import jbod
> >>>> zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each system
> >>>> (not on the "external enclosure").
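> >>>>
> >>>> Folded into your zfs_switch.sh, that might look roughly like this
> >>>> (untested; on nfs1 the independent disks appear to be da1/da2, but
> >>>> substitute whatever each head actually uses):
> >>>>
> >>>> backup)
> >>>>         # drop this node's local L2ARC devices before handing the pool over
> >>>>         /sbin/zpool remove jbod da1 da2 || exit 1
> >>>>         /sbin/zpool export jbod || exit 1
> >>>>         ;;
> >>>> active)
> >>>>         /sbin/zpool import jbod || exit 1
> >>>>         # re-attach this node's local L2ARC devices after the import
> >>>>         /sbin/zpool add jbod cache da1 da2
> >>>>         ;;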
> >>>>
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and for you to explain the physical wiring
> >>>> along with what device (e.g. daX) correlates with what exact thing on
> >>>> each system. (We right now have no knowledge of that, and your terse
> >>>> explanations imply we do -- we need to know more)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick jdc at koitsu.org |
> >>>> | UNIX Systems Administrator http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977. PGP 4BD6C0CB |
> >>>>
> >>>
> >>
> >
>
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>
>
>
> --
> Marcelo Araujo
> araujo at FreeBSD.org