zpool export/import on failover - The pool metadata is corrupted

mxb mxb at alumni.chalmers.se
Thu Jun 27 11:26:17 UTC 2013


This solution is built on top of CARP.
One of the nodes is (by way of advskew) the preferred master.

The trigger chain is CARP -> devd -> failover_script.sh (zfs import/export)
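
For reference, the CARP side could be set up along these lines in rc.conf (a minimal sketch only; the VHID, password and address are placeholders, and it assumes the legacy carp(4) pseudo-interface that the devd.conf further down refers to as carp0):

    # /etc/rc.conf on the preferred master (nfs1) -- sketch, values are placeholders
    cloned_interfaces="carp0"
    # advskew 0: this node wins the master election
    ifconfig_carp0="vhid 1 advskew 0 pass carppass 10.0.0.10/24"

    # /etc/rc.conf on the backup node (nfs2)
    cloned_interfaces="carp0"
    # higher advskew: stays BACKUP as long as the preferred master is alive
    ifconfig_carp0="vhid 1 advskew 100 pass carppass 10.0.0.10/24"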


On 27 jun 2013, at 11:43, Marcelo Araujo <araujobsdport at gmail.com> wrote:

> For this failover solution, did you create a heartbeat or something like that? How do you avoid split-brain?
> 
> Best Regards.
> 
> 
> 2013/6/27 mxb <mxb at alumni.chalmers.se>
> 
> A note for the archives.
> 
> I have so far not experienced any problems with either the local (per head unit) or the external (on the disk enclosure) caches while importing
> and exporting my pool. The disks I use on both nodes are identical - manufacturer, size, model.
> 
> da1,da2 - local
> da32,da33 - external
> 
> Export/import is done WITHOUT removing/adding local disks.
> 
> root at nfs1:/root # zpool status
>   pool: jbod
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         jbod        ONLINE       0     0     0
>           raidz3-0  ONLINE       0     0     0
>             da10    ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>             da12    ONLINE       0     0     0
>             da13    ONLINE       0     0     0
>             da14    ONLINE       0     0     0
>             da15    ONLINE       0     0     0
>             da16    ONLINE       0     0     0
>             da17    ONLINE       0     0     0
>             da18    ONLINE       0     0     0
>             da19    ONLINE       0     0     0
>         logs
>           mirror-1  ONLINE       0     0     0
>             da32s1  ONLINE       0     0     0
>             da33s1  ONLINE       0     0     0
>         cache
>           da32s2    ONLINE       0     0     0
>           da33s2    ONLINE       0     0     0
>           da1       ONLINE       0     0     0
>           da2       ONLINE       0     0     0
> 
> On 25 jun 2013, at 21:22, mxb <mxb at alumni.chalmers.se> wrote:
> 
> >
> > I think I've found the root of this issue.
> > Looks like "wiring down" disks the same way on both nodes (as suggested) fixes this issue.
> >
> > //mxb
> >
> > On 20 jun 2013, at 12:30, mxb <mxb at alumni.chalmers.se> wrote:
> >
> >>
> >> Well,
> >>
> >> I'm back to square one.
> >>
> >> After some uptime and successful import/export from one node to another, I eventually got 'metadata corruption'.
> >> I had no problem with import/export while, for example, rebooting the master node (nfs1), but not THIS time.
> >> The metadata got corrupted while rebooting the master node??
> >>
> >> Any ideas?
> >>
> >> [root at nfs1 ~]# zpool import
> >>  pool: jbod
> >>    id: 7663925948774378610
> >> state: FAULTED
> >> status: The pool metadata is corrupted.
> >> action: The pool cannot be imported due to damaged devices or data.
> >>  see: http://illumos.org/msg/ZFS-8000-72
> >> config:
> >>
> >>      jbod        FAULTED  corrupted data
> >>        raidz3-0  ONLINE
> >>          da3     ONLINE
> >>          da4     ONLINE
> >>          da5     ONLINE
> >>          da6     ONLINE
> >>          da7     ONLINE
> >>          da8     ONLINE
> >>          da9     ONLINE
> >>          da10    ONLINE
> >>          da11    ONLINE
> >>          da12    ONLINE
> >>      cache
> >>        da13s2
> >>        da14s2
> >>      logs
> >>        mirror-1  ONLINE
> >>          da13s1  ONLINE
> >>          da14s1  ONLINE
> >> [root at nfs1 ~]# zpool import jbod
> >> cannot import 'jbod': I/O error
> >>      Destroy and re-create the pool from
> >>      a backup source.
> >> [root at nfs1 ~]#
> >>
> >> On 11 jun 2013, at 10:46, mxb <mxb at alumni.chalmers.se> wrote:
> >>
> >>>
> >>> Thanks everyone whom replied.
> >>> Removing local L2ARC cache disks (da1,da2) indeed showed to be a cure to my problem.
> >>>
> >>> Next is to test with add/remove after import/export as Jeremy suggested.
> >>>
> >>> //mxb
> >>>
> >>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc at koitsu.org> wrote:
> >>>
> >>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
> >>>>>
> >>>>> Sure, the script is not perfect yet and does not handle a lot of things, but shifting the spotlight from zpool import/export to the script itself is not that
> >>>>> clever, as this works most of the time.
> >>>>>
> >>>>> The question is WHY ZFS corrupts metadata when it should not. Sometimes.
> >>>>> I've seen the zpool go stale when manually importing/exporting the pool.
> >>>>>
> >>>>>
> >>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc at koitsu.org> wrote:
> >>>>>
> >>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
> >>>>>>>
> >>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, and the script for bringing things up):
> >>>>>>>
> >>>>>>> root at nfs2:/root # cat /etc/devd.conf
> >>>>>>>
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>> match "system"          "IFNET";
> >>>>>>> match "subsystem"       "carp0";
> >>>>>>> match "type"            "LINK_UP";
> >>>>>>> action "/etc/zfs_switch.sh active";
> >>>>>>> };
> >>>>>>>
> >>>>>>> notify 30 {
> >>>>>>> match "system"          "IFNET";
> >>>>>>> match "subsystem"       "carp0";
> >>>>>>> match "type"            "LINK_DOWN";
> >>>>>>> action "/etc/zfs_switch.sh backup";
> >>>>>>> };
> >>>>>>>
> >>>>>>> root at nfs2:/root # cat /etc/zfs_switch.sh
> >>>>>>> #!/bin/sh
> >>>>>>>
> >>>>>>> DATE=`date +%Y%m%d`
> >>>>>>> HOSTNAME=`hostname`
> >>>>>>>
> >>>>>>> ZFS_POOL="jbod"
> >>>>>>>
> >>>>>>>
> >>>>>>> case $1 in
> >>>>>>>         active)
> >>>>>>>                 echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
> >>>>>>>                 sleep 10
> >>>>>>>                 /sbin/zpool import -f jbod
> >>>>>>>                 /etc/rc.d/mountd restart
> >>>>>>>                 /etc/rc.d/nfsd restart
> >>>>>>>                 ;;
> >>>>>>>         backup)
> >>>>>>>                 echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
> >>>>>>>                 /sbin/zpool export jbod
> >>>>>>>                 /etc/rc.d/mountd restart
> >>>>>>>                 /etc/rc.d/nfsd restart
> >>>>>>>                 ;;
> >>>>>>>         *)
> >>>>>>>                 exit 0
> >>>>>>>                 ;;
> >>>>>>> esac
> >>>>>>>
> >>>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. Those machines are supposed to go into prod.
> >>>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
> >>>>>>
> >>>>>> This script looks highly error-prone.  Hasty hasty...  :-)
> >>>>>>
> >>>>>> This script assumes that the "zpool" commands (import and export) always
> >>>>>> work/succeed; there is no exit code ($?) checking being used.
> >>>>>>
> >>>>>> Since this is run from within devd(8): where does stdout/stderr go to
> >>>>>> when running a program/script under devd(8)?  Does it effectively go
> >>>>>> to the bit bucket (/dev/null)?  If so, you'd never know if the import or
> >>>>>> export actually succeeded or not (the export sounds more likely to be
> >>>>>> the problem point).
> >>>>>>
> >>>>>> I imagine there would be some situations where the export would fail
> >>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
> >>>>>> already blindly assuming everything will be fantastic.  Surprise.
> >>>>>>
> >>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
> >>>>>> (/bin/sh) or not.  If they don't, you won't be able to use things like
> >>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'.  You
> >>>>>> would then need to implement the equivalent of logging within your
> >>>>>> zfs_switch.sh script.
> >>>>>>
> >>>>>> You may want to consider the -f flag to zpool import/export
> >>>>>> (particularly export).  However there are risks involved -- userland
> >>>>>> applications which have an fd/fh open on a file which is stored on a
> >>>>>> filesystem that has now completely disappeared can sometimes crash
> >>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
> >>>>>> they're designed.
> >>>>>>
> >>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
> >>>>>> HA (high availability) and load balancing is not always possible.
> >>>>>> Real/true HA (especially with SANs) is often done very differently (now
> >>>>>> you know why it's often proprietary.  :-) )
> >>>>
> >>>> Add error checking to your script.  That's my first and foremost
> >>>> recommendation.  It's not hard to do, really.  :-)
> >>>>
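
A minimal sketch of what an error-checked "active" branch of zfs_switch.sh could look like (it reuses DATE and HOSTNAME from the script quoted above; the log path /var/log/zfs_switch.log and the mail wording are only examples):

            active)
                    sleep 10
                    /sbin/zpool import -f jbod >> /var/log/zfs_switch.log 2>&1
                    rc=$?
                    if [ $rc -eq 0 ]; then
                            echo "import OK, switching to ACTIVE" | mail -s "$DATE: $HOSTNAME ACTIVE" root
                            /etc/rc.d/mountd restart
                            /etc/rc.d/nfsd restart
                    else
                            # do not restart NFS on top of a pool that is not there
                            echo "zpool import jbod failed, exit status $rc" | mail -s "$DATE: $HOSTNAME import FAILED" root
                            exit 1
                    fi
                    ;;

The "backup" branch would check the exit status of the export the same way before restarting mountd/nfsd.
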
> >>>> After you do that and still experience the issue (e.g. you see no actual
> >>>> errors/issues during the export/import phases), I recommend removing
> >>>> the "cache" devices which are "independent" on each system from the pool
> >>>> entirely.  Quoting you (for readers, since I snipped it from my previous
> >>>> reply):
> >>>>
> >>>>>>> Note that the ZIL (mirrored) resides on the external enclosure. Only the L2ARC
> >>>>>>> is both local and external - da1, da2, da13s2, da14s2
> >>>>
> >>>> I interpret this to mean the primary and backup nodes (physical systems)
> >>>> have actual disks which are not part of the "external enclosure".  If
> >>>> that's the case -- those disks are always going to vary in their
> >>>> contents and metadata.  Those are never going to be 100% identical all
> >>>> the time (is this not obvious?).  I'm surprised your stuff has worked at
> >>>> all using that model, honestly.
> >>>>
> >>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> >>>> things, all the way down to the L2ARC.  That's my understanding of it at
> >>>> least, meaning there must always be "some" kind of metadata that has to
> >>>> be kept/maintained there.
> >>>>
> >>>> Alternately you could try doing this:
> >>>>
> >>>> zpool remove jbod daX daY ...
> >>>> zpool export jbod
> >>>>
> >>>> Then on the other system:
> >>>>
> >>>> zpool import jbod
> >>>> zpool add jbod cache daX daY ...
> >>>>
> >>>> Where daX and daY are the disks which are independent to each system
> >>>> (not on the "external enclosure").
> >>>>
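
Folded into zfs_switch.sh, that suggestion might look roughly like this (a sketch only; da1/da2 stand in for whichever local cache disks each head actually uses, and the pool must be importable before the cache devices are re-added):

            backup)
                    # drop the node-local L2ARC devices, then hand the pool over
                    /sbin/zpool remove jbod da1 da2 && /sbin/zpool export jbod
                    ;;
            active)
                    # import first, then re-attach this node's own cache disks
                    /sbin/zpool import -f jbod && /sbin/zpool add jbod cache da1 da2
                    ;;
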
> >>>> Finally, it would also be useful/worthwhile if you would provide
> >>>> "dmesg" from both systems and for you to explain the physical wiring
> >>>> along with what device (e.g. daX) correlates with what exact thing on
> >>>> each system.  (Right now we have no knowledge of that, and your terse
> >>>> explanations imply we do -- we need to know more.)
> >>>>
> >>>> --
> >>>> | Jeremy Chadwick                                   jdc at koitsu.org |
> >>>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> >>>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
> >>>>
> >>>
> >>
> >
> 
> 
> 
> 
> -- 
> Marcelo Araujo
> araujo at FreeBSD.org


