zpool export/import on failover - The pool metadata is corrupted
mxb
mxb at alumni.chalmers.se
Tue Jun 11 08:46:28 UTC 2013
Thanks to everyone who replied.
Removing the local L2ARC cache disks (da1, da2) indeed turned out to be the cure for my problem.
Next is to test adding/removing them around export/import, as Jeremy suggested; a rough sketch of that sequence is below.
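
Roughly the sequence I intend to try, as an untested sketch (it assumes the takeover node's local cache disks are also named da1/da2, which may well differ per host):

# On the node handing the pool over: drop its node-local L2ARC vdevs
# first, so no per-host cache devices travel with the exported pool.
zpool remove jbod da1 da2
zpool export jbod

# On the node taking over: import the pool, then re-attach that node's
# own local cache disks (device names assumed here, adjust per host).
zpool import jbod
zpool add jbod cache da1 da2
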
//mxb
On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc at koitsu.org> wrote:
> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>
>> Sure, the script is not perfect yet and does not handle a lot of things, but shifting the focus from zpool import/export to the script itself is not
>> that clever, as the script works most of the time.
>>
>> The question is WHY ZFS sometimes corrupts metadata when it should not.
>> I've seen the zpool go stale when manually importing/exporting the pool.
>>
>>
>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc at koitsu.org> wrote:
>>
>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>
>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, and the script it triggers):
>>>>
>>>> root at nfs2:/root # cat /etc/devd.conf
>>>>
>>>>
>>>> notify 30 {
>>>>         match "system" "IFNET";
>>>>         match "subsystem" "carp0";
>>>>         match "type" "LINK_UP";
>>>>         action "/etc/zfs_switch.sh active";
>>>> };
>>>>
>>>> notify 30 {
>>>>         match "system" "IFNET";
>>>>         match "subsystem" "carp0";
>>>>         match "type" "LINK_DOWN";
>>>>         action "/etc/zfs_switch.sh backup";
>>>> };
>>>>
>>>> root at nfs2:/root # cat /etc/zfs_switch.sh
>>>> #!/bin/sh
>>>>
>>>> DATE=`date +%Y%m%d`
>>>> HOSTNAME=`hostname`
>>>>
>>>> ZFS_POOL="jbod"
>>>>
>>>>
>>>> case $1 in
>>>> active)
>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>         sleep 10
>>>>         /sbin/zpool import -f jbod
>>>>         /etc/rc.d/mountd restart
>>>>         /etc/rc.d/nfsd restart
>>>>         ;;
>>>> backup)
>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>         /sbin/zpool export jbod
>>>>         /etc/rc.d/mountd restart
>>>>         /etc/rc.d/nfsd restart
>>>>         ;;
>>>> *)
>>>>         exit 0
>>>>         ;;
>>>> esac
>>>>
>>>> This works most of the time, but sometimes I'm forced to re-create the pool. Those machines are supposed to go into production.
>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
>>>
>>> This script looks highly error-prone. Hasty hasty... :-)
>>>
>>> This script assumes that the "zpool" commands (import and export) always
>>> work/succeed; there is no exit code ($?) checking being used.
>>>
>>> Since this is run from within devd(8): where does stdout/stderr go to
>>> when running a program/script under devd(8)? Does it effectively go
>>> to the bit bucket (/dev/null)? If so, you'd never know if the import or
>>> export actually succeeded or not (the export sounds more likely to be
>>> the problem point).
>>>
>>> I imagine there would be some situations where the export would fail
>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>> already blindly assuming everything will be fantastic. Surprise.
>>>
>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>> (/bin/sh) or not. If they don't, you won't be able to use things like
>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'. You
>>> would then need to implement the equivalent of logging within your
>>> zfs_switch.sh script.
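>>>
>>> One way to do that (a rough, untested sketch; the log path is just an
>>> example) is to redirect the script's own stdout/stderr right at the
>>> top of zfs_switch.sh:
>>>
>>> #!/bin/sh
>>> # Append everything this script prints (stdout and stderr) to a log,
>>> # since output under devd(8) may otherwise be discarded.
>>> exec >> /var/log/failover.log 2>&1
>>> echo "`date`: zfs_switch.sh called with argument: $1"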
>>>
>>> You may want to consider the -f flag to zpool import/export
>>> (particularly export). However there are risks involved -- userland
>>> applications which have an fd/fh open on a file which is stored on a
>>> filesystem that has now completely disappeared can sometimes crash
>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>> they're designed.
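>>>
>>> If you do end up needing it, a small untested sketch that attempts a
>>> clean export first and only falls back to forcing it:
>>>
>>> if ! /sbin/zpool export jbod; then
>>>         echo "clean export of jbod failed, retrying with -f"
>>>         /sbin/zpool export -f jbod || exit 1
>>> fi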
>>>
>>> Basically what I'm trying to say is that devd(8) being used as a form of
>>> HA (high availability) and load balancing is not always possible.
>>> Real/true HA (especially with SANs) is often done very differently (now
>>> you know why it's often proprietary. :-) )
>
> Add error checking to your script. That's my first and foremost
> recommendation. It's not hard to do, really. :-)
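>
> As a rough sketch of what I mean (adjust to taste; the extra mail
> subjects here are just placeholders), the "active" branch of your case
> statement could become something like:
>
> active)
>         echo "Switching to ACTIVE and importing ZFS" | mail -s "$DATE: $HOSTNAME switching to ACTIVE" root
>         sleep 10
>         /sbin/zpool import -f jbod
>         rc=$?
>         if [ $rc -ne 0 ]; then
>                 echo "zpool import of jbod failed (exit code $rc)" | mail -s "$DATE: $HOSTNAME IMPORT FAILED" root
>                 exit 1
>         fi
>         /etc/rc.d/mountd restart || echo "mountd restart failed"
>         /etc/rc.d/nfsd restart || echo "nfsd restart failed"
>         ;;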
>
> After you do that and still experience the issue (e.g. you see no actual
> errors/issues during the export/import phases), I recommend removing
> the "cache" devices which are "independent" on each system from the pool
> entirely. Quoting you (for readers, since I snipped it from my previous
> reply):
>
>>>> Note, that ZIL(mirrored) resides on external enclosure. Only L2ARC
>>>> is both local and external - da1,da2, da13s2, da14s2
>
> I interpret this to mean the primary and backup nodes (physical systems)
> have actual disks which are not part of the "external enclosure". If
> that's the case -- those disks are always going to vary in their
> contents and metadata. Those are never going to be 100% identical all
> the time (is this not obvious?). I'm surprised your stuff has worked at
> all using that model, honestly.
>
> ZFS is going to bitch/cry if it cannot verify the integrity of certain
> things, all the way down to the L2ARC. That's my understanding of it at
> least, meaning there must always be "some" kind of metadata that has to
> be kept/maintained there.
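>
> You can see which cache vdevs the pool currently records (on whichever
> node has it imported) with:
>
> zpool status jbod
>
> The local disks should show up under the "cache" section of the output.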
>
> Alternately you could try doing this:
>
> zpool remove jbod cache daX daY ...
> zpool export jbod
>
> Then on the other system:
>
> zpool import jbod
> zpool add jbod cache daX daY ...
>
> Where daX and daY are the disks which are independent to each system
> (not on the "external enclosure").
>
> Finally, it would also be useful/worthwhile if you would provide
> "dmesg" from both systems and for you to explain the physical wiring
> along with what device (e.g. daX) correlates with what exact thing on
> each system. (We right now have no knowledge of that, and your terse
> explanations imply we do -- we need to know more)
>
> --
> | Jeremy Chadwick jdc at koitsu.org |
> | UNIX Systems Administrator http://jdc.koitsu.org/ |
> | Making life hard for others since 1977. PGP 4BD6C0CB |
>