zpool export/import on failover - The pool metadata is corrupted
mxb
mxb at alumni.chalmers.se
Tue Jun 25 19:22:49 UTC 2013
I think I've found the root of this issue.
It looks like "wiring down" the disks the same way on both nodes (as suggested) fixes it.
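
For reference, "wiring down" here means pinning CAM device names in /boot/device.hints so the same physical disk always enumerates as the same daN on both heads. A minimal sketch; the controller name (mps0) and the bus/target/unit numbers are illustrative and have to match your own hardware:

    # /boot/device.hints -- pin da3 to a fixed controller/bus/target so both
    # failover nodes enumerate the shared JBOD identically (values illustrative)
    hint.scbus.2.at="mps0"       # CAM bus 2 is the one hanging off mps0
    hint.da.3.at="scbus2"        # da3 is always the disk on that bus...
    hint.da.3.target="3"         # ...at SCSI target 3...
    hint.da.3.unit="0"           # ...LUN 0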
//mxb
On 20 jun 2013, at 12:30, mxb <mxb at alumni.chalmers.se> wrote:
>
> Well,
>
> I'm back to square one.
>
> After some uptime and successful import/export from one node to the other, I eventually got 'metadata corruption'.
> I previously had no problems importing/exporting while, for example, rebooting the master node (nfs1), but not THIS time.
> Metadata got corrupted while rebooting the master node??
>
> Any ideas?
>
> [root at nfs1 ~]# zpool import
>    pool: jbod
>      id: 7663925948774378610
>   state: FAULTED
>  status: The pool metadata is corrupted.
>  action: The pool cannot be imported due to damaged devices or data.
>     see: http://illumos.org/msg/ZFS-8000-72
>  config:
>
>         jbod            FAULTED  corrupted data
>           raidz3-0      ONLINE
>             da3         ONLINE
>             da4         ONLINE
>             da5         ONLINE
>             da6         ONLINE
>             da7         ONLINE
>             da8         ONLINE
>             da9         ONLINE
>             da10        ONLINE
>             da11        ONLINE
>             da12        ONLINE
>         cache
>           da13s2
>           da14s2
>         logs
>           mirror-1      ONLINE
>             da13s1      ONLINE
>             da14s1      ONLINE
> [root at nfs1 ~]# zpool import jbod
> cannot import 'jbod': I/O error
>         Destroy and re-create the pool from
>         a backup source.
> [root at nfs1 ~]#
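>
> Before re-creating anything I'll probably try a recovery-mode import
> first. If I read zpool(8) right, -F discards the last few transactions
> to get the pool back to an importable state, and adding -n only checks
> whether that would work without actually doing it:
>
> [root at nfs1 ~]# zpool import -F -n jbod     # dry run
> [root at nfs1 ~]# zpool import -F jbod        # real rewind attempt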
>
> On 11 jun 2013, at 10:46, mxb <mxb at alumni.chalmers.se> wrote:
>
>>
>> Thanks to everyone who replied.
>> Removing the local L2ARC cache disks (da1, da2) indeed turned out to cure my problem.
>>
>> The next step is to test adding/removing the cache devices around import/export, as Jeremy suggested.
>>
>> //mxb
>>
>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc at koitsu.org> wrote:
>>
>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>>>
>>>> Sure, the script is not perfect yet and does not handle a lot of cases, but shifting the spotlight from zpool import/export to the script itself is not
>>>> that clever, as the script works most of the time.
>>>>
>>>> The question is WHY ZFS sometimes corrupts metadata when it should not.
>>>> I've seen the zpool go stale when manually importing/exporting the pool.
>>>>
>>>>
>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc at koitsu.org> wrote:
>>>>
>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>>>
>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, and a script for the switchover):
>>>>>>
>>>>>> root at nfs2:/root # cat /etc/devd.conf
>>>>>>
>>>>>>
>>>>>> notify 30 {
>>>>>>     match "system"          "IFNET";
>>>>>>     match "subsystem"       "carp0";
>>>>>>     match "type"            "LINK_UP";
>>>>>>     action "/etc/zfs_switch.sh active";
>>>>>> };
>>>>>>
>>>>>> notify 30 {
>>>>>>     match "system"          "IFNET";
>>>>>>     match "subsystem"       "carp0";
>>>>>>     match "type"            "LINK_DOWN";
>>>>>>     action "/etc/zfs_switch.sh backup";
>>>>>> };
>>>>>>
>>>>>> root at nfs2:/root # cat /etc/zfs_switch.sh
>>>>>> #!/bin/sh
>>>>>>
>>>>>> DATE=`date +%Y%m%d`
>>>>>> HOSTNAME=`hostname`
>>>>>>
>>>>>> ZFS_POOL="jbod"
>>>>>>
>>>>>>
>>>>>> case $1 in
>>>>>>     active)
>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>>>         sleep 10
>>>>>>         /sbin/zpool import -f jbod
>>>>>>         /etc/rc.d/mountd restart
>>>>>>         /etc/rc.d/nfsd restart
>>>>>>         ;;
>>>>>>     backup)
>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>>>         /sbin/zpool export jbod
>>>>>>         /etc/rc.d/mountd restart
>>>>>>         /etc/rc.d/nfsd restart
>>>>>>         ;;
>>>>>>     *)
>>>>>>         exit 0
>>>>>>         ;;
>>>>>> esac
>>>>>>
>>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. Those machines are supposed to go into production.
>>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
>>>>>
>>>>> This script looks highly error-prone. Hasty hasty... :-)
>>>>>
>>>>> This script assumes that the "zpool" commands (import and export) always
>>>>> work/succeed; there is no exit code ($?) checking being used.
>>>>>
>>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>>> when running a program/script under devd(8)? Does it effectively go
>>>>> to the bit bucket (/dev/null)? If so, you'd never know if the import or
>>>>> export actually succeeded or not (the export sounds more likely to be
>>>>> the problem point).
>>>>>
>>>>> I imagine there would be some situations where the export would fail
>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>>>> already blindly assuming everything will be fantastic. Surprise.
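>>>>>
>>>>> If you want to detect that case before it bites you, here's a sketch
>>>>> (assuming pool "jbod" is mounted at /jbod; fstat -f reports files open
>>>>> on the filesystem containing the given path) for the "backup" branch:
>>>>>
>>>>>     # fstat always prints a header line, hence the -gt 1 comparison
>>>>>     if [ `fstat -f /jbod | wc -l` -gt 1 ]; then
>>>>>         logger -t zfs_switch "open files remain on /jbod; export may fail"
>>>>>     fi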
>>>>>
>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>>> (/bin/sh) or not. If they don't, you won't be able to use things like
>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'. You
>>>>> would then need to implement the equivalent of logging within your
>>>>> zfs_switch.sh script.
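>>>>>
>>>>> A cheap way to get that, as a sketch: redirect everything from inside
>>>>> the script itself, near the top of zfs_switch.sh:
>>>>>
>>>>>     # capture all later stdout/stderr ourselves, since devd(8) gives us neither
>>>>>     exec >> /var/log/zfs_switch.log 2>&1
>>>>>     echo "`date`: invoked with argument: $1"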
>>>>>
>>>>> You may want to consider the -f flag to zpool import/export
>>>>> (particularly export). However there are risks involved -- userland
>>>>> applications which have an fd/fh open on a file which is stored on a
>>>>> filesystem that has now completely disappeared can sometimes crash
>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>>>> they're designed.
>>>>>
>>>>> Basically, what I'm trying to say is that using devd(8) as a form of
>>>>> HA (high availability) and load balancing is not always feasible.
>>>>> Real/true HA (especially with SANs) is often done very differently (now
>>>>> you know why it's often proprietary :-) ).
>>>
>>> Add error checking to your script. That's my first and foremost
>>> recommendation. It's not hard to do, really. :-)
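>>>
>>> As a sketch of what I mean (same layout as your script, the mail
>>> subject is just an example), for the export path:
>>>
>>>     /sbin/zpool export jbod
>>>     rc=$?
>>>     if [ $rc -ne 0 ]; then
>>>         echo "zpool export jbod failed (rc=$rc)" | \
>>>             mail -s "$HOSTNAME: export FAILED, staying put" root
>>>         exit 1
>>>     fi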
>>>
>>> After you do that and still experience the issue (e.g. you see no actual
>>> errors/issues during the export/import phases), I recommend removing
>>> the "cache" devices which are "independent" on each system from the pool
>>> entirely. Quoting you (for readers, since I snipped it from my previous
>>> reply):
>>>
>>>>>> Note, that ZIL(mirrored) resides on external enclosure. Only L2ARC
>>>>>> is both local and external - da1,da2, da13s2, da14s2
>>>
>>> I interpret this to mean the primary and backup nodes (physical systems)
>>> have actual disks which are not part of the "external enclosure". If
>>> that's the case -- those disks are always going to vary in their
>>> contents and metadata. Those are never going to be 100% identical all
>>> the time (is this not obvious?). I'm surprised your stuff has worked at
>>> all using that model, honestly.
>>>
>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>>> things, all the way down to the L2ARC. That's my understanding of it at
>>> least, meaning there must always be "some" kind of metadata that has to
>>> be kept/maintained there.
>>>
>>> Alternatively, you could try doing this:
>>>
>>>     zpool remove jbod daX daY ...
>>>     zpool export jbod
>>>
>>> Then on the other system:
>>>
>>>     zpool import jbod
>>>     zpool add jbod cache daX daY ...
>>>
>>> Where daX and daY are the disks which are independent to each system
>>> (not on the "external enclosure").
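>>>
>>> Folded into your switch script, that could look roughly like the
>>> following (da1/da2 standing in for whichever disks are local to the
>>> node; note that "zpool remove" takes no vdev-type keyword, unlike
>>> "zpool add"):
>>>
>>>     case $1 in
>>>     backup)
>>>         /sbin/zpool remove jbod da1 da2 &&
>>>             /sbin/zpool export jbod ||
>>>             logger -t zfs_switch "remove/export of jbod failed"
>>>         ;;
>>>     active)
>>>         /sbin/zpool import jbod &&
>>>             /sbin/zpool add jbod cache da1 da2 ||
>>>             logger -t zfs_switch "import/add of jbod failed"
>>>         ;;
>>>     esac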
>>>
>>> Finally, it would also be useful/worthwhile if you would provide
>>> "dmesg" from both systems and for you to explain the physical wiring
>>> along with what device (e.g. daX) correlates with what exact thing on
>>> each system. (Right now we have no knowledge of that, though your terse
>>> explanations imply we do -- we need to know more.)
>>>
>>> --
>>> | Jeremy Chadwick                                   jdc at koitsu.org |
>>> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
>>> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>>>
>>
>