zpool export/import on failover - The pool metadata is corrupted
mxb
mxb at alumni.chalmers.se
Thu Jun 27 09:36:02 UTC 2013
A note for the archives:
So far I have not experienced any problems with both the local (per head unit) and the external (on the disk enclosure) caches while importing
and exporting my pool. The disks I use on both nodes are identical: same manufacturer, size, and model.
da1,da2 - local
da32,da33 - external
Export/import is done WITHOUT removing/adding local disks.
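
For reference, the "wiring down" mentioned in the quoted thread below is done
in /boot/device.hints, identically on both heads. A minimal sketch (the HBA
name and target numbers here are illustrative; take the real ones from dmesg):

hint.scbus.0.at="mps0"       # pin SCSI bus 0 to the first HBA
hint.da.1.at="scbus0"        # wire the local cache disk da1 ...
hint.da.1.target="1"         # ... to bus 0, target 1,
hint.da.1.unit="0"           # ... LUN 0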

root at nfs1:/root # zpool status
  pool: jbod
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jun 26 13:14:55 2013
config:

        NAME          STATE     READ WRITE CKSUM
        jbod          ONLINE       0     0     0
          raidz3-0    ONLINE       0     0     0
            da10      ONLINE       0     0     0
            da11      ONLINE       0     0     0
            da12      ONLINE       0     0     0
            da13      ONLINE       0     0     0
            da14      ONLINE       0     0     0
            da15      ONLINE       0     0     0
            da16      ONLINE       0     0     0
            da17      ONLINE       0     0     0
            da18      ONLINE       0     0     0
            da19      ONLINE       0     0     0
        logs
          mirror-1    ONLINE       0     0     0
            da32s1    ONLINE       0     0     0
            da33s1    ONLINE       0     0     0
        cache
          da32s2      ONLINE       0     0     0
          da33s2      ONLINE       0     0     0
          da1         ONLINE       0     0     0
          da2         ONLINE       0     0     0
On 25 jun 2013, at 21:22, mxb <mxb at alumni.chalmers.se> wrote:
>
> I think I've found the root of this issue.
> Looks like "wiring down" the disks the same way on both nodes (as suggested) fixes this issue.
>
> //mxb
>
> On 20 jun 2013, at 12:30, mxb <mxb at alumni.chalmers.se> wrote:
>
>>
>> Well,
>>
>> I'm back to square one.
>>
>> After some uptime and several successful imports/exports from one node to the other, I eventually got 'metadata corruption'.
>> I previously had no problem with import/export while, for example, rebooting the master node (nfs1), but not THIS time.
>> The metadata got corrupted while rebooting the master node??
>>
>> Any ideas?
>>
>> [root at nfs1 ~]# zpool import
>>   pool: jbod
>>     id: 7663925948774378610
>>  state: FAULTED
>> status: The pool metadata is corrupted.
>> action: The pool cannot be imported due to damaged devices or data.
>>    see: http://illumos.org/msg/ZFS-8000-72
>> config:
>>
>>         jbod          FAULTED  corrupted data
>>           raidz3-0    ONLINE
>>             da3       ONLINE
>>             da4       ONLINE
>>             da5       ONLINE
>>             da6       ONLINE
>>             da7       ONLINE
>>             da8       ONLINE
>>             da9       ONLINE
>>             da10      ONLINE
>>             da11      ONLINE
>>             da12      ONLINE
>>         cache
>>           da13s2
>>           da14s2
>>         logs
>>           mirror-1    ONLINE
>>             da13s1    ONLINE
>>             da14s1    ONLINE
>> [root at nfs1 ~]# zpool import jbod
>> cannot import 'jbod': I/O error
>>         Destroy and re-create the pool from
>>         a backup source.
>> [root at nfs1 ~]#
>>
>> On 11 jun 2013, at 10:46, mxb <mxb at alumni.chalmers.se> wrote:
>>
>>>
>>> Thanks everyone whom replied.
>>> Removing the local L2ARC cache disks (da1, da2) indeed turned out to cure my problem.
>>>
>>> Next up is testing add/remove after import/export, as Jeremy suggested.
>>>
>>> //mxb
>>>
>>> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc at koitsu.org> wrote:
>>>
>>>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>>>>
>>>>> Sure, the script is not perfect yet and does not handle a lot of cases, but shifting the focus from zpool import/export to the script itself is not that clever, as the script works most of the time.
>>>>>
>>>>> The question is WHY ZFS sometimes corrupts metadata when it should not.
>>>>> I've seen the zpool go stale when manually importing/exporting the pool.
>>>>>
>>>>>
>>>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc at koitsu.org> wrote:
>>>>>
>>>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>>>>
>>>>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, plus the script that lifts the pool):
>>>>>>>
>>>>>>> root at nfs2:/root # cat /etc/devd.conf
>>>>>>>
>>>>>>>
>>>>>>> notify 30 {
>>>>>>>     match "system"    "IFNET";
>>>>>>>     match "subsystem" "carp0";
>>>>>>>     match "type"      "LINK_UP";
>>>>>>>     action "/etc/zfs_switch.sh active";
>>>>>>> };
>>>>>>>
>>>>>>> notify 30 {
>>>>>>>     match "system"    "IFNET";
>>>>>>>     match "subsystem" "carp0";
>>>>>>>     match "type"      "LINK_DOWN";
>>>>>>>     action "/etc/zfs_switch.sh backup";
>>>>>>> };
>>>>>>>
>>>>>>> root at nfs2:/root # cat /etc/zfs_switch.sh
>>>>>>> #!/bin/sh
>>>>>>>
>>>>>>> DATE=`date +%Y%m%d`
>>>>>>> HOSTNAME=`hostname`
>>>>>>>
>>>>>>> ZFS_POOL="jbod"
>>>>>>>
>>>>>>>
>>>>>>> case "$1" in
>>>>>>>     active)
>>>>>>>         echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>>>>         sleep 10
>>>>>>>         /sbin/zpool import -f jbod
>>>>>>>         /etc/rc.d/mountd restart
>>>>>>>         /etc/rc.d/nfsd restart
>>>>>>>         ;;
>>>>>>>     backup)
>>>>>>>         echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>>>>         /sbin/zpool export jbod
>>>>>>>         /etc/rc.d/mountd restart
>>>>>>>         /etc/rc.d/nfsd restart
>>>>>>>         ;;
>>>>>>>     *)
>>>>>>>         exit 0
>>>>>>>         ;;
>>>>>>> esac
>>>>>>>
>>>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. These machines are supposed to go into production.
>>>>>>> Losing the pool (and the data inside it) stops me from deploying this setup.
>>>>>>
>>>>>> This script looks highly error-prone. Hasty hasty... :-)
>>>>>>
>>>>>> This script assumes that the "zpool" commands (import and export) always
>>>>>> work/succeed; there is no exit code ($?) checking being used.
>>>>>>
>>>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>>>> when running a program/script under devd(8)? Does it effectively go
>>>>>> to the bit bucket (/dev/null)? If so, you'd never know if the import or
>>>>>> export actually succeeded or not (the export sounds more likely to be
>>>>>> the problem point).
>>>>>>
>>>>>> I imagine there would be some situations where the export would fail
>>>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>>>>> already blindly assuming everything will be fantastic. Surprise.
>>>>>>
>>>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>>>> (/bin/sh) or not. If they don't, you won't be able to use things like
>>>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'. You
>>>>>> would then need to implement the equivalent of logging within your
>>>>>> zfs_switch.sh script.
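>>>>>>
>>>>>> For example (an untested sketch; the log file path is just an example):
>>>>>>
>>>>>> LOG="/var/log/failover.log"
>>>>>> log() { echo "$(date): $*" >> "$LOG"; }
>>>>>>
>>>>>> if /sbin/zpool import -f jbod >> "$LOG" 2>&1; then
>>>>>>     log "zpool import jbod: OK"
>>>>>> else
>>>>>>     log "zpool import jbod: FAILED (exit $?)"
>>>>>>     exit 1
>>>>>> fi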
>>>>>>
>>>>>> You may want to consider the -f flag to zpool import/export
>>>>>> (particularly export). However there are risks involved -- userland
>>>>>> applications which have an fd/fh open on a file which is stored on a
>>>>>> filesystem that has now completely disappeared can sometimes crash
>>>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>>>>> they're designed.
>>>>>>
>>>>>> Basically what I'm trying to say is that devd(8) being used as a form of
>>>>>> HA (high availability) and load balancing is not always possible.
>>>>>> Real/true HA (especially with SANs) is often done very differently (now
>>>>>> you know why it's often proprietary. :-) )
>>>>
>>>> Add error checking to your script. That's my first and foremost
>>>> recommendation. It's not hard to do, really. :-)
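>>>>
>>>> For example (a sketch only, reusing the mail notification your script
>>>> already sends):
>>>>
>>>> if ! /sbin/zpool export jbod; then
>>>>     echo "zpool export jbod FAILED" | mail -s "$HOSTNAME: export failed" root
>>>>     exit 1
>>>> fi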
>>>>
>>>> After you do that and still experience the issue (e.g. you see no actual
>>>> errors/issues during the export/import phases), I recommend removing
>>>> the "cache" devices which are "independent" on each system from the pool
>>>> entirely. Quoting you (for readers, since I snipped it from my previous
>>>> reply):
>>>>
>>>>>>> Note that the ZIL (mirrored) resides on the external enclosure. Only the L2ARC
>>>>>>> is both local and external - da1, da2, da13s2, da14s2
>>>>
>>>> I interpret this to mean the primary and backup nodes (physical systems)
>>>> have actual disks which are not part of the "external enclosure". If
>>>> that's the case -- those disks are always going to vary in their
>>>> contents and metadata. Those are never going to be 100% identical all
>>>> the time (is this not obvious?). I'm surprised your stuff has worked at
>>>> all using that model, honestly.
>>>>
>>>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>>>> things, all the way down to the L2ARC. That's my understanding of it at
>>>> least, meaning there must always be "some" kind of metadata that has to
>>>> be kept/maintained there.
>>>>
>>>> Alternately you could try doing this:
>>>>
>>>> zpool remove jbod daX daY ...
>>>> zpool export jbod
>>>>
>>>> Then on the other system:
>>>>
>>>> zpool import jbod
>>>> zpool add jbod cache daX daY ...
>>>>
>>>> Where daX and daY are the disks which are independent to each system
>>>> (not on the "external enclosure").
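>>>>
>>>> Folded into your zfs_switch.sh, that might look something like this
>>>> (a sketch only; da1/da2 are the local L2ARC disks from your earlier
>>>> message -- substitute whatever is correct on each node):
>>>>
>>>> backup)
>>>>     /sbin/zpool remove jbod da1 da2 || exit 1   # drop the node-local L2ARC
>>>>     /sbin/zpool export jbod         || exit 1
>>>>     ;;
>>>> active)
>>>>     /sbin/zpool import -f jbod      || exit 1
>>>>     /sbin/zpool add jbod cache da1 da2          # re-add this node's L2ARC
>>>>     ;;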
>>>>
>>>> Finally, it would also be useful/worthwhile if you would provide
>>>> "dmesg" from both systems and for you to explain the physical wiring
>>>> along with what device (e.g. daX) correlates with what exact thing on
>>>> each system. (We right now have no knowledge of that, and your terse
>>>> explanations imply we do -- we need to know more)
>>>>
>>>> --
>>>> | Jeremy Chadwick jdc at koitsu.org |
>>>> | UNIX Systems Administrator http://jdc.koitsu.org/ |
>>>> | Making life hard for others since 1977. PGP 4BD6C0CB |
>>>>
>>>
>>
>