zpool export/import on failover - The pool metadata is corrupted
mxb
mxb at alumni.chalmers.se
Thu Jun 20 10:30:36 UTC 2013
Well,
I'm back to square one.
After some uptime and successful import/export from one node to another, I eventually got 'metadata corruption'.
I had no problem with import/export while, for example, rebooting the master node (nfs1), but not THIS time.
Metadata got corrupted while rebooting the master node??
Any ideas?
[root at nfs1 ~]# zpool import
   pool: jbod
     id: 7663925948774378610
  state: FAULTED
 status: The pool metadata is corrupted.
 action: The pool cannot be imported due to damaged devices or data.
    see: http://illumos.org/msg/ZFS-8000-72
 config:

	jbod          FAULTED  corrupted data
	  raidz3-0    ONLINE
	    da3       ONLINE
	    da4       ONLINE
	    da5       ONLINE
	    da6       ONLINE
	    da7       ONLINE
	    da8       ONLINE
	    da9       ONLINE
	    da10      ONLINE
	    da11      ONLINE
	    da12      ONLINE
	cache
	  da13s2
	  da14s2
	logs
	  mirror-1    ONLINE
	    da13s1    ONLINE
	    da14s1    ONLINE

[root at nfs1 ~]# zpool import jbod
cannot import 'jbod': I/O error
	Destroy and re-create the pool from
	a backup source.
[root at nfs1 ~]#
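Before destroying the pool and restoring from backup, a rewind import may be worth trying. A sketch, assuming the pool name above; `DRYRUN=echo` is an illustration aid that only prints the commands, so nothing runs blindly (clear it on the real host):

```shell
#!/bin/sh
# Rewind-import sketch: "zpool import -F" discards the last few
# transactions, which can sometimes revive a pool that reports
# "The pool metadata is corrupted". -n makes it a dry run first.
POOL="jbod"
DRYRUN="echo"          # set DRYRUN="" on the real host to execute

# 1) Ask ZFS whether a rewind import would succeed (no changes made).
$DRYRUN zpool import -Fn "$POOL"
# 2) If the dry run looks sane, perform the rewind import for real.
$DRYRUN zpool import -F "$POOL"
```

Whether a rewind helps depends on how far back the damage goes; it is a last resort before re-creating the pool, not a guarantee.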
On 11 jun 2013, at 10:46, mxb <mxb at alumni.chalmers.se> wrote:
>
> Thanks to everyone who replied.
> Removing the local L2ARC cache disks (da1, da2) indeed turned out to be the cure for my problem.
>
> Next I'll test add/remove around import/export, as Jeremy suggested.
>
> //mxb
>
> On 7 jun 2013, at 01:34, Jeremy Chadwick <jdc at koitsu.org> wrote:
>
>> On Fri, Jun 07, 2013 at 12:51:14AM +0200, mxb wrote:
>>>
>>> Sure, the script is not perfect yet and does not handle a lot of cases, but shifting the
>>> blame from zpool import/export to the script itself is not that clever, as the script works most of the time.
>>>
>>> The question is WHY ZFS sometimes corrupts metadata when it should not.
>>> I've seen the zpool go stale when manually importing/exporting the pool.
>>>
>>>
>>> On 7 jun 2013, at 00:39, Jeremy Chadwick <jdc at koitsu.org> wrote:
>>>
>>>> On Fri, Jun 07, 2013 at 12:12:39AM +0200, mxb wrote:
>>>>>
>>>>> When the MASTER goes down, CARP on the second node becomes MASTER (devd.conf, plus a script for the lifting):
>>>>>
>>>>> root at nfs2:/root # cat /etc/devd.conf
>>>>>
>>>>>
>>>>> notify 30 {
>>>>> match "system" "IFNET";
>>>>> match "subsystem" "carp0";
>>>>> match "type" "LINK_UP";
>>>>> action "/etc/zfs_switch.sh active";
>>>>> };
>>>>>
>>>>> notify 30 {
>>>>> match "system" "IFNET";
>>>>> match "subsystem" "carp0";
>>>>> match "type" "LINK_DOWN";
>>>>> action "/etc/zfs_switch.sh backup";
>>>>> };
>>>>>
>>>>> root at nfs2:/root # cat /etc/zfs_switch.sh
>>>>> #!/bin/sh
>>>>>
>>>>> DATE=`date +%Y%m%d`
>>>>> HOSTNAME=`hostname`
>>>>>
>>>>> ZFS_POOL="jbod"
>>>>>
>>>>>
>>>>> case $1 in
>>>>> active)
>>>>> echo "Switching to ACTIVE and importing ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to ACTIVE' root
>>>>> sleep 10
>>>>> /sbin/zpool import -f jbod
>>>>> /etc/rc.d/mountd restart
>>>>> /etc/rc.d/nfsd restart
>>>>> ;;
>>>>> backup)
>>>>> echo "Switching to BACKUP and exporting ZFS" | mail -s ''$DATE': '$HOSTNAME' switching to BACKUP' root
>>>>> /sbin/zpool export jbod
>>>>> /etc/rc.d/mountd restart
>>>>> /etc/rc.d/nfsd restart
>>>>> ;;
>>>>> *)
>>>>> exit 0
>>>>> ;;
>>>>> esac
>>>>>
>>>>> This works most of the time, but sometimes I'm forced to re-create the pool. These machines are supposed to go into prod.
>>>>> Losing the pool (and the data in it) stops me from deploying this setup.
>>>>
>>>> This script looks highly error-prone. Hasty hasty... :-)
>>>>
>>>> This script assumes that the "zpool" commands (import and export) always
>>>> work/succeed; there is no exit code ($?) checking being used.
>>>>
>>>> Since this is run from within devd(8): where does stdout/stderr go to
>>>> when running a program/script under devd(8)? Does it effectively go
>>>> to the bit bucket (/dev/null)? If so, you'd never know if the import or
>>>> export actually succeeded or not (the export sounds more likely to be
>>>> the problem point).
>>>>
>>>> I imagine there would be some situations where the export would fail
>>>> (some files on filesystems under pool "jbod" still in use), yet CARP is
>>>> already blindly assuming everything will be fantastic. Surprise.
>>>>
>>>> I also do not know if devd.conf(5) "action" commands spawn a sub-shell
>>>> (/bin/sh) or not. If they don't, you won't be able to use things like
>>>> 'action "/etc/zfs_switch.sh active >> /var/log/failover.log";'. You
>>>> would then need to implement the equivalent of logging within your
>>>> zfs_switch.sh script.
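One way to do that logging inside the script itself is an `exec` redirect near the top, so output is captured no matter how devd(8) invokes the action. A sketch (the /var/log/failover.log path is the one suggested above; it is pointed at /tmp here only so the sketch can run anywhere):

```shell
#!/bin/sh
# Sketch: redirect this script's own stdout/stderr to a log file from
# inside the script, so nothing is lost even if devd(8) runs the
# action without a shell. LOG is overridable; on the real hosts it
# would be /var/log/failover.log.
LOG="${LOG:-/tmp/failover.log}"

# From this point on, the output of every command below (zpool,
# mountd/nfsd restarts, etc.) lands in $LOG.
exec >>"$LOG" 2>&1

echo "$(date +%Y%m%d) $(hostname): zfs_switch.sh called with '$1'"
```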
>>>>
>>>> You may want to consider the -f flag to zpool import/export
>>>> (particularly export). However there are risks involved -- userland
>>>> applications which have an fd/fh open on a file which is stored on a
>>>> filesystem that has now completely disappeared can sometimes crash
>>>> (segfault) or behave very oddly (100% CPU usage, etc.) depending on how
>>>> they're designed.
>>>>
>>>> Basically what I'm trying to say is that devd(8) being used as a form of
>>>> HA (high availability) and load balancing is not always possible.
>>>> Real/true HA (especially with SANs) is often done very differently (now
>>>> you know why it's often proprietary. :-) )
>>
>> Add error checking to your script. That's my first and foremost
>> recommendation. It's not hard to do, really. :-)
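A minimal sketch of that error checking (the `must` helper name and the log path are illustrative, not from the original script; `true` stands in for the real zpool and rc.d commands):

```shell
#!/bin/sh
# Sketch of exit-code checking for zfs_switch.sh: "must" runs a
# command and, on failure, logs it and aborts, so the failover never
# proceeds on a half-finished import/export.
LOG="${LOG:-/tmp/failover.log}"   # real hosts: /var/log/failover.log

must() {
    "$@" >>"$LOG" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAILED (rc=$rc): $*" >>"$LOG"
        exit "$rc"    # on the real hosts, also mail root here first
    fi
}

# On nfs1/nfs2 these would be, e.g.:
#   must /sbin/zpool export jbod
#   must /etc/rc.d/mountd restart
# Stand-in so the pattern can be exercised anywhere:
must true
echo "export path would continue here"
```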
>>
>> After you do that and still experience the issue (e.g. you see no actual
>> errors/issues during the export/import phases), I recommend removing
>> the "cache" devices which are "independent" on each system from the pool
>> entirely. Quoting you (for readers, since I snipped it from my previous
>> reply):
>>
>>>>> Note, that ZIL(mirrored) resides on external enclosure. Only L2ARC
>>>>> is both local and external - da1,da2, da13s2, da14s2
>>
>> I interpret this to mean the primary and backup nodes (physical systems)
>> have actual disks which are not part of the "external enclosure". If
>> that's the case -- those disks are always going to vary in their
>> contents and metadata. Those are never going to be 100% identical all
>> the time (is this not obvious?). I'm surprised your stuff has worked at
>> all using that model, honestly.
>>
>> ZFS is going to bitch/cry if it cannot verify the integrity of certain
>> things, all the way down to the L2ARC. That's my understanding of it at
>> least, meaning there must always be "some" kind of metadata that has to
>> be kept/maintained there.
>>
>> Alternately you could try doing this:
>>
>> zpool remove jbod daX daY ...
>> zpool export jbod
>>
>> Then on the other system:
>>
>> zpool import jbod
>> zpool add jbod cache daX daY ...
>>
>> Where daX and daY are the disks which are independent to each system
>> (not on the "external enclosure").
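That hand-off could be scripted roughly like this (a sketch: `DRYRUN=echo` only prints the commands so the ordering is visible without ZFS present, and daX/daY remain placeholders for each node's own L2ARC disks; note that `zpool remove` takes the device names directly, without a `cache` keyword, while `zpool add` needs it):

```shell
#!/bin/sh
# Sketch of the hand-off order described above: drop the node-local
# cache disks before export, re-add them after import on the peer.
POOL="jbod"
CACHE_DISKS="daX daY"            # placeholders for per-node cache disks
DRYRUN="echo"                    # set DRYRUN="" on the real hosts

# --- on the node giving up the pool ---
$DRYRUN zpool remove "$POOL" $CACHE_DISKS
$DRYRUN zpool export "$POOL"

# --- on the node taking over ---
$DRYRUN zpool import "$POOL"
$DRYRUN zpool add "$POOL" cache $CACHE_DISKS
```

This keeps the pool's on-disk state free of vdevs the other node can never see, which is the point of Jeremy's suggestion.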
>>
>> Finally, it would also be useful/worthwhile if you would provide
>> "dmesg" from both systems and for you to explain the physical wiring
>> along with what device (e.g. daX) correlates with what exact thing on
>> each system. (Right now we have no knowledge of that, and your terse
>> explanations imply we do -- we need to know more.)
>>
>> --
>> | Jeremy Chadwick jdc at koitsu.org |
>> | UNIX Systems Administrator http://jdc.koitsu.org/ |
>> | Making life hard for others since 1977. PGP 4BD6C0CB |
>>
>
More information about the freebsd-fs mailing list