HAST and CARP

Mikolaj Golub to.my.trociny at gmail.com
Fri Jul 2 07:11:30 UTC 2010


On Fri, 02 Jul 2010 14:16:05 +0900 hiroshi at soupacific.com wrote:

 h> I checked that both node communication is established. And here is log
 h> without hastctl create.


 h> Seems ServerB once became MASTER, then back to BACKUP.
 h> This situation cause unhappy split-brain happened.
 h> hastctl dump shows prevrole: primary

 h> error debug los is this

 h> Jul  2 12:31:37 fw01B kernel: Clearing /tmp (X related).
 h> Jul  2 12:31:37 fw01B kernel: Updating motd:
 h> Jul  2 12:31:37 fw01B kernel: .
 h> Jul  2 12:31:37 fw01B kernel: Configuring syscons:
 h> Jul  2 12:31:37 fw01B kernel: blanktime
 h> Jul  2 12:31:37 fw01B kernel: .
 h> Jul  2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.206) failed: 2
 h> Jul  2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.202) failed: 2
 h> Jul  2 12:31:38 fw01B sm-mta[880]: starting daemon (8.14.4):
 h> SMTP+queueing at 00:30:00
 h> Jul  2 12:31:38 fw01B sm-msp-queue[884]: starting daemon (8.14.4):
 h> queueing at 00:30:00
 h> Jul  2 12:31:38 fw01B kernel: Starting cron.
 h> Jul  2 12:31:38 fw01B kernel: Starting background file system checks
 h> in 60 seconds.
 h> Jul  2 12:31:38 fw01B kernel:
 h> Jul  2 12:31:38 fw01B kernel: Fri Jul  2 12:31:38 UTC 2010
 h> Jul  2 12:31:40 fw01B kernel: carp0: INIT -> BACKUP
 h> Jul  2 12:31:40 fw01B kernel: alc0: link state changed to UP
 h> Jul  2 12:31:40 fw01B kernel: carp0: 2 link states coalesced
 h> Jul  2 12:31:40 fw01B kernel: carp0: link state changed to DOWN
 h> Jul  2 12:31:43 fw01B login: login on ttyv0 as root
 h> Jul  2 12:31:43 fw01B login: ROOT LOGIN (root) ON ttyv0
 h> Jul  2 12:31:43 fw01B kernel: Jul  2 12:31:43 fw01B login: ROOT LOGIN
 h> (root) ON ttyv0
 h> Jul  2 12:31:48 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
 h> Jul  2 12:31:48 fw01B hastd: Connection from
 h> tcp4://211.19.53.201:20070 to tcp4://211.19.53.206:8457.
 h> Jul  2 12:31:48 fw01B hastd: tcp4://211.19.53.201:20070: resource=zfshast
 h> Jul  2 12:31:48 fw01B hastd: [zfshast] (init) We act as init for the
 h> resource and not as secondary as requested by
 h> tcp4://211.19.53.201:20070.
 h> Jul  2 12:31:48 fw01B kernel: Jul  2 12:31:48 fw01B hastd: [zfshast]
 h> (init) We act as init for the resource and not as secondary as
 h> requested by tcp4://211.19.53.201:20070.
 h> Jul  2 12:31:53 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
 h> Jul  2 12:31:53 fw01B hastd: Connection from
 h> tcp4://211.19.53.201:11542 to tcp4://211.19.53.206:8457.
 h> Jul  2 12:31:53 fw01B hastd: tcp4://211.19.53.201:11542: resource=zfshast
 h> Jul  2 12:31:53 fw01B hastd: [zfshast] (init) We act as init for the
 h> resource and not as secondary as requested by
 h> tcp4://211.19.53.201:11542.
 h> Jul  2 12:31:53 fw01B kernel: Jul  2 12:31:53 fw01B hastd: [zfshast]
 h> (init) We act as init for the resource and not as secondary as
 h> requested by tcp4://211.19.53.201:11542.
 h> Jul  2 12:31:58 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
 h> Jul  2 12:31:58 fw01B hastd: Connection from
 h> tcp4://211.19.53.201:49777 to tcp4://211.19.53.206:8457.
 h> Jul  2 12:31:58 fw01B hastd: tcp4://211.19.53.201:49777: resource=zfshast
 h> Jul  2 12:31:58 fw01B hastd: [zfshast] (init) We act as init for the
 h> resource and not as secondary as requested by
 h> tcp4://211.19.53.201:49777.
 h> Jul  2 12:31:58 fw01B kernel: Jul  2 12:31:58 fw01B hastd: [zfshast]
 h> (init) We act as init for the resource and not as secondary as
 h> requested by tcp4://211.19.53.201:49777.
 h> Jul  2 12:31:58 fw01B hastd: [zfshast] (init) Role changed to secondary.
 h> Jul  2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
 h> Jul  2 12:32:03 fw01B hastd: Connection from
 h> tcp4://211.19.53.201:17014 to tcp4://211.19.53.206:8457.
 h> Jul  2 12:32:03 fw01B hastd: tcp4://211.19.53.201:17014: resource=zfshast
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Initial connection
 h> from tcp4://211.19.53.201:17014.
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Incoming connection
 h> from tcp4://211.19.53.201:17014 configured.
 h> Jul  2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
 h> Jul  2 12:32:03 fw01B hastd: Connection from
 h> tcp4://211.19.53.201:42420 to tcp4://211.19.53.206:8457.
 h> Jul  2 12:32:03 fw01B hastd: tcp4://211.19.53.201:42420: resource=zfshast
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Outgoing connection
 h> to tcp4://211.19.53.201:42420 configured.
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) hastd_secondary
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_local()
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) init_local
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Obtained info about
 h> /dev/ad4p4.
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4.
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) inside metadata.c
 h> res->hr_role !=HAST_ROLE_PRIMAR
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) inside mettadata
 h> secondary_localcnt: 1 secondary_remotecnt: 0
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_remote()
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) init_remote()
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) humhum secondary
 h> local 1:  secondary remote 0
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) init
 h> hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_remotecnt
 h> 0, primary_remotecnt 0
 h> Jul  2 12:32:03 fw01B kernel: Jul  2 12:32:03 fw01B hastd: [zfshast]
 h> (secondary) secondary_remotecnt 0, primary_remotecnt 0
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_localcnt
 h> 1, primary_localcnt 1
 h> Jul  2 12:32:03 fw01B kernel: Jul  2 12:32:03 fw01B hastd: [zfshast]
 h> (secondary) secondary_localcnt 1, primary_localcnt 1
 h> Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Split-brain
 h> detected, exiting.
 h> Jul  2 12:32:03 fw01B kernel: Jul  2 12:32:03 fw01B hastd: [zfshast]
 h> (secondary) Split-brain detected, exiting.

So you have:

secondary localcnt:	1
secondary remotecnt:	0
primary localcnt:	1
primary remotecnt:	0

This is a split-brain condition as described on wiki: primary's localcnt is
greater than secondary's remotecnt (primary [fw01A] was modified while fw01B
wasn't watching) and secondary's localcnt is greater than primary's remotecnt
(fw01B was modified while fw01A wasn't watching).

 h> Hope this logs can help you ! If you need to make me debug bit more,
 h> give me some idea to check!

Actually the logs you have provided are not very interesting as they shows the
state after bad things happened. It is more interesting to look at the logs
(both hosts) before split brain.

I would recommend:

1) Configure hast manually and ensure that both primary and secondary function
properly and data are synchronized between the nodes. Also make sure the clock
on both hosts is in sync (needed when comparing logs).

2) Reboot both servers so your carp/hast setup auto starts and see what
happens.

3) If it sets primary and secondary automatically and status is ok on both
nodes initiate switching to failover.

4) If after switching (or earlier) split brain is detected, provide logs from
both nodes since hosts reboot.

-- 
Mikolaj Golub


More information about the freebsd-fs mailing list