HAST and CARP

hiroshi at soupacific.com
Fri Jul 2 05:16:11 UTC 2010


Thanks for your info.

 > Doing 'hastctl create' on every switching is wrong. Note, after 'hastctl
 > create' hast metadata on the disk are lost and synchronization of all
 > blocks

That's what I was afraid of!

> Split-brain means that you or your scripts did something wrong: both nodes
> acted as primary (either simultaneously or one then another but there was no
> communication between them so both made changes to the data that was not
> synced to another node).
>

 > The easy way to get split-brain is to change the role on secondary to
 > primary without changing the role on the primary host, make some changes
 > on the secondary (acting as a primary) and change back its role to
 > secondary.

I checked that communication between the two nodes is established. Here is
the log from a run without hastctl create.


It seems ServerB became MASTER once and then went back to BACKUP.
I think this is what caused the split-brain.
hastctl dump shows prevrole: primary

The debug log for the error case is this:

Jul  2 12:31:37 fw01B kernel: Clearing /tmp (X related).
Jul  2 12:31:37 fw01B kernel: Updating motd:
Jul  2 12:31:37 fw01B kernel: .
Jul  2 12:31:37 fw01B kernel: Configuring syscons:
Jul  2 12:31:37 fw01B kernel: blanktime
Jul  2 12:31:37 fw01B kernel: .
Jul  2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.206) failed: 2
Jul  2 12:31:38 fw01B sm-mta[879]: gethostbyaddr(211.19.53.202) failed: 2
Jul  2 12:31:38 fw01B sm-mta[880]: starting daemon (8.14.4): 
SMTP+queueing at 00:30:00
Jul  2 12:31:38 fw01B sm-msp-queue[884]: starting daemon (8.14.4): 
queueing at 00:30:00
Jul  2 12:31:38 fw01B kernel: Starting cron.
Jul  2 12:31:38 fw01B kernel: Starting background file system checks in 
60 seconds.
Jul  2 12:31:38 fw01B kernel:
Jul  2 12:31:38 fw01B kernel: Fri Jul  2 12:31:38 UTC 2010
Jul  2 12:31:40 fw01B kernel: carp0: INIT -> BACKUP
Jul  2 12:31:40 fw01B kernel: alc0: link state changed to UP
Jul  2 12:31:40 fw01B kernel: carp0: 2 link states coalesced
Jul  2 12:31:40 fw01B kernel: carp0: link state changed to DOWN
Jul  2 12:31:43 fw01B login: login on ttyv0 as root
Jul  2 12:31:43 fw01B login: ROOT LOGIN (root) ON ttyv0
Jul  2 12:31:43 fw01B kernel: Jul  2 12:31:43 fw01B login: ROOT LOGIN 
(root) ON ttyv0
Jul  2 12:31:48 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 12:31:48 fw01B hastd: Connection from tcp4://211.19.53.201:20070 
to tcp4://211.19.53.206:8457.
Jul  2 12:31:48 fw01B hastd: tcp4://211.19.53.201:20070: resource=zfshast
Jul  2 12:31:48 fw01B hastd: [zfshast] (init) We act as init for the 
resource and not as secondary as requested by tcp4://211.19.53.201:20070.
Jul  2 12:31:48 fw01B kernel: Jul  2 12:31:48 fw01B hastd: [zfshast] 
(init) We act as init for the resource and not as secondary as requested 
by tcp4://211.19.53.201:20070.
Jul  2 12:31:53 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 12:31:53 fw01B hastd: Connection from tcp4://211.19.53.201:11542 
to tcp4://211.19.53.206:8457.
Jul  2 12:31:53 fw01B hastd: tcp4://211.19.53.201:11542: resource=zfshast
Jul  2 12:31:53 fw01B hastd: [zfshast] (init) We act as init for the 
resource and not as secondary as requested by tcp4://211.19.53.201:11542.
Jul  2 12:31:53 fw01B kernel: Jul  2 12:31:53 fw01B hastd: [zfshast] 
(init) We act as init for the resource and not as secondary as requested 
by tcp4://211.19.53.201:11542.
Jul  2 12:31:58 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 12:31:58 fw01B hastd: Connection from tcp4://211.19.53.201:49777 
to tcp4://211.19.53.206:8457.
Jul  2 12:31:58 fw01B hastd: tcp4://211.19.53.201:49777: resource=zfshast
Jul  2 12:31:58 fw01B hastd: [zfshast] (init) We act as init for the 
resource and not as secondary as requested by tcp4://211.19.53.201:49777.
Jul  2 12:31:58 fw01B kernel: Jul  2 12:31:58 fw01B hastd: [zfshast] 
(init) We act as init for the resource and not as secondary as requested 
by tcp4://211.19.53.201:49777.
Jul  2 12:31:58 fw01B hastd: [zfshast] (init) Role changed to secondary.
Jul  2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 12:32:03 fw01B hastd: Connection from tcp4://211.19.53.201:17014 
to tcp4://211.19.53.206:8457.
Jul  2 12:32:03 fw01B hastd: tcp4://211.19.53.201:17014: resource=zfshast
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Initial connection 
from tcp4://211.19.53.201:17014.
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Incoming connection 
from tcp4://211.19.53.201:17014 configured.
Jul  2 12:32:03 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 12:32:03 fw01B hastd: Connection from tcp4://211.19.53.201:42420 
to tcp4://211.19.53.206:8457.
Jul  2 12:32:03 fw01B hastd: tcp4://211.19.53.201:42420: resource=zfshast
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Outgoing connection 
to tcp4://211.19.53.201:42420 configured.
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) hastd_secondary
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_local()
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) init_local
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Obtained info about 
/dev/ad4p4.
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4.
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) inside metadata.c 
res->hr_role !=HAST_ROLE_PRIMAR
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) inside mettadata 
secondary_localcnt: 1 secondary_remotecnt: 0
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) calling init_remote()
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) init_remote()
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) humhum secondary 
local 1:  secondary remote 0
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) init 
hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_remotecnt 
0, primary_remotecnt 0
Jul  2 12:32:03 fw01B kernel: Jul  2 12:32:03 fw01B hastd: [zfshast] 
(secondary) secondary_remotecnt 0, primary_remotecnt 0
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) secondary_localcnt 1, 
primary_localcnt 1
Jul  2 12:32:03 fw01B kernel: Jul  2 12:32:03 fw01B hastd: [zfshast] 
(secondary) secondary_localcnt 1, primary_localcnt 1
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Split-brain detected, 
exiting.
Jul  2 12:32:03 fw01B kernel: Jul  2 12:32:03 fw01B hastd: [zfshast] 
(secondary) Split-brain detected, exiting.
Jul  2 12:32:03 fw01B hastd: [zfshast] (secondary) Worker process exited 
ungracefully (pid=979, exitcode=78).
Jul  2 12:32:03 fw01B kernel: Jul  2 12:32:03 fw01B hastd: [zfshast] 
(secondary) Worker process exited ungracefully (pid=979, exitcode=78).
Jul  2 12:32:08 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 12:32:08 fw01B hastd: Connection from tcp4://211.19.53.201:53033 
to tcp4://211.19.53.206:8457.
Jul  2 12:32:08 fw01B hastd: tcp4://211.19.53.201:53033: resource=zfshast
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) Initial connection 
from tcp4://211.19.53.201:53033.
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) Incoming connection 
from tcp4://211.19.53.201:53033 configured.
Jul  2 12:32:08 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 12:32:08 fw01B hastd: Connection from tcp4://211.19.53.201:50656 
to tcp4://211.19.53.206:8457.
Jul  2 12:32:08 fw01B hastd: tcp4://211.19.53.201:50656: resource=zfshast
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) Outgoing connection 
to tcp4://211.19.53.201:50656 configured.
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) hastd_secondary
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) calling init_local()
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) init_local
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) Obtained info about 
/dev/ad4p4.
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4.
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) inside metadata.c 
res->hr_role !=HAST_ROLE_PRIMAR
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) inside mettadata 
secondary_localcnt: 1 secondary_remotecnt: 0
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) calling init_remote()
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) init_remote()
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) humhum secondary 
local 1:  secondary remote 0
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) init 
hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) secondary_remotecnt 
0, primary_remotecnt 0
Jul  2 12:32:08 fw01B kernel: Jul  2 12:32:08 fw01B hastd: [zfshast] 
(secondary) secondary_remotecnt 0, primary_remotecnt 0
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) secondary_localcnt 1, 
primary_localcnt 1
Jul  2 12:32:08 fw01B kernel: Jul  2 12:32:08 fw01B hastd: [zfshast] 
(secondary) secondary_localcnt 1, primary_localcnt 1
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) Split-brain detected, 
exiting.
Jul  2 12:32:08 fw01B kernel: Jul  2 12:32:08 fw01B hastd: [zfshast] 
(secondary) Split-brain detected, exiting.
Jul  2 12:32:08 fw01B hastd: [zfshast] (secondary) Worker process exited 
ungracefully (pid=980, exitcode=78).
Jul  2 12:32:08 fw01B kernel: Jul  2 12:32:08 fw01B hastd: [zfshast] 
(secondary) Worker process exited ungracefully (pid=980, exitcode=78).
Jul  2 12:32:13 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.

For comparison, this is the debug log when it works properly:

Jul  2 10:24:10 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 10:24:10 fw01B hastd: tcp4://211.19.53.201:26965: resource=zfshast
Jul  2 10:24:15 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 10:24:15 fw01B hastd: tcp4://211.19.53.201:50280: resource=zfshast
Jul  2 10:24:20 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 10:24:20 fw01B hastd: tcp4://211.19.53.201:27929: resource=zfshast
Jul  2 10:24:25 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 10:24:25 fw01B hastd: tcp4://211.19.53.201:24357: resource=zfshast
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) Initial connection 
from tcp4://211.19.53.201:24357.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) Incoming connection 
from tcp4://211.19.53.201:24357 configured.
Jul  2 10:24:25 fw01B hastd: Accepting connection to tcp4://0.0.0.0:8457.
Jul  2 10:24:25 fw01B hastd: tcp4://211.19.53.201:18217: resource=zfshast
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) Outgoing connection 
to tcp4://211.19.53.201:18217 configured.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) hastd_secondary
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) calling init_local()
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) init_local
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) Obtained info about 
/dev/ad4p4.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) Locked /dev/ad4p4.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) inside metadata.c 
res->hr_role !=HAST_ROLE_PRIMAR
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) inside mettadata 
secondary_localcnt: 0 secondary_remotecnt: 0
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) calling init_remote()
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) init_remote()
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) humhum secondary 
local 0:  secondary remote 0
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) init 
hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) recv: Taking free 
request.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) disk: Taking request.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) disk: No requests, 
waiting.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) recv: (0x8013ea2e0) 
Got request.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) send: Taking request.
Jul  2 10:24:25 fw01B hastd: [zfshast] (secondary) send: No requests, 
waiting.

Jul  2 10:24:26 fw01B hastd: [zfshast] (secondary) disk: (0x8013ea2e0) 
Moving request to the send queue.
Jul  2 10:24:26 fw01B hastd: [zfshast] (secondary) disk: Taking request.
Jul  2 10:24:26 fw01B hastd: [zfshast] (secondary) disk: No requests, 
waiting.
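
The counters printed in the two logs above can be compared with a small
script. This is only a sketch: the rule in looks_like_split_brain() is an
assumption chosen because it is consistent with both logs, it is not copied
from the hastd source, and the function names are made up for illustration.

```python
import re

# Matches counter names as printed in the debug lines above, with or without
# the "hr_" prefix and with or without a colon before the value.
COUNTER_RE = re.compile(
    r"(?:hr_)?((?:primary|secondary)_(?:local|remote)cnt):?\s+(\d+)"
)


def parse_counters(log_text):
    """Return the last value seen for each counter in the log text."""
    counters = {}
    for name, value in COUNTER_RE.findall(log_text):
        counters[name] = int(value)
    return counters


def looks_like_split_brain(c):
    # ASSUMED rule (not verified against hastd): both local counters are
    # ahead of the matching counters reported by the remote node.
    # Counters missing from the log are treated as 0.
    return (
        c.get("secondary_localcnt", 0) > c.get("primary_remotecnt", 0)
        and c.get("primary_localcnt", 0) > c.get("secondary_remotecnt", 0)
    )


# Values taken from the 12:32:03 log (the failing case).
failing = """secondary_remotecnt 0, primary_remotecnt 0
secondary_localcnt 1, primary_localcnt 1"""

# Values taken from the 10:24:25 log (the working case).
working = """secondary_localcnt: 0 secondary_remotecnt: 0
hr_secondary_remotecnt: 0 hr_primary_remotecnt: 0"""

print(looks_like_split_brain(parse_counters(failing)))
print(looks_like_split_brain(parse_counters(working)))
```

The working log does not print primary_localcnt, but under the assumed rule
the result is the same for any value, because secondary_localcnt (0) is not
greater than primary_remotecnt (0). The exitcode=78 in the failing log is the
value of EX_CONFIG in sysexits.h.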

Should 'hastctl role secondary xxx' reset some value when the node goes from
MASTER back to BACKUP?

I hope these logs help. If you need me to debug further, let me know what
to check.

Thanks

Hiroshi Katayama


