hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Mikolaj Golub
to.my.trociny at gmail.com
Sat Oct 2 16:26:07 UTC 2010
On Sat, 02 Oct 2010 15:20:58 +0300 Mikolaj Golub wrote:
MG> After recent changes in hastd (I think r213006: Fix descriptor leaks) if
MG> split-brain occurs hastd will abort in child_cleanup() on assertion
MG> (res->hr_event != NULL).
...
MG> So we have double close of res->hr_event. The first time it is closed when
MG> parent detects that worker exited in main_loop(), and the second time when a
MG> new connection from primary comes and the parent does cleanup after previously
MG> terminated child before starting new one.
MG> The straightforward fix is to check res->hr_event before closing, like in the
MG> patch below.
MG> --
MG> Mikolaj Golub
MG> Index: sbin/hastd/control.c
MG> ===================================================================
MG> --- sbin/hastd/control.c (revision 213357)
MG> +++ sbin/hastd/control.c (working copy)
MG> @@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
MG>
MG>  	proto_close(res->hr_ctrl);
MG>  	res->hr_ctrl = NULL;
MG> -	proto_close(res->hr_event);
MG> -	res->hr_event = NULL;
MG> +	if (res->hr_event != NULL) {
MG> +		proto_close(res->hr_event);
MG> +		res->hr_event = NULL;
MG> +	}
MG>  	res->hr_workerpid = 0;
MG>  }
MG>
With this fix applied, another issue shows up. On split-brain, `hastctl
status' on the secondary returns "[ERROR] Error 32 received from hastd" most
of the time; only occasionally is the status output actually returned.
lolek# hastctl status storage
[ERROR] Error 32 received from hastd.
lolek# hastctl status storage
[ERROR] Error 32 received from hastd.
lolek# hastctl status storage
storage:
  role: secondary
  provname: storage
  localpath: /dev/ad4
  extentsize: 2097152
  keepdirty: 0
  remoteaddr: tcp4://bolek
  replication: memsync
  status: complete
  dirty: 0 bytes
lolek# hastctl status storage
[ERROR] Error 32 received from hastd.
This is because hastd clears res->hr_workerpid only when a new connection from
the primary arrives, while control_status() checks res->hr_workerpid and, if it
is not zero, tries to get the info from the worker. If the worker is not
actually running, this fails with error 32 (EPIPE, broken pipe).
So it looks better not just to close res->hr_ctrl in main_loop() but to do the
full child cleanup there, as soon as the worker's exit is detected.
What do you think about the attached patch?
--
Mikolaj Golub
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hast.child_kill.patch
Type: text/x-patch
Size: 5131 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20101002/9938d0fc/hast.child_kill.bin