hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain

Sat Oct 2 16:26:07 UTC 2010

On Sat, 02 Oct 2010 15:20:58 +0300 Mikolaj Golub wrote:

 MG> After recent changes in hastd (I think r213006: Fix descriptor leaks) if
 MG> split-brain occurs hastd will abort in child_cleanup() on assertion
 MG> (res->hr_event != NULL).
 ...
 MG> So we have double close of res->hr_event. The first time it is closed when
 MG> parent detects that worker exited in main_loop(), and the second time when a
 MG> new connection from primary comes and the parent does cleanup after previously
 MG> terminated child before starting new one.

 MG> The straightforward fix is to check res->hr_event before closing, like in the
 MG> patch below.

 MG> -- 
 MG> Mikolaj Golub

 MG> Index: sbin/hastd/control.c
 MG> ===================================================================
 MG> --- sbin/hastd/control.c        (revision 213357)
 MG> +++ sbin/hastd/control.c        (working copy)
 MG> @@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
 MG>  
 MG>          proto_close(res->hr_ctrl);
 MG>          res->hr_ctrl = NULL;
 MG> -        proto_close(res->hr_event);
 MG> -        res->hr_event = NULL;
 MG> +        if (res->hr_event != NULL) {
 MG> +                proto_close(res->hr_event);
 MG> +                res->hr_event = NULL;
 MG> +        }
 MG>          res->hr_workerpid = 0;
 MG>  }
 MG>  

With this fix applied, another issue shows up. After a split-brain, `hastctl
status' on the secondary returns "[ERROR] Error 32 received from hastd" most
of the time; only occasionally does it print the status:

lolek# hastctl status storage
[ERROR] Error 32 received from hastd.
lolek# hastctl status storage
[ERROR] Error 32 received from hastd.
lolek# hastctl status storage
storage:
  role: secondary
  provname: storage
  localpath: /dev/ad4
  extentsize: 2097152
  keepdirty: 0
  remoteaddr: tcp4://bolek
  replication: memsync
  status: complete
  dirty: 0 bytes
lolek# hastctl status storage
[ERROR] Error 32 received from hastd.

This happens because hastd clears res->hr_workerpid only when a new connection
from the primary arrives. Meanwhile, control_status() checks res->hr_workerpid
and, if it is nonzero, tries to query the worker; when the worker is no longer
running, the request fails with error 32 (broken pipe).
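
To make the failure mode concrete, here is a minimal C model of it. It is only
a sketch and not hastd code: struct resource, worker_alive and
status_request() are names made up for illustration. The only things taken
from the description above are the nonzero-pid check and the resulting
error 32, which is EPIPE on FreeBSD.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, stripped-down stand-in for struct hast_resource. */
struct resource {
	pid_t	workerpid;	/* nonzero: the parent believes a worker exists */
	bool	worker_alive;	/* what is actually true about the worker */
};

/*
 * Model of a status request handled by the parent.
 * Returns 0 on success or an errno value on failure.
 */
static int
status_request(const struct resource *res)
{

	assert(res != NULL);
	if (res->workerpid == 0)
		return (0);	/* no worker recorded: answer from parent data */
	if (!res->worker_alive)
		return (EPIPE);	/* stale pid: the worker's pipe is already dead */
	return (0);		/* worker is running and answers the query */
}
```

With workerpid left nonzero after the worker has died, status_request()
returns EPIPE; once the pid is reset to 0, the same call succeeds.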

So it looks better not just to close res->hr_event in main_loop() but to do
the full child cleanup there, as soon as the worker's exit is detected.
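
The idea can be sketched in C roughly as follows. This is an illustration
under my own assumptions, not the patch itself: struct resource, its plain
file descriptors, resource_cleanup() and reap_worker() are hypothetical
stand-ins for hast_resource, the proto connections and child_cleanup(). The
point is that cleanup is safe to call twice and runs as soon as waitpid(2)
reports the exit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-resource state kept by the parent. */
struct resource {
	int	ctrl_fd;	/* stand-in for hr_ctrl, -1 when closed */
	int	event_fd;	/* stand-in for hr_event, -1 when closed */
	pid_t	workerpid;	/* stand-in for hr_workerpid, 0 when no worker */
};

/* Full cleanup; every step is guarded so a second call is harmless. */
static void
resource_cleanup(struct resource *res)
{

	assert(res != NULL);
	if (res->ctrl_fd != -1) {
		close(res->ctrl_fd);
		res->ctrl_fd = -1;
	}
	if (res->event_fd != -1) {
		close(res->event_fd);
		res->event_fd = -1;
	}
	res->workerpid = 0;
}

/*
 * Check without blocking whether the worker has exited and, if so, clean
 * up immediately. Returns 1 if the worker was reaped, 0 otherwise.
 */
static int
reap_worker(struct resource *res)
{
	int status;

	assert(res != NULL);
	if (res->workerpid == 0)
		return (0);
	if (waitpid(res->workerpid, &status, WNOHANG) != res->workerpid)
		return (0);
	resource_cleanup(res);
	return (1);
}
```

After reap_worker() returns 1 the recorded pid is 0, so a later status request
no longer tries to talk to a dead worker, and a later cleanup before starting
a new worker finds nothing left to close.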

What do you think about the attached patch?

-- 
Mikolaj Golub

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hast.child_kill.patch
Type: text/x-patch
Size: 5131 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20101002/9938d0fc/hast.child_kill.bin