hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain

Pawel Jakub Dawidek pjd at FreeBSD.org
Mon Oct 4 21:37:20 UTC 2010


On Sat, Oct 02, 2010 at 07:26:05PM +0300, Mikolaj Golub wrote:
> Running with this fix, another issue is observed. On split-brain, `hastctl
> status' on the secondary returns "[ERROR] Error 32 received from hastd" most
> of the time. Output is returned only on some runs.
> 
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> lolek# hastctl status storage
> storage:
>   role: secondary
>   provname: storage
>   localpath: /dev/ad4
>   extentsize: 2097152
>   keepdirty: 0
>   remoteaddr: tcp4://bolek
>   replication: memsync
>   status: complete
>   dirty: 0 bytes
> lolek# hastctl status storage
> [ERROR] Error 32 received from hastd.
> 
> This is because hastd clears res->hr_workerpid only when a new connection from
> the primary arrives, whereas control_status() checks res->hr_workerpid and, if
> it is not zero, tries to get info from the worker and returns an error
> (broken pipe) if the worker is actually not running.
> 
> So it looks like it is better not just to close res->hr_ctrl in main_loop()
> but to do a full child cleanup there, as soon as its exit is detected.
> 
> What do you think about the attached patch?

I see three problems:)

1. In child_kill() you always interpret the status value, even when it
   is invalid because of earlier errors.
2. While copying the code you changed style. Don't you like style(9)?:)
3. The patch doesn't fix the root cause of the problem.
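
To illustrate point 1, here is a minimal sketch (not the hastd code; the
helper name reap_child() is made up for this example) that looks at the
status only after waitpid(2) has actually succeeded:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Reap the given child and return its exit code, or -1 if waitpid(2)
 * failed or the child did not exit normally.
 */
static int
reap_child(pid_t pid)
{
	int status;
	pid_t ret;

	/* Retry if the call was interrupted by a signal. */
	do {
		ret = waitpid(pid, &status, 0);
	} while (ret == -1 && errno == EINTR);

	/* On failure 'status' was never filled in; don't interpret it. */
	if (ret == -1)
		return (-1);

	if (WIFEXITED(status))
		return (WEXITSTATUS(status));

	/* Killed by a signal or otherwise not a normal exit. */
	return (-1);
}
```

Calling it a second time for the same pid fails with ECHILD, which is
exactly the case where the status must not be examined.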

The real problem, which is also behind the "hastd: zombies after hooks"
issue you reported, is that sigprocmask(2) doesn't mask ignored signals.
SIGCHLD is ignored by default, so it was never reported. We first need
to install a dummy signal handler for SIGCHLD.

The fix I have here (which I'm going to commit after a bit more testing)
completely fixes the zombie hooks you observed and makes the window for
the 'Error 32 received from hastd' problem much smaller. You can still
see this message, because we can send a request to the child before we
know it has terminated, but it is not as visible as it was before.

Thanks for the report!

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!