hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain

Sun Oct 3 09:06:54 UTC 2010

On Sat, 02 Oct 2010 19:26:05 +0300 Mikolaj Golub wrote to Mikolaj Golub:

 MG> What do you think about the attached patch?

Here is an updated version of the patch :-)

1) I had not noticed previously that kill() was called with different
signals in different places. So in the new version I have
child_kill(res, sig) instead of child_kill(res).
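
To illustrate the idea of a helper that takes the signal as a parameter, here
is a minimal sketch. It is not the code from the attached patch: the name
child_kill_sketch() and the pid_t pointer argument are assumptions made for
this example (the real helper takes res), and errors go to stderr instead of
pjdlog.

```c
#include <sys/types.h>
#include <sys/wait.h>

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the patch itself): send `sig` to the worker
 * whose pid is stored in *workerpid, reap it, and clear the stored pid.
 * Returns 0 on success (or if there is no worker), -1 on failure.
 */
static int
child_kill_sketch(pid_t *workerpid, int sig)
{
	pid_t pid;

	assert(workerpid != NULL);
	pid = *workerpid;
	if (pid == 0)
		return (0);		/* No worker is running. */
	if (kill(pid, sig) < 0) {
		perror("Unable to kill worker process");
		return (-1);
	}
	/* Reap the worker; retry if waitpid() is interrupted. */
	while (waitpid(pid, NULL, 0) != pid) {
		if (errno != EINTR) {
			perror("Error while waiting for worker process");
			return (-1);
		}
	}
	*workerpid = 0;
	return (0);
}
```

A caller would then pick the signal per call site, for example
child_kill_sketch(&pid, SIGTERM) in one place and
child_kill_sketch(&pid, SIGKILL) in another.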

2) While testing hastd I hit another issue (which is not related to my patch
and also shows up with unpatched hastd). It is easy to hang hastd by running
the following test:

for i in `jot 1000`; do
        hastctl role primary storage
        hastctl role secondary storage
done

The other host should be configured as secondary and should have some fix for
the problem described initially in this thread (avoiding the double close of
res->hr_event) so that the secondary does not crash on the assertion.

hastd hangs when switching to secondary, after killing the primary worker and
while waiting for it:

root   1631   0.0  0.5  11244   2364  ??  Is   12:18PM   0:00.37 /sbin/hastd -ddd
root   1869   0.0  7.0  49004  35700  ??  I    12:37PM   0:01.42 hastd: storage (primary) (hastd)
root   1937   0.0  0.3  10924   1764  ??  I    12:37PM   0:00.02 /sbin/hastctl role secondary storage

lolek# gdb /usr/obj/usr/src/sbin/hastd/hastd 1631

[Switching to Thread 28404140 (LWP 100045)]
0x282ba689 in wait4 () from /lib/libc.so.7
(gdb) bt
#0  0x282ba689 in wait4 () from /lib/libc.so.7
#1  0x282912a3 in waitpid () from /lib/libc.so.7
#2  0x280df272 in waitpid () from /lib/libthr.so.3
#3  0x0804c6d4 in control_set_role_common (cfg=0x28419600, nvout=0x2850e0d0, role=3 '\003', 
    res=0x284eb500, name=0x284a3442 "storage", no=0) at /usr/src/sbin/hastd/control.c:115
#4  0x0804d001 in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:356
#5  0x08050555 in main_loop () at /usr/src/sbin/hastd/hastd.c:671
#6  0x08050a7f in main (argc=0, argv=0xbfbfed04) at /usr/src/sbin/hastd/hastd.c:784
(gdb) fr 3
#3  0x0804c6d4 in control_set_role_common (cfg=0x28419600, nvout=0x2850e0d0, role=3 '\003', 
    res=0x284eb500, name=0x284a3442 "storage", no=0) at /usr/src/sbin/hastd/control.c:115
115                     } else if (waitpid(res->hr_workerpid, NULL, 0) !=
(gdb) list
110             if (res->hr_workerpid != 0) {
111                     if (kill(res->hr_workerpid, SIGTERM) < 0) {
112                             pjdlog_errno(LOG_WARNING,
113                                 "Unable to kill worker process %u",
114                                 (unsigned int)res->hr_workerpid);
115                     } else if (waitpid(res->hr_workerpid, NULL, 0) !=
116                         res->hr_workerpid) {
117                             pjdlog_errno(LOG_WARNING,
118                                 "Error while waiting for worker process %u",
119                                 (unsigned int)res->hr_workerpid);

It looks like the worker does not die because guard_thread() is sending a
CONNECT event to the parent and waiting for a response.

lolek# gdb /usr/obj/usr/src/sbin/hastd/hastd 1869

Thread 8 (Thread 28404140 (LWP 100075)):
#0  0x28301ed7 in recvfrom () from /lib/libc.so.7
#1  0x28287f52 in recv () from /lib/libc.so.7
#2  0x0805f467 in proto_common_recv (fd=10, data=0xbfbfe967 "(", size=5)
    at /usr/src/sbin/hastd/proto_common.c:77
#3  0x0805f8bd in sp_recv (ctx=0x2850e110, data=0xbfbfe967 "(", size=5)
    at /usr/src/sbin/hastd/proto_socketpair.c:185
#4  0x0805ee91 in proto_recv (conn=0x2850e100, data=0xbfbfe967, size=5)
    at /usr/src/sbin/hastd/proto.c:207
#5  0x0804e4ae in hast_proto_recv_hdr (conn=0x2850e100, nvp=0xbfbfe9a4)
    at /usr/src/sbin/hastd/hast_proto.c:308
#6  0x0804dbbc in event_send (res=0x284eb500, event=1) at /usr/src/sbin/hastd/event.c:69
#7  0x0805a411 in init_remote (res=0x284eb500, inp=0xbfbfea34, outp=0xbfbfea30)
    at /usr/src/sbin/hastd/primary.c:661
#8  0x0805e48a in guard_one (res=0x284eb500, ncomp=1) at /usr/src/sbin/hastd/primary.c:1912
#9  0x0805e7e9 in guard_thread (arg=0x284eb500) at /usr/src/sbin/hastd/primary.c:1977
#10 0x0805ac5f in hastd_primary (res=0x284eb500) at /usr/src/sbin/hastd/primary.c:823
#11 0x0804c743 in control_set_role_common (cfg=0x28419600, nvout=0x2850e0d0, role=2 '\002', 
    res=0x284eb500, name=0x284a3442 "storage", no=0) at /usr/src/sbin/hastd/control.c:129
#12 0x0804d001 in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:356
#13 0x08050555 in main_loop () at /usr/src/sbin/hastd/hastd.c:671
#14 0x08050a7f in main (argc=0, argv=0xbfbfed04) at /usr/src/sbin/hastd/hastd.c:784

After changing the code so that the parent closes hr_event and hr_ctrl after
kill() but before waitpid(), the hang has not been observed.

-- 
Mikolaj Golub

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hast.child_kill.patch
Type: text/x-patch
Size: 5097 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20101003/2bdc780f/hast.child_kill.bin