hastd: possible race when a worker is starting
Pawel Jakub Dawidek
pjd at FreeBSD.org
Mon Sep 20 13:25:14 UTC 2010
On Sun, Sep 19, 2010 at 12:51:54PM +0300, Mikolaj Golub wrote:
> Hi,
>
> After recent changes in hastd we have the following scenario when a worker
> process is started:
>
> 1) worker process is created;
> 2) remote connections are established and CONNECT event is sent to parent;
> 3) threads (among them control thread) are started.
>
> Consider a sitution when hastctl request is recevied by parent between (1) and
> (2). The parent sees that worker is running and sends it a request and waits
> for a response. The request should be processed by control thread, but it is
> not started yet. Instead the worker sends CONNECT event and waits for a
> response from the parent. We have a deadlock.
>
> I was able to reproduce this scenario running a script that runs 'hastctl
> status' requests in background and periodically kills the worker to start a
> new one:
>
> ----------------------------------------------------------------------
>
> #!/bin/sh
>
> hastctl role primary storage
>
> while true; do
> hastctl status storage > /dev/null
> done &
>
> while sleep 1; do
> pid=`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'`
> if [ -n "${pid}" ]; then
> kill "${pid}"
> sleep 1;
> npid=`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'`
> if [ "${pid}" = "${npid}" ]; then
> echo "Process ${pid} was not killed" >&2
> exit 1
> fi
> fi
> done
>
> ----------------------------------------------------------------------
>
> [root at lolek ~]# ./test_event2.sh
> Process 17559 was not killed
>
> [root at lolek ~]# ps auxww |grep hast
> root 17563 2.0 0.3 10924 1764 1 S 1:03PM 0:00.01 hastctl status storage
> root 3334 0.0 0.5 11244 2488 ?? Ss 12:13PM 0:02.06 /sbin/hastd -ddd
> root 17559 0.0 0.5 11244 2520 ?? S 1:03PM 0:00.01 hastd: storage (primary) (hastd)
> root 17561 0.0 0.3 10924 1764 1 S 1:03PM 0:00.01 hastctl status storage
>
> [root at lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 3334
>
> (gdb) bt
> #0 0x28300ed5 in recvfrom () from /lib/libc.so.7
> #1 0x28286f52 in recv () from /lib/libc.so.7
> #2 0x0805f237 in proto_common_recv (fd=41, data=0xbfbfeaa7 "(", size=5)
> at /usr/src/sbin/hastd/proto_common.c:77
> #3 0x0805f68d in sp_recv (ctx=0x2850e550, data=0xbfbfeaa7 "(", size=5)
> at /usr/src/sbin/hastd/proto_socketpair.c:185
> #4 0x0805ec61 in proto_recv (conn=0x2850e540, data=0xbfbfeaa7, size=5)
> at /usr/src/sbin/hastd/proto.c:207
> #5 0x0804e42e in hast_proto_recv_hdr (conn=0x2850e540, nvp=0xbfbfeae4)
> at /usr/src/sbin/hastd/hast_proto.c:308
> #6 0x0804c83f in control_status_worker (res=0x284eb500, nvout=0x2850e4d0, no=0)
> at /usr/src/sbin/hastd/control.c:160
> #7 0x0804cc9f in control_status (cfg=0x28419600, nvout=0x2850e4d0, res=0x284eb500,
> name=0x2851042c "storage", no=0) at /usr/src/sbin/hastd/control.c:248
> #8 0x0804cfbb in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:348
> #9 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682
> #10 0x0805076d in main (argc=0, argv=0xbfbfecd0) at /usr/src/sbin/hastd/hastd.c:792
>
> [root at lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 17559
>
> (gdb) bt
> #0 0x28300ed5 in recvfrom () from /lib/libc.so.7
> #1 0x28286f52 in recv () from /lib/libc.so.7
> #2 0x0805f237 in proto_common_recv (fd=43, data=0xbfbfe9e7 "(", size=5)
> at /usr/src/sbin/hastd/proto_common.c:77
> #3 0x0805f68d in sp_recv (ctx=0x2850e570, data=0xbfbfe9e7 "(", size=5)
> at /usr/src/sbin/hastd/proto_socketpair.c:185
> #4 0x0805ec61 in proto_recv (conn=0x2850e560, data=0xbfbfe9e7, size=5)
> at /usr/src/sbin/hastd/proto.c:207
> #5 0x0804e42e in hast_proto_recv_hdr (conn=0x2850e560, nvp=0xbfbfea24)
> at /usr/src/sbin/hastd/hast_proto.c:308
> #6 0x0804db3c in event_send (res=0x284eb500, event=1) at /usr/src/sbin/hastd/event.c:69
> #7 0x0805a1d1 in init_remote (res=0x284eb500, inp=0x0, outp=0x0)
> at /usr/src/sbin/hastd/primary.c:671
> #8 0x0805a7f4 in hastd_primary (res=0x284eb500) at /usr/src/sbin/hastd/primary.c:810
> #9 0x0804c6d5 in control_set_role_common (cfg=0x28419600, nvout=0x2850e4d0, role=2 '\002',
> res=0x284eb500, name=0x28510442 "storage", no=0) at /usr/src/sbin/hastd/control.c:117
> #10 0x0804cf91 in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:344
> #11 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682
> #12 0x0805076d in main (argc=0, argv=0xbfbfecd0) at /usr/src/sbin/hastd/hastd.c:792
>
> This is for primary worker but I believe the same issue is with secondary.
>
> The fix would be e.g. to start control thread before sending any event.
Thanks for the report and analysis. Could you try this patch:
http://people.freebsd.org/~pjd/patches/hastd.patch
--
Pawel Jakub Dawidek http://www.wheelsystems.com
pjd at FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20100920/35fc5b3d/attachment.pgp
More information about the freebsd-fs
mailing list