HAST: primary might get stuck when there are connectivity problems with secondary

Pawel Jakub Dawidek pjd at FreeBSD.org
Sat Apr 24 07:30:43 UTC 2010


On Fri, Apr 23, 2010 at 04:00:05PM +0300, Mikolaj Golub wrote:
> This does not look for me just "confusing" :-). When I wrote that hast had
> gotten stuck I meant that all operations with FS above hast (ZFS in my case)
> had hanged and this lasted until hastd on the secondary was restarted.

Ok, I see.

> May be I miss something but I think the following happens when there is
> network outage:
> 
> remote_send_thread():
> 
> 1) the request is moved to recv queue;
> 
> 2) the request is sent to secondary calling hast_proto_send();
> 
> 3) hast_proto_send() returns OK although data are not sent yet (they are
> buffered by kernel) so the request is not removed from recv queue and not
> moved immediately to done queue as it would be in the case when
> hast_proto_send() returned error.
> 
> The kernel is trying to send the request for some time, fails by timeout and
> close the connection.
> 
> remote_recv_thread():
> 
> 1) takes the request from recv queue;
> 
> 2) waits for the respond from the secondary in hast_proto_recv_hdr() which
> will never come because the secondary has not received the request and has
> nothing to reply.

If secondary is not going to reply, hast_proto_recv_hdr() should
eventually timeout. On timeout, connection should be closed and this
requests (and all the others) should be moved to done queue.

It doesn't timeout at all or maybe the timeout is too long? 

> So we have a request or several gotten stuck in in the recv queue. It looks
> like FS is waiting for the response on this request(s) sent by
> ggate_send_thread() and gets stuck.
> 
> Below are logs for a similar example I made today:
> 
> The last messages before "outage":
> 
> Apr 23 14:10:20 hasta hastd: [storage] (primary) ggate_send: (0x28411c40) Got request.
> Apr 23 14:10:20 hasta hastd: [storage] (primary) ggate_send: (0x28411c40) Moving request to the free queue.
> Apr 23 14:10:20 hasta hastd: [storage] (primary) ggate_send: Taking request.
> 
> During "outage":
> 
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c80) Request received from the kernel: WRITE(1925971968, 131072).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c80) Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0) Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0) Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0) Request received from the kernel: WRITE(1925348352, 1024).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411cc0) Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00) Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00) Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00) Request received from the kernel: WRITE(922860544, 1024).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411c80) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411c80) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411c80) Moving request to the recv queue.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: Taking request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411cc0) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) remote_send: (0x28411cc0) Moving request to the recv queue.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411c00) Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Request received from the kernel: WRITE(1926103040, 131072).
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Moving request to the send queues.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Got free request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Waiting for request from the kernel.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411cc0) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411c00) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Got request.
> Apr 23 14:10:49 hasta hastd: [storage] (primary) local_send: Taking request.
> 
> During "outage" two requests 0x28411c80 and 0x28411cc0 was received,
> remote_send reported that it had succesfully sent the requests and moved them
> to recv queue. And the requests got stuck there

IMHO hast_proto_recv_hdr() should timeout, but I don't change default
timeout, which might be way too long.

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20100424/07995122/attachment.pgp


More information about the freebsd-fs mailing list