HAST: primary might get stuck when there are connectivity problems with secondary

Sat Apr 24 11:33:58 UTC 2010

On Sat, 24 Apr 2010 09:30:31 +0200 Pawel Jakub Dawidek wrote:

> If secondary is not going to reply, hast_proto_recv_hdr() should
> eventually timeout. On timeout, connection should be closed and this
> requests (and all the others) should be moved to done queue.
>
> It doesn't timeout at all or maybe the timeout is too long? 

After the "outage" we have:

on the primary:

tcp4       0      0 172.20.66.201.57596    172.20.66.202.8457     ESTABLISHED
tcp4       0      0 172.20.66.201.41841    172.20.66.202.8457     CLOSED

on the secondary:

tcp4       0      0 172.20.66.202.8457     172.20.66.201.57596    ESTABLISHED
tcp4       0      0 172.20.66.202.8457     172.20.66.201.41841    ESTABLISHED

So one of the connections (used by primary/remote_send_thread()) is broken
(although the secondary is not aware of this, as it is in recv() at that
time), and the second connection (used by primary/remote_recv_thread()) is
alive.

It does time out after net.inet.tcp.keepidle (which is 2 hours by default),
when the secondary starts to send keepalive packets. The secondary receives
an RST in reply to its keepalive packet, recv() returns with an error, and
the worker is restarted.

As I wrote in my first message, the workaround is to set
net.inet.tcp.keepidle to some small value on the secondary so that it
notices a broken connection much earlier.
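
For illustration only, the same idea can be sketched per socket instead of
system-wide. This is a minimal sketch, not taken from the HAST sources: it
assumes the platform provides the TCP_KEEPIDLE socket option (given in
seconds), and enable_fast_keepalive() is a hypothetical helper name.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of hastd): turn on TCP keepalive for one
 * socket and make the first probe go out after idle_secs seconds of
 * silence, instead of waiting for the system-wide default.
 * Assumes TCP_KEEPIDLE is available on the target platform.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()).
 */
static int
enable_fast_keepalive(int fd, int idle_secs)
{
	int on = 1;

	assert(idle_secs > 0);

	/* Keepalive probes are only sent if SO_KEEPALIVE is enabled. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	/* Idle time before the first probe, for this socket only. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
	    sizeof(idle_secs)) == -1)
		return (-1);
	return (0);
}
```

Calling enable_fast_keepalive(fd, 30) on an accepted connection would ask
for the first probe after 30 seconds of idle time on that connection only,
leaving other sockets on the system untouched.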

From the code I don't see how hast_proto_recv_hdr() can time out if the
connection is alive. Have I missed something?
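
To make the question concrete: in general, a blocking recv() on a live but
silent connection returns early only if a receive timeout has been armed on
the socket. The sketch below shows one common way to do that with
SO_RCVTIMEO. It is an assumption-laden illustration, not a description of
what hastd does; set_recv_timeout() and recv_times_out_on_idle_peer() are
hypothetical names, and a local socket pair stands in for the TCP
connection.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of hastd): arm a receive timeout so that a
 * blocking recv() gives up after `seconds`, even if the peer is alive and
 * simply not sending anything.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()).
 */
static int
set_recv_timeout(int fd, int seconds)
{
	struct timeval tv;

	/* A zero timeval would disable the timeout, so insist on > 0. */
	assert(seconds > 0);

	tv.tv_sec = seconds;
	tv.tv_usec = 0;
	return (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)));
}

/*
 * Hypothetical demo: the peer end of a connected socket pair stays open but
 * never writes. Returns 1 if recv() failed with a timeout error, 0 if it
 * returned for any other reason, -1 if the setup failed.
 */
static int
recv_times_out_on_idle_peer(void)
{
	int sv[2], timed_out;
	char buf[16];
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	if (set_recv_timeout(sv[0], 1) == -1) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}

	/* Blocks for about one second, then fails instead of hanging. */
	n = recv(sv[0], buf, sizeof(buf), 0);
	/* Record the result before close() has a chance to change errno. */
	timed_out = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

	close(sv[0]);
	close(sv[1]);
	return (timed_out);
}
```

Without the set_recv_timeout() call, the same recv() would block for as long
as the peer stays connected and silent, which matches the behaviour
described above.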

-- 
Mikolaj Golub