HAST: primary might get stuck when there are connectivity
problems with secondary
Mikolaj Golub
to.my.trociny at gmail.com
Sat Apr 24 11:33:58 UTC 2010
On Sat, 24 Apr 2010 09:30:31 +0200 Pawel Jakub Dawidek wrote:
> If secondary is not going to reply, hast_proto_recv_hdr() should
> eventually timeout. On timeout, connection should be closed and this
> request (and all the others) should be moved to the done queue.
>
> It doesn't timeout at all or maybe the timeout is too long?
After "outage" we have:
on the primary:
tcp4 0 0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 0 172.20.66.201.41841 172.20.66.202.8457 CLOSED
on the secondary:
tcp4 0 0 172.20.66.202.8457 172.20.66.201.57596 ESTABLISHED
tcp4 0 0 172.20.66.202.8457 172.20.66.201.41841 ESTABLISHED
So one of the connections (used by primary/remote_send_thread()) is broken
(although the secondary is not aware of this, since it is in recv() at that
time), and the second connection (used by primary/remote_recv_thread()) is alive.
It does time out after net.inet.tcp.keepidle (which is 2 hours by default), when
the secondary starts to send keepalive packets. The secondary receives an RST on
its keepalive packet, recv() returns with an error, and the worker is restarted.
As I wrote in my first letter, the workaround is to set net.inet.tcp.keepidle
to some small value on the secondary so that it notices a broken connection
much earlier.
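
For illustration only, the same effect could in principle be requested per
socket rather than system-wide. The sketch below is not HAST code:
enable_keepalive() is a hypothetical helper, and it assumes the system
provides the TCP_KEEPIDLE socket option (value in seconds), which is not
available everywhere. Where it is missing, only the system-wide
net.inet.tcp.keepidle setting applies.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper (not part of HAST): turn on keepalive probes for one
 * socket and, where supported, start probing after idle_secs seconds
 * without traffic. Returns 0 on success, -1 on error with errno set.
 */
static int
enable_keepalive(int fd, int idle_secs)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
#ifdef TCP_KEEPIDLE
	/* Per-socket idle time; assumed to be available on this system. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
	    sizeof(idle_secs)) == -1)
		return (-1);
#else
	(void)idle_secs;	/* Only the system-wide setting is used. */
#endif
	return (0);
}
```

With something like enable_keepalive(fd, 60) on the secondary's socket, a
dead peer would be probed after about a minute of silence instead of after
the system default.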
From the code I don't see how hast_proto_recv_hdr() can time out if the
connection is alive. Have I missed something?
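
To make the question concrete, here is a minimal sketch of one generic way a
blocking recv() can be made to give up on a connection that is alive but
silent: a receive timeout set with SO_RCVTIMEO. This is an assumption about
how such a timeout could look, not a description of what HAST does;
set_recv_timeout() and recv_timed_out() are hypothetical names.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>

/*
 * Hypothetical helper: make blocking receives on fd fail after secs
 * seconds without data. Returns 0 on success, -1 on error.
 */
static int
set_recv_timeout(int fd, int secs)
{
	struct timeval tv;

	tv.tv_sec = secs;
	tv.tv_usec = 0;
	return (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)));
}

/*
 * Hypothetical helper: returns 1 if recv() gave up because the timeout
 * expired, 0 if data arrived or the peer closed the connection, and -1 on
 * any other error.
 */
static int
recv_timed_out(int fd, void *buf, size_t len)
{
	ssize_t n;

	n = recv(fd, buf, len, 0);
	if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return (1);
	return (n == -1 ? -1 : 0);
}
```

Without a timeout like this (or keepalive probes), recv() on an established
but idle connection simply blocks, which matches the behaviour described
above.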
--
Mikolaj Golub