HAST: primary might get stuck when there are connectivity problems with secondary

Sun Apr 25 11:17:28 UTC 2010

On Sat, 24 Apr 2010 14:33:53 +0300 Mikolaj Golub wrote:

> From the code I don't see how hast_proto_recv_hdr() may timeout if the
> connection is alive, have I missed something?

I did some experiments, adding code that sets the SO_RCVTIMEO socket option
(see the attached patch). It fixes this issue. After the timeout, the worker on
the secondary is restarted with the following error:

Apr 25 13:06:45 hastb hastd: [storage] (secondary) Unable to receive request header: Resource temporarily unavailable.
Apr 25 13:06:45 hastb hastd: [storage] (secondary) Worker process (pid=1243) exited ungracefully: status=19200.
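
For reference, the status=19200 value in the second log line can be decoded
with the standard wait(2) macros. This is a minimal sketch that assumes the
logged number is the raw wait status; the helper name exit_code_from_status()
is hypothetical and not part of hastd. Under that assumption 19200 (0x4B00)
means a normal exit with code 75, which is the value of EX_TEMPFAIL in
<sysexits.h>.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <sysexits.h>

/*
 * Hypothetical helper: extract the exit code from a wait(2)-style status.
 * Returns -1 if the process did not terminate through a normal exit.
 */
int
exit_code_from_status(int status)
{

	if (!WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status));
}
```

Calling exit_code_from_status(19200) yields 75, so the worker exited on its
own after the receive failure rather than being killed by a signal.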

On the other hand, when the FS is idle (there is no I/O at all), the worker is
restarted too, and the primary does not reconnect to the secondary until some
I/O appears. So it might not look very nice :-)
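
The idle case can be reproduced in isolation. The sketch below is not the
attached patch: it uses a local socketpair() instead of the TCP connection
hastd uses, and the function names set_recv_timeout() and idle_recv_errno()
as well as the 1 second timeout are assumptions made for illustration. It
shows that with SO_RCVTIMEO set, recv() on a healthy but idle connection
fails once the timeout expires, which is why an idle FS triggers the worker
restart.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <unistd.h>

/* Set a receive timeout on fd; returns 0 on success, -1 on error. */
int
set_recv_timeout(int fd, time_t seconds)
{
	struct timeval tv;

	tv.tv_sec = seconds;
	tv.tv_usec = 0;
	return (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)));
}

/*
 * Create a connected socket pair, send nothing over it, and try to
 * receive with a 1 second timeout. Returns the errno reported by
 * recv(), 0 if recv() unexpectedly succeeded, or -1 if setup failed.
 */
int
idle_recv_errno(void)
{
	int sv[2], error;
	char buf[16];
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	if (set_recv_timeout(sv[0], 1) == -1) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	/* The peer is alive but silent, so this blocks until the timeout. */
	n = recv(sv[0], buf, sizeof(buf), 0);
	error = (n == -1) ? errno : 0;
	close(sv[0]);
	close(sv[1]);
	return (error);
}
```

idle_recv_errno() returns EAGAIN after about one second, the same "Resource
temporarily unavailable" error that appears in the log above.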

Also note that I had to modify proto_common_recv() to get the timeout working.
After the timeout, recv() fails with errno set to EWOULDBLOCK, which has the
same value as EAGAIN on FreeBSD. The current proto_common_recv() restarts
recv() when it gets EAGAIN, so the timeout never reaches the caller.
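
A receive loop that lets the timeout through could look roughly like the
following. This is only a sketch of the idea and not the code from the
attached patch or from hastd; the name recv_all() and its return convention
are assumptions. It retries only when recv() is interrupted by a signal
(EINTR) and hands every other error, including EAGAIN from an expired
SO_RCVTIMEO, back to the caller.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: receive exactly size bytes from fd.
 * Returns 0 on success or an errno value on failure
 * (ENOTCONN if the peer closed the connection).
 */
int
recv_all(int fd, unsigned char *data, size_t size)
{
	ssize_t done;

	while (size > 0) {
		done = recv(fd, data, size, 0);
		if (done == 0)
			return (ENOTCONN);
		if (done == -1) {
			if (errno == EINTR)
				continue;	/* Interrupted by a signal: retry. */
			return (errno);		/* Includes EAGAIN on timeout. */
		}
		data += done;
		size -= (size_t)done;
	}
	return (0);
}
```

With this shape the caller decides what a timeout means, for example
whether to restart the worker or to treat an idle connection as healthy.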

-- 
Mikolaj Golub

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hastd.proto_tcp4.c.SO_RCVTIMEO.patch
Type: text/x-diff
Size: 2192 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20100425/c1573111/hastd.proto_tcp4.c.SO_RCVTIMEO.bin