HAST: primary might get stuck when there are connectivity problems with secondary

Mikolaj Golub to.my.trociny at gmail.com
Thu Apr 29 11:23:13 UTC 2010


On Thu, 29 Apr 2010 10:12:00 +0200 Pawel Jakub Dawidek wrote:

 PJD> On Thu, Apr 29, 2010 at 11:03:33AM +0300, Mikolaj Golub wrote:
 >> 
 >> On Wed, 28 Apr 2010 23:46:36 +0200 Pawel Jakub Dawidek wrote:
 >> 
 >>  PJD> Could you see if the following patch fixes the problem for you:
 >> 
 >>  PJD>         http://people.freebsd.org/~pjd/patches/hastd_timeout.patch
 >> 
 >>  PJD> The patch sets timeout on both incoming and outgoing sockets on primary
 >>  PJD> and on outgoing socket on secondary. Incoming socket on secondary is
 >>  PJD> left with no timeout to avoid problem you described above.
 >> 
 >> The patch works for me.
 >> 
 >> After disabling the network connection between the primary and the secondary
 >> FS operations on the primary do not get stuck and the following messages are
 >> observed:
 >> 
 >> Apr 29 10:37:41 hasta hastd: [storage] (primary) Unable to receive reply header: Resource temporarily unavailable.
 >> Apr 29 10:37:57 hasta hastd: [tank] (primary) Unable to receive reply header: Resource temporarily unavailable.
 >> Apr 29 10:37:57 hasta hastd: [tank] (primary) Unable to send request (Resource temporarily unavailable): WRITE(972292096, 14336).
 >> Apr 29 10:38:56 hasta hastd: [storage] (primary) Unable to connect to 172.20.66.202: Operation timed out.
 >> Apr 29 10:39:12 hasta hastd: [tank] (primary) Unable to connect to 172.20.66.202: Operation timed out.
 >> 
 >> After restoring the network connection the primary reconnects to the secondary
 >> and the status changes back from "degraded" to "complete".

 PJD> Good. And I assume you don't observe problems on secondary? Eg. recv(2)
 PJD> on secondary doesn't timeout?

No problems on the secondary. When emulating a network outage, the worker is
restarted after connectivity is restored, when a new connection comes from the
primary:

Apr 29 14:12:39 hastb hastd: Accepting connection to tcp4://0.0.0.0:8457.
Apr 29 14:12:39 hastb hastd: Connection from tcp4://172.20.66.202:8457 to tcp4://172.20.66.201:44508.
Apr 29 14:12:39 hastb hastd: tcp4://172.20.66.201:44508: resource=tank
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Initial connection from tcp4://172.20.66.201:44508.
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Worker process exists (pid=1729), stopping it.
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Worker process (pid=1729) exited gracefully.
Apr 29 14:12:39 hastb hastd: [tank] (secondary) Incoming connection from tcp4://172.20.66.201:44508 configured.

If the FS is idle (there is no I/O), the secondary waits in receive, does not
time out, and does not stop workers (as it did with my timeout patch).
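
For illustration only, here is a minimal sketch of the behaviour described
above. It is not taken from hastd or from the patch. It assumes the timeout is
a socket-level receive timeout (SO_RCVTIMEO), which would be consistent with
the "Resource temporarily unavailable" (EAGAIN) messages logged on the primary.
The helper names set_recv_timeout and try_recv are hypothetical, and packing
struct timeval as two native longs is an assumption that holds on common 64-bit
Unix-like systems.

```python
import errno
import socket
import struct


def set_recv_timeout(sock, seconds):
    """Set SO_RCVTIMEO on sock; a value of 0 means recv() blocks forever."""
    sec = int(seconds)
    usec = int(round((seconds - sec) * 1_000_000))
    # Assumption: struct timeval is two native longs (typical LP64 layout).
    tv = struct.pack("ll", sec, usec)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, tv)


def try_recv(sock, nbytes=16):
    """Return (data, None) on success or (None, errno) on failure."""
    try:
        return sock.recv(nbytes), None
    except OSError as exc:
        return None, exc.errno


def demo():
    # A connected local socket pair stands in for the primary/secondary link.
    a, b = socket.socketpair()
    with a, b:
        # Side "a" gets a receive timeout, like the sockets on the primary.
        set_recv_timeout(a, 0.2)
        # The peer is silent, so recv() gives up with EAGAIN after the timeout.
        _, idle_errno = try_recv(a)
        # Once the peer sends data again, the same socket receives normally.
        b.sendall(b"ping")
        data, _ = try_recv(a)
    return idle_errno, data


if __name__ == "__main__":
    idle_errno, data = demo()
    print(errno.errorcode.get(idle_errno), data)
```

A socket on which set_recv_timeout is never called (or is called with 0) keeps
blocking in recv() for as long as the peer is idle, which matches the intent of
leaving the incoming socket on the secondary without a timeout.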

-- 
Mikolaj Golub


More information about the freebsd-fs mailing list