HAST: primary might get stuck when there are connectivity problems
with secondary
Mikolaj Golub
to.my.trociny at gmail.com
Wed Apr 21 09:02:24 UTC 2010
Hi,
I can make the HAST primary get stuck by making the secondary inaccessible
(network packets are lost) for some period of time. I run HAST in VirtualBox
hosts, so to emulate a network outage I just change the bridge interface in
the VirtualBox configuration.
Below are details for one example.
On the primary before the outage we have:
sockstat:
root hastd 1571 10 tcp4 172.20.66.201:41841 172.20.66.202:8457
root hastd 1571 11 tcp4 172.20.66.201:57596 172.20.66.202:8457
During and after the outage sockstat shows the same, while netstat shows:
tcp4 0 0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 8307 172.20.66.201.41841 172.20.66.202.8457 ESTABLISHED
(note the non-zero value in the send queue) and then, later:
tcp4 0 0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 0 172.20.66.201.41841 172.20.66.202.8457 CLOSED
Restoring the network after this changes nothing: the primary stays stuck.
There are no messages in the log, and the "dirty" value in the status output
does not change:
[root at hasta ~]# hastctl status
storage:
role: primary
provname: storage
localpath: /dev/ad4
extentsize: 2097152
keepdirty: 64
remoteaddr: 172.20.66.202
replication: memsync
status: complete
dirty: 2097152 bytes
On the secondary, netstat shows all this time:
tcp4 0 0 172.20.66.202.8457 172.20.66.201.57596 ESTABLISHED
tcp4 0 0 172.20.66.202.8457 172.20.66.201.41841 ESTABLISHED
The last messages in the log are:
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Request received from the kernel: READ(13565952, 65536).
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Moving request to the send queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Got free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Waiting for request from the kernel.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Moving request to the done queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Moving request to the free queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Request received from the kernel: READ(1812529152, 65536).
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Moving request to the send queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Got free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Waiting for request from the kernel.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Moving request to the done queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Moving request to the free queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: Taking request.
The backtrace of the stuck hastd is attached.
I interpret this in the following way. Although the network is down,
hast_proto_send() in remote_send_thread() returns success (the sent data is
merely stored in the kernel's socket buffer). The kernel then tries to
transmit the data, eventually fails after a timeout, and closes the socket.
hastd is not aware of this: remote_send_thread() is blocked in "Taking
request" at this point, and the sync thread is waiting for a status reply
from the secondary about the sent data, but the secondary never sends one
because it never received any data.
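hastd itself relies on the system-wide keepalive sysctls, but the same idea
can be expressed per socket. A minimal sketch of enabling TCP keepalive with
a shorter idle threshold, so a half-dead connection like the one above is
detected in minutes rather than hours (set_keepalive_idle is a hypothetical
helper, not part of hastd; TCP_KEEPIDLE is available on FreeBSD 9.1+ and
Linux):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Enable TCP keepalive on a socket and lower the idle threshold from
 * the 2-hour default to idle_sec seconds, so a peer whose connection
 * was silently closed is noticed after roughly idle_sec seconds plus
 * the probe interval.  Returns 0 on success, -1 on error (errno is
 * set by setsockopt()).  TCP_KEEPIDLE takes seconds.
 */
int
set_keepalive_idle(int fd, int idle_sec)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on,
	    sizeof(on)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec,
	    sizeof(idle_sec)) == -1)
		return (-1);
	return (0);
}
```

Calling something like set_keepalive_idle(fd, 300) on the secondary's
accepted socket would have the same effect as the sysctl workaround below,
but scoped to hastd's connections only.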
Restarting hastd on the secondary usually helps. A workaround is to set
net.inet.tcp.keepidle to some small value (e.g. 300 seconds) on the
secondary. Then the secondary notices much earlier that the peer has
closed the connection and restarts the worker itself:
Apr 21 11:52:21 hastb hastd: [storage] (secondary) Unable to receive request header: Connection reset by peer.
Apr 21 11:52:21 hastb hastd: [storage] (secondary) Worker process (pid=1398) exited ungracefully: status=19200.
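For reference, the workaround above as a sysctl.conf fragment (values are
illustrative; FreeBSD reports net.inet.tcp.keepidle in milliseconds, so
300 seconds is 300000):

```shell
# /etc/sysctl.conf on the secondary (example values).
# Start sending keepalive probes after 300 s of idle time instead of
# the 2-hour default (units are milliseconds).
net.inet.tcp.keepidle=300000
# Keepalive probes are only sent when this is enabled; it is the
# default on FreeBSD, listed here for completeness.
net.inet.tcp.always_keepalive=1
```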
--
Mikolaj Golub
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bt.log
Type: application/octet-stream
Size: 5719 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20100421/fc3f46ac/bt.obj