HAST instability
Mikolaj Golub
trociny at freebsd.org
Mon May 30 18:42:28 UTC 2011
On Mon, 30 May 2011 17:43:04 +0300 Daniel Kalchev wrote:
DK> Some further investigation:
DK> The HAST nodes do not disconnect when checksum is enabled (either
DK> crc32 or sha256).
DK> One strange thing is that there is never established TCP connection
DK> between both nodes:
DK> tcp4 0 0 10.2.101.11.48939 10.2.101.12.8457 FIN_WAIT_2
DK> tcp4 0 1288 10.2.101.11.57008 10.2.101.12.8457 CLOSE_WAIT
DK> tcp4 0 0 10.2.101.11.46346 10.2.101.12.8457 FIN_WAIT_2
DK> tcp4 0 90648 10.2.101.11.13916 10.2.101.12.8457 CLOSE_WAIT
DK> tcp4 0 0 10.2.101.11.8457 *.* LISTEN
This is normal. hastd uses each connection in only one direction, so it
calls shutdown() to close the unused direction. That is why the sockets
show up as FIN_WAIT_2 and CLOSE_WAIT rather than ESTABLISHED.
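
For illustration only, here is a minimal sketch of a half-closed TCP
connection on the loopback interface. It is not hastd code; the function
name half_close_demo and the payload are made up for the example. The side
that calls shutdown() on its sending direction ends up in FIN_WAIT_2, the
peer ends up in CLOSE_WAIT, and data still flows in the remaining direction.

```python
import socket


def half_close_demo():
    """Open a loopback TCP connection, close one direction, use the other."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()

    # The client only wants to receive on this connection, so it shuts
    # down its sending direction. This sends a FIN to the peer.
    client.shutdown(socket.SHUT_WR)

    # The server now reads end-of-file (an empty bytes object) ...
    eof = server.recv(1024)
    # ... but it can still send in the other direction.
    server.sendall(b"payload")
    received = client.recv(1024)

    client.close()
    server.close()
    return eof, received


if __name__ == "__main__":
    print(half_close_demo())
```

While such a connection is open, netstat on the client side would be
expected to list it as FIN_WAIT_2 and on the server side as CLOSE_WAIT.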
DK> When using sha256 one CPU core is 100% utilized by each hastd process,
DK> while 70-80MB/sec per HAST resource is being transferred (total of up
DK> to 140 MB/sec traffic for both);
DK> When using crc32 each CPU core is at 22% utilization;
DK> When using none as checksum, CPU usage is under 10%
I suppose that when checksum is enabled the bottleneck is the CPU, so the
traffic rate is lower and the problem is not triggered.
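
A rough way to see how much a checksum can limit the data rate is to time
it on one core. The sketch below is only an approximation under stated
assumptions: it uses Python's zlib.crc32 and hashlib.sha256, not the
implementations hastd links against, and the helper name
throughput_mb_per_s is made up. The 128 KiB block matches the 131072-byte
WRITE request in the log quoted in this message.

```python
import hashlib
import time
import zlib


def throughput_mb_per_s(checksum, block, rounds=200):
    """Return how many MB/s a single core can push through `checksum`."""
    start = time.perf_counter()
    for _ in range(rounds):
        checksum(block)
    # Guard against a zero interval on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(block) * rounds / elapsed / 1e6


if __name__ == "__main__":
    block = bytes(131072)  # 128 KiB of zeros
    print("crc32 :", throughput_mb_per_s(zlib.crc32, block))
    print("sha256:", throughput_mb_per_s(lambda b: hashlib.sha256(b).digest(), block))
```

If the sha256 figure comes out close to the observed 70-80MB/sec per
resource, that would be consistent with the CPU being the limit.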
DK> Eventually after many hours, got corrupted communication:
DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.
The "Hash mismatch" message suggests that you were actually using a
checksum at that point, weren't you?
DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive
DK> request data: No such file or directory.
DK> May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process
DK> exited ungracefully (pid=9827, exitcode=75).
DK> and
DK> May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive
DK> reply header: Operation timed out.
DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from
DK> 10.2.101.12.
DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send
DK> request (Broken pipe): WRITE(99128470016, 131072).
It looks a little different from what was in your first message.
Are the clocks in sync on both nodes?
I would like to look at the full logs for a fairly long period, covering
several cases, from both primary and secondary (with the clocks known to be
synchronized).
Also, it might be worth checking that there is no network packet corruption
(anything strange in netstat -di or netstat -s, or maybe copying large files
over the network and comparing checksums).
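
For the file comparison, something like the following sketch could be run
on both nodes after copying a large file. The function name file_sha256 and
the chunk size are arbitrary choices for the example; any checksum tool
would do just as well.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python3 thisscript.py /path/to/copied/file
    print(file_sha256(sys.argv[1]))
```

If the digests printed on the two nodes differ after a plain network copy,
that would point at corruption somewhere below HAST.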
--
Mikolaj Golub
More information about the freebsd-stable
mailing list