HAST instability
Daniel Kalchev
daniel at digsys.bg
Mon May 30 14:43:16 UTC 2011
Some further investigation:
The HAST nodes do not disconnect when checksum is enabled (either crc32
or sha256).
One strange thing is that there is never an ESTABLISHED TCP connection
between the two nodes:
tcp4 0 0 10.2.101.11.48939 10.2.101.12.8457 FIN_WAIT_2
tcp4 0 1288 10.2.101.11.57008 10.2.101.12.8457 CLOSE_WAIT
tcp4 0 0 10.2.101.11.46346 10.2.101.12.8457 FIN_WAIT_2
tcp4 0 90648 10.2.101.11.13916 10.2.101.12.8457 CLOSE_WAIT
tcp4 0 0 10.2.101.11.8457 *.* LISTEN
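For anyone who wants to repeat this check over time, here is a rough sketch that tallies TCP states for the HAST port from netstat-style text. The function name and the six-column layout (proto, recv-q, send-q, local, foreign, state) are my assumptions based on the output above, not anything hastd provides.

```python
from collections import Counter

# Sample lines copied from the netstat output above.
NETSTAT_SAMPLE = """\
tcp4 0 0 10.2.101.11.48939 10.2.101.12.8457 FIN_WAIT_2
tcp4 0 1288 10.2.101.11.57008 10.2.101.12.8457 CLOSE_WAIT
tcp4 0 0 10.2.101.11.46346 10.2.101.12.8457 FIN_WAIT_2
tcp4 0 90648 10.2.101.11.13916 10.2.101.12.8457 CLOSE_WAIT
tcp4 0 0 10.2.101.11.8457 *.* LISTEN
"""


def tcp_states_for_port(netstat_text, port):
    """Count TCP states of sockets whose local or foreign address ends in .port."""
    suffix = "." + str(port)
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local foreign state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or foreign.endswith(suffix):
            states[state] += 1
    return states


if __name__ == "__main__":
    counts = tcp_states_for_port(NETSTAT_SAMPLE, 8457)
    for state, n in sorted(counts.items()):
        print(state, n)
    if counts["ESTABLISHED"] == 0:
        print("no ESTABLISHED connection on port 8457")
```

Feeding it the real netstat output periodically would show whether an ESTABLISHED entry ever appears between the nodes.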
When using sha256, each hastd process fully utilizes one CPU core (100%)
while transferring 70-80 MB/sec per HAST resource (up to 140 MB/sec of
traffic in total for both resources).
When using crc32, each CPU core is at 22% utilization.
When using none as the checksum, CPU usage is under 10%.
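To get a feel for the relative cost of the two checksums on a given machine, something like the sketch below may help. It uses Python's zlib and hashlib rather than hastd's own C code, so the absolute numbers are not comparable to the figures above; only the ratio is of interest. The 131072-byte block size and the helper names are my assumptions.

```python
import hashlib
import time
import zlib


def throughput_mb_per_sec(fn, block, rounds):
    """Run fn(block) `rounds` times and return the processed MB per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn(block)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very coarse clocks.
    return (len(block) * rounds) / (1024 * 1024) / max(elapsed, 1e-9)


def compare_checksums(block_size=131072, rounds=200):
    """Measure single-threaded crc32 and sha256 throughput on a zero-filled block."""
    block = bytes(block_size)
    return {
        "crc32": throughput_mb_per_sec(zlib.crc32, block, rounds),
        "sha256": throughput_mb_per_sec(
            lambda b: hashlib.sha256(b).digest(), block, rounds
        ),
    }


if __name__ == "__main__":
    for name, rate in compare_checksums().items():
        print(f"{name}: {rate:.0f} MB/sec")
```

If the sha256 rate on one core is close to the per-resource transfer rate, the checksum itself is a plausible bottleneck.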
Eventually, after many hours, the communication got corrupted:
May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.
May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive request data: No such file or directory.
May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process exited ungracefully (pid=9827, exitcode=75).
and
May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive reply header: Operation timed out.
May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from 10.2.101.12.
May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send request (Broken pipe): WRITE(99128470016, 131072).
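When correlating the primary and secondary logs, it can help to split the hastd lines into fields. The sketch below is only an illustration: the regular expressions are derived from the log lines above, and reading the two WRITE numbers as byte offset and length is my assumption.

```python
import re

# Pattern inferred from the syslog lines above: hastd[pid]: [resource] (role) message
LOG_RE = re.compile(
    r"hastd\[(?P<pid>\d+)\]: \[(?P<resource>[^\]]+)\] \((?P<role>\w+)\) (?P<message>.*)"
)
# Assumed meaning of the two numbers: byte offset, then length in bytes.
WRITE_RE = re.compile(r"WRITE\((?P<offset>\d+), (?P<length>\d+)\)")


def parse_hastd_line(line):
    """Return a dict with pid, resource, role, message and optional write info."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    write = WRITE_RE.search(record["message"])
    if write:
        record["write"] = (int(write["offset"]), int(write["length"]))
    return record


if __name__ == "__main__":
    sample = (
        "May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send "
        "request (Broken pipe): WRITE(99128470016, 131072)."
    )
    print(parse_hastd_line(sample))
```

Under that reading, the failed request was a 131072-byte (128 KiB) write.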
Daniel