HAST instability
Daniel Kalchev
daniel at digsys.bg
Tue May 31 12:51:18 UTC 2011
On 30.05.11 21:42, Mikolaj Golub wrote:
> DK> One strange thing is that there is never established TCP connection
> DK> between both nodes:
>
> DK> tcp4 0 0 10.2.101.11.48939 10.2.101.12.8457 FIN_WAIT_2
> DK> tcp4 0 1288 10.2.101.11.57008 10.2.101.12.8457 CLOSE_WAIT
> DK> tcp4 0 0 10.2.101.11.46346 10.2.101.12.8457 FIN_WAIT_2
> DK> tcp4 0 90648 10.2.101.11.13916 10.2.101.12.8457 CLOSE_WAIT
> DK> tcp4 0 0 10.2.101.11.8457 *.* LISTEN
>
> It is normal. hastd uses the connections only in one direction so it calls
> shutdown to close unused directions.
So the TCP connections are all so short-lived that I can never see a
single one in the ESTABLISHED state? 10Gbit Ethernet is indeed fast, so this
might well be possible...
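
For reference, the half-close Mikolaj describes can be reproduced on
loopback. The following is only an illustrative sketch in Python, not hastd's
code; the function name half_close_demo and the ephemeral port are
assumptions. After one side calls shutdown on its unused write direction,
that side typically sits in FIN_WAIT_2 and its peer in CLOSE_WAIT for as
long as the connection is in use, so such a connection would not show up as
ESTABLISHED even if it is long-lived.

```python
import socket


def half_close_demo(payload=b"hast-data"):
    """Show that a socket half-closed with shutdown() still carries data one way."""
    # Loopback listener on an ephemeral port (a stand-in for hastd's port 8457).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    try:
        # The receiving side never writes, so it closes its write direction.
        # Its socket moves towards FIN_WAIT_2; the sender's goes to CLOSE_WAIT.
        receiver.shutdown(socket.SHUT_WR)

        # The sender sees end-of-file on the direction that was shut down...
        eof = sender.recv(1)

        # ...but can still send in the other direction.
        sender.sendall(payload)
        sender.shutdown(socket.SHUT_WR)

        # Read until the sender's own end-of-file arrives.
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return eof, b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
        listener.close()
```

Running netstat while such a connection is open should show the same
FIN_WAIT_2 / CLOSE_WAIT pairing as in the output quoted above.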
> I suppose when checksum is enabled the bottleneck is the CPU, the traffic rate is lower and the problem is not triggered.
I was thinking something like this. My later tests seem to suggest that
this gets triggered when the network transfer rate is much higher than the
disk transfer rate.
> "Hash mismatch" message suggests that actually you were using checksum then,
> weren't you?
Yes, this occurs only when checksums are enabled. Happens with both
crc32 and sha256.
> I would like to look at full logs for some rather large period, with several
> cases, from both primary and secondary (and be sure about synchronized time).
I have made sure the clocks are synchronized and am currently running on
freshly rebooted nodes (with two additional SATA drives at each node) --
so far there are some interesting findings, such as hash errors and
disconnects occurring much more frequently now. I will post when a bonnie++
run on the ZFS filesystem on top of the HAST resources finishes.
> Also, it might be worth checking that there is no network packet corruption (some strange things in netstat -di, netstat -s, maybe copying large files via net and comparing checksums).
>
I will post these as well; however, so far I have seen no indication of any
network problems, no interface errors etc. It might also be that the ix
driver is not reporting them, of course.
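
The "copy large files and compare checksums" check suggested above could be
done along these lines. This is a minimal sketch in Python; the helper names
file_sha256 and copies_match are made up for illustration, and it assumes
the copied file has been brought back (or is otherwise readable) on the
machine doing the comparison.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops at the empty read that marks end-of-file.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copies_match(source_path, copied_path):
    """True if both files have identical SHA-256 digests."""
    return file_sha256(source_path) == file_sha256(copied_path)
```

A mismatch after a copy over the 10Gbit link would point at corruption
somewhere on the path, even if the interface counters stay at zero.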
One additional note: while playing with this setup, I tried to simulate the
local disk going away in the hope that HAST would switch to using the remote
disk. Instead of asking someone at the site to pull out the drive, I
just issued the following on the primary:
hastctl role init data0
which resulted in a kernel panic. Unfortunately, there was not enough
dump space for 48GB. I will re-run this with more drives for the
crash dump. Anything you want me to look for in particular? (The kernels
have no KDB compiled in yet.)
Daniel