HAST instability

Tue May 31 12:51:18 UTC 2011

On 30.05.11 21:42, Mikolaj Golub wrote:
>   DK>  One strange thing is that there is never established TCP connection
>   DK>  between both nodes:
>
>   DK>  tcp4       0      0 10.2.101.11.48939      10.2.101.12.8457       FIN_WAIT_2
>   DK>  tcp4       0   1288 10.2.101.11.57008      10.2.101.12.8457       CLOSE_WAIT
>   DK>  tcp4       0      0 10.2.101.11.46346      10.2.101.12.8457       FIN_WAIT_2
>   DK>  tcp4       0  90648 10.2.101.11.13916      10.2.101.12.8457       CLOSE_WAIT
>   DK>  tcp4       0      0 10.2.101.11.8457       *.*                    LISTEN
>
> It is normal. hastd uses the connections only in one direction so it calls
> shutdown to close unused directions.
So the TCP connections are so short-lived that I can never see a
single one in ESTABLISHED state? 10Gbit Ethernet is indeed fast, so this
might well be possible...
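
(A minimal sketch of the half-close described above, using plain Python
sockets on the loopback interface. It is not taken from hastd; the
function name and payload are made up for illustration. One side calls
shutdown() on its sending direction and then only receives. While the
sockets are in that state, netstat would show the side that called
shutdown in FIN_WAIT_2 and its peer in CLOSE_WAIT, not ESTABLISHED,
even though data can still flow one way.)

```python
import socket


def half_close_demo(payload: bytes = b"hast-block") -> bytes:
    """Send payload over a TCP connection that is used in one direction only."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    try:
        # The client will only receive, so it closes its sending direction.
        # Its socket moves towards FIN_WAIT_2, the server's to CLOSE_WAIT.
        client.shutdown(socket.SHUT_WR)

        # The server sees end-of-file on the unused direction...
        assert server.recv(1) == b""

        # ...but can still send in the direction that remains open.
        server.sendall(payload)
        server.shutdown(socket.SHUT_WR)

        chunks = []
        while True:
            data = client.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        client.close()
        server.close()
```
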
> I suppose when checksum is enabled the bottleneck is CPU, the traffic rate is lower and the problem is not triggered.
I was thinking something like this. My later tests seem to suggest that
this gets triggered when the network transfer rate is much higher than
the disk transfer rate.

> "Hash mismatch" message suggests that actually you were using checksum then,
> weren't you?
Yes, this occurs only when checksums are enabled. Happens with both 
crc32 and sha256.
> I would like to look at full logs for some rather large period, with several
> cases, from both primary and secondary (and be sure about synchronized time).
I have made sure clocks are synchronized and am currently running on
freshly rebooted nodes (with two additional SATA drives at each node) --
so far some interesting findings, like I get hash errors and
disconnects much more frequently now. Will post when a bonnie++ run on
the ZFS filesystem on top of the HAST resources finishes.
> Also, it might be worth checking that there is no network packet corruption (some strange things in netstat -di, netstat -s, maybe copying large files via net and comparing checksums).
>
I will post these as well; however, so far there has been no indication
of any network problems, no interface errors etc. It might also be that
the ix driver is not reporting them, of course.
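
(A minimal sketch of the "copy large files and compare checksums" check
suggested above. The helper names are made up for illustration, and
SHA-256 is just one reasonable choice of digest. The idea is to hash the
file on the sending node, hash the copy on the receiving node, and
compare the two values.)

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large test files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original: str, copy: str) -> bool:
    """True if both files hash to the same digest."""
    return file_sha256(original) == file_sha256(copy)
```

A mismatch on a file that was copied over the link, with no errors
reported by the interface, would point at corruption somewhere between
the two nodes.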

One additional note: while playing with this setup, I tried to simulate
the local disk going away in the hope that HAST would switch to using
the remote disk. Instead of asking someone at the site to pull out the
drive, I just issued on the primary

hastctl role init data0

which resulted in a kernel panic. Unfortunately, there was not enough
dump space for 48GB. I will re-run this with more drives for the
crash dump. Anything you want me to look for in particular? (kernels
have no KDB compiled in yet)

Daniel