HAST instability
Mikolaj Golub
trociny at freebsd.org
Tue May 31 14:09:05 UTC 2011
On Tue, 31 May 2011 15:51:07 +0300 Daniel Kalchev wrote:
DK> On 30.05.11 21:42, Mikolaj Golub wrote:
>> DK> One strange thing is that there is never an established TCP connection
>> DK> between the two nodes:
>>
>> DK> tcp4 0 0 10.2.101.11.48939 10.2.101.12.8457 FIN_WAIT_2
>> DK> tcp4 0 1288 10.2.101.11.57008 10.2.101.12.8457 CLOSE_WAIT
>> DK> tcp4 0 0 10.2.101.11.46346 10.2.101.12.8457 FIN_WAIT_2
>> DK> tcp4 0 90648 10.2.101.11.13916 10.2.101.12.8457 CLOSE_WAIT
>> DK> tcp4 0 0 10.2.101.11.8457 *.* LISTEN
>>
This is normal. hastd uses each connection in only one direction, so it calls
shutdown(2) to close the unused direction.
DK> So the TCP connections are all so short-lived that I never see a
DK> single one in the ESTABLISHED state? 10Gbit Ethernet is indeed fast, so
DK> this might well be possible...
No, the connections are persistent; only the one (unused) direction of
communication is shut down. See shutdown(2) for further info.
>> I would like to look at full logs for some rather large period, with several
>> cases, from both primary and secondary (and be sure about synchronized time).
DK> I have made sure the clocks are synchronized and am currently running on freshly rebooted nodes (with two additional SATA drives at each node) --
DK> so far some interesting findings: I get hash errors and
DK> disconnects much more frequently now. Will post when a bonnie++ run on
DK> the ZFS filesystem on top of the HAST resources finishes.
As I wrote privately, it would be nice to see both netstat and hast logs (from
both nodes) for the same rather long period, covering several cases. It would
be good to place them somewhere on the web so others can access them too, as I
will be offline for 7-10 days and will not be able to help you until I am
back.
DK> One additional note: while playing with this setup, I tried to
DK> simulate local disk going away in the hope HAST will switch to using
DK> the remote disk. Instead of asking someone at the site to pull out the
DK> drive, I just issued on the primary
DK> hastctl role init data0
DK> which resulted in a kernel panic. Unfortunately, there was insufficient
DK> dump space for 48GB. I will re-run this again with more drives for the
DK> crash dump. Anything you want me to look for in particular? (kernels
DK> have no KDB compiled in yet)
Well, removing the physical disk (the device /dev/gpt/data0 consumed by hastd
disappears) and switching a resource to the init role (the device
/dev/hast/data0 consumed by the FS disappears) are two different things.
Certainly you should not normally change the resource role (which destroys the
hast device) before unmounting (exporting) the FS.
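For reference, the safe order is to stop all filesystem use of the hast device before demoting the resource. A sketch, assuming a ZFS pool on top of /dev/hast/data0; the pool name "tank" is made up, substitute your own:

```shell
# On the current primary: first stop using /dev/hast/data0,
# e.g. export the ZFS pool built on top of it.
zpool export tank

# Only now is it safe to demote (or tear down) the provider:
hastctl role secondary data0    # hand the resource over to the peer
# hastctl role init data0       # or take the resource out of service

# On the other node, the mirror image:
# hastctl role primary data0
# zpool import tank
```

Switching the role while the pool is still imported yanks /dev/hast/data0 out from under ZFS, which is effectively the panic scenario described above.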
--
Mikolaj Golub