HAST instability
Mikolaj Golub
trociny at freebsd.org
Tue May 31 14:09:05 UTC 2011
On Tue, 31 May 2011 15:51:07 +0300 Daniel Kalchev wrote:
DK> On 30.05.11 21:42, Mikolaj Golub wrote:
>> DK> One strange thing is that there is never an established TCP connection
>> DK> between the two nodes:
>>
>> DK> tcp4 0 0 10.2.101.11.48939 10.2.101.12.8457 FIN_WAIT_2
>> DK> tcp4 0 1288 10.2.101.11.57008 10.2.101.12.8457 CLOSE_WAIT
>> DK> tcp4 0 0 10.2.101.11.46346 10.2.101.12.8457 FIN_WAIT_2
>> DK> tcp4 0 90648 10.2.101.11.13916 10.2.101.12.8457 CLOSE_WAIT
>> DK> tcp4 0 0 10.2.101.11.8457 *.* LISTEN
>>
This is normal. hastd uses each connection in only one direction, so it calls
shutdown(2) to close the unused direction.
DK> So the TCP connections are all so short-lived that I never see a
DK> single one in the ESTABLISHED state? 10Gbit Ethernet is indeed fast, so
DK> this might well be possible...
No, the connections are persistent; only the one (unused) direction of
communication is shut down. See shutdown(2) for further info.
>> I would like to look at full logs for some rather large period, with several
>> cases, from both primary and secondary (and be sure about synchronized time).
DK> I have made sure the clocks are synchronized and am currently running on freshly rebooted nodes (with two additional SATA drives at each node) --
DK> so far some interesting findings: I get hash errors and
DK> disconnects much more frequently now. Will post when a bonnie++ run on
DK> the ZFS filesystem on top of the HAST resources finishes.
As I wrote privately, it would be nice to see both netstat and hast logs (from
both nodes) for the same rather long period, covering several cases. It would
be good to place them somewhere on the web so others can access them too, as I
will be offline for 7-10 days and will not be able to help you until I am
back.
DK> One additional note: while playing with this setup, I tried to
DK> simulate local disk going away in the hope HAST will switch to using
DK> the remote disk. Instead of asking someone at the site to pull out the
DK> drive, I just issued on the primary
DK> hastctl role init data0
DK> which resulted in a kernel panic. Unfortunately, there was insufficient
DK> dump space for 48GB. I will re-run this again with more drives for the
DK> crash dump. Anything you want me to look for in particular? (kernels
DK> have no KDB compiled in yet)
Well, removing the physical disk (the device /dev/gpt/data0 consumed by hastd
disappears) and switching a resource to the init role (the device
/dev/hast/data0 consumed by the FS disappears) are two different things.
Certainly you should not normally change the resource role (which destroys the
hast device) before unmounting (exporting) the FS.
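For reference, the safe order is to stop all filesystem use of the hast device before demoting the resource. A sketch, assuming a ZFS pool on top of /dev/hast/data0; the pool name "tank" is made up, substitute your own:

```shell
# On the current primary: first stop using /dev/hast/data0,
# e.g. export the ZFS pool built on top of it.
zpool export tank

# Only now is it safe to demote (or tear down) the provider:
hastctl role secondary data0    # hand the resource over to the peer
# hastctl role init data0       # or take the resource out of service

# On the other node, the mirror image:
# hastctl role primary data0
# zpool import tank
```

Switching the role while the pool is still imported yanks /dev/hast/data0 out from under ZFS, which is effectively the panic scenario described above.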
--
Mikolaj Golub