HAST instability

Sun May 29 10:14:09 UTC 2011

I am trying to get a basic HAST setup working on 8-stable (as of today). 
Hardware is two Supermicro blades, each with 2x Xeon E5620 processors, 
48GB RAM, integrated LSI2008 controller, two 600GB SAS2 Toshiba drives, 
two Intel gigabit interfaces and two Intel 10Gbit interfaces.

On each of the drives there is a GPT partition intended to be used by 
HAST.  Each host thus has two HAST resources, data0 and data1 
respectively. HAST is run over the 10Gbit interfaces, connected via the 
blade chassis 10Gbit switch.

/etc/hast.conf is

resource data0 {
         on b1a {
                 local /dev/gpt/data0
                 remote 10.2.101.12
         }
         on b1b {
                 local /dev/gpt/data0
                 remote 10.2.101.11
         }
}

resource data1 {
         on b1a {
                 local /dev/gpt/data1
                 remote 10.2.101.12
         }
         on b1b {
                 local /dev/gpt/data1
                 remote 10.2.101.11
         }
}

On top of data0 and data1 I run a ZFS mirror, although this doesn't seem 
to be relevant here.

What I am observing is very jumpy performance, and the two nodes often 
disconnect from each other. On the primary:

May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to receive 
reply header: Socket is not connected.
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to send 
request (Broken pipe): WRITE(60470853632, 131072).
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Disconnected from 
10.2.101.11.
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to write 
synchronization data: Socket is not connected.

On the secondary:

May 29 03:03:14 b1a hastd[28357]: [data1] (secondary) Unable to receive 
request header: RPC version wrong.
May 29 03:03:19 b1a hastd[11659]: [data1] (secondary) Worker process 
exited ungracefully (pid=28357, exitcode=75).
May 29 03:05:31 b1a hastd[35535]: [data0] (secondary) Unable to receive 
request header: RPC version wrong.
May 29 03:05:36 b1a hastd[11659]: [data0] (secondary) Worker process 
exited ungracefully (pid=35535, exitcode=75).
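
To get a rough idea of how often each resource is affected, the events 
could be tallied from syslog with something like the sketch below. The 
function name hast_events is made up for illustration, and the sketch 
assumes the messages are stored one per line (not wrapped as in this 
mail) in a file such as /var/log/messages.

```shell
# Sketch: count hastd disconnects and worker crashes per resource and role.
# Assumes unwrapped syslog lines in the format quoted above.
# hast_events is a hypothetical helper name, not part of HAST.
hast_events() {
    # Keep only "Disconnected from" and "exited ungracefully" lines,
    # reduce each to "<resource> <role>", then count the duplicates.
    sed -nE 's/.*hastd\[[0-9]+\]: \[([^]]+)\] \((primary|secondary)\) (Disconnected from|Worker process exited ungracefully).*/\1 \2/p' "$1" |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}
```

Running hast_events /var/log/messages on each node prints one line per 
resource and role, followed by the number of matching events.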

When it works, the replication rate observed with 'systat -if' is over 
140MB/sec (perhaps limited by the drives' write throughput).

The only reference to these error messages that I found is 
http://lists.freebsd.org/pipermail/freebsd-stable/2010-November/059817.html, 
and that thread indicated that the fix had been committed.

About the only tuning these machines have is 
kern.ipc.nmbclusters=51200, because with the default value the 10Gbit 
interfaces would not work and the system would run out of mbufs anyway.

Has anyone observed something similar? Any ideas how to fix it?

Daniel