Storage 'failover' largely kills FreeBSD 10.x under XenServer?

Wed Sep 20 14:54:26 UTC 2017

--On 20 September 2017 at 12:44:18 +0100 Roger Pau Monné 
<roger.pau at citrix.com> wrote:

>> Is there some 'tuneable' we can set to make the 10.3 boxes more tolerant
>> of the I/O delays that occur during a storage fail over?
>
> Do you know whether the VMs saw the disks disconnecting and then
> connecting again?

I can't see any evidence the drives actually get 'disconnected' from the 
VM's point of view. Plenty of I/O errors - but no "device destroyed" type 
stuff.

I have seen that kind of error logged on our test kit, when we deliberately 
failed non-HA storage, but I don't see it this time.

> Hm, I have the feeling that part of the problem is that in-flight
> requests are basically lost when a disconnect/reconnect happens.

So if a disconnect isn't happening (as appears to be the case) - is there 
any tunable to set the I/O timeout?

'sysctl -a | grep timeout' finds things like:

  kern.cam.ada.default_timeout=30

I might see if that has any effect (from memory - as I'm out of the office 
now - it did seem to be about 30 seconds before the VMs started logging 
I/O related errors to the console).

As it's a pure test setup - I can try adjusting this without fear of 
breaking anything :)
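
A minimal sketch of trying that out. It assumes the guest's disks attach through ada(4); if they show up as xbd devices instead, this knob may well have no effect. The helper name `ada_timeout_line` and the value 120 are purely illustrative, not a recommendation. The commands that touch the running system are left commented out and would need to be run as root on the FreeBSD guest.

```shell
#!/bin/sh
# Hypothetical helper: print a sysctl assignment for the ada(4) command timeout.
# kern.cam.ada.default_timeout is the OID found by 'sysctl -a | grep timeout';
# the value (in seconds) is whatever the caller passes in.
ada_timeout_line() {
    printf 'kern.cam.ada.default_timeout=%s\n' "$1"
}

# On the FreeBSD guest, as root (assumption: the OID is writable at runtime):
#   sysctl kern.cam.ada.default_timeout          # show the current value
#   sysctl "$(ada_timeout_line 120)"             # change it on the running system
#   ada_timeout_line 120 >> /etc/sysctl.conf     # keep it across reboots
```

Re-running the storage failover after changing the value should show whether the roughly 30 second window before the errors moves with it.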

Though I'm open to other suggestions...

fwiw - whose responsibility is it to re-send lost "in flight" data? E.g. if 
a write is 'in flight' when an I/O error occurs in the lower layers of 
XenServer, is it XenServer's responsibility to retry it before giving up, 
or does it just push the error straight back to the VM and expect the VM 
to retry it? [or a bit of both?] - just curious.

-Karl