Storage 'failover' largely kills FreeBSD 10.x under XenServer?

Fri Sep 22 09:14:12 UTC 2017

--On 21 September 2017 15:49 +0100 Karl Pielorz <kpielorz_lst at tdx.co.uk> 
wrote:

>> Are these timeouts coming from Dom0 or from a VM in a DomU?
>
> domU - as above, dom0 grumbles, but generally seems OK about things. dom0
> doesn't do anything silly like invalidate the VM's disks or anything.

I've chased this down in the code. Having briefly looked at
blkfront/blkback, I can see all the mechanisms in place for performing I/O,
but I cannot see any timeouts set anywhere in that code.

I can see the callback that fires when the I/O fails.

It looks like, for the purposes of xbd, I/O requests are just gathered up,
processed, and then fired off to XenServer (i.e. upstream). If they fail,
callbacks are fired and action is taken.
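
As a rough illustration of that flow, here is a toy model in Python. It is
not the blkfront code, and every name in it (ToyFrontend, flaky_backend,
the request strings) is made up for the example. It only shows the shape of
"gather, fire once, report the status via a callback", with no timer and no
retry anywhere:

```python
import errno
from collections import deque


class ToyFrontend:
    """Toy stand-in for a split-driver frontend (hypothetical, not blkfront)."""

    def __init__(self, backend):
        self.backend = backend   # callable: request -> error code (0 = success)
        self.pending = deque()   # requests gathered but not yet sent upstream

    def queue(self, request, on_done):
        # Gather the request together with its completion callback.
        self.pending.append((request, on_done))

    def flush(self):
        # Fire every gathered request at the backend exactly once.
        # No timeout is armed and nothing is retried: whatever status
        # comes back is handed straight to the callback.
        while self.pending:
            request, on_done = self.pending.popleft()
            on_done(request, self.backend(request))


def flaky_backend(request):
    # Hypothetical backend that fails writes, e.g. during a storage failover.
    return errno.EIO if request.startswith("write") else 0


results = []
frontend = ToyFrontend(flaky_backend)
frontend.queue("read blk 10", lambda req, err: results.append((req, err)))
frontend.queue("write blk 11", lambda req, err: results.append((req, err)))
frontend.flush()
print(results)
```

In this model the failed write simply surfaces as an error in the callback;
anything smarter would have to live in whoever supplied that callback.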

But nowhere can I see any timeouts that are either specified, or
specifiable, by FreeBSD - nor can I see (certainly at that level) any I/O
retries in that code.

So,

  - Timeouts may be set by Xen (i.e. outside of FreeBSD's scope)
  - I/O may be retried by 'higher levels' than blkfront/blkback - but I
    can't see where.

It may simply be that I/O from FreeBSD through XenServer is a 'fire and 
forget' process, where FreeBSD has no control over timeouts, and currently 
has no code (at that level) to perform retries.

I'd need to figure out what sits above 'blkfront/blkback' - and whether 
that's likely to do any retries.
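
For what it's worth, this is the sort of thing I'd be looking for in a
higher layer. It is a minimal sketch in Python of a retry wrapper, purely
hypothetical: submit_with_retries and make_failing_path are invented names,
and I have not found anything like this above blkfront yet. A real
implementation would presumably also wait between attempts.

```python
import errno


def submit_with_retries(submit, request, max_retries=3):
    """Call submit(request) until it returns 0 or the retries run out.

    Returns (last_error, attempts_made). Hypothetical helper for illustration.
    """
    attempts = 0
    while True:
        err = submit(request)
        attempts += 1
        if err == 0 or attempts > max_retries:
            return err, attempts


def make_failing_path(failures):
    """Build a fake lower layer that returns EIO for the first `failures` calls."""
    remaining = [failures]

    def submit(request):
        if remaining[0] > 0:
            remaining[0] -= 1
            return errno.EIO
        return 0

    return submit


# A short outage (2 failed attempts) is ridden out by the retries.
short_outage = submit_with_retries(make_failing_path(2), "write blk 11")

# A long outage exhausts the retries and the error is passed up.
long_outage = submit_with_retries(make_failing_path(10), "write blk 11")

print(short_outage, long_outage)
```

If nothing above blkfront behaves like this, a single failed request during
a failover would be passed straight up as an I/O error.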

It's certainly not CAM running the storage - so those CAM timeout/retry
sysctl values are irrelevant here.

More study needed, and maybe a quick post to -hackers to see what lies
above blkfront/blkback.

-Kp