add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Tue Dec 12 16:36:59 UTC 2017

On 26/11/2017 00:17, Warner Losh wrote:
> 
> 
> On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon <avg at freebsd.org> wrote:
> 
> 
>     Before anything else, I would like to say that I got the impression that we speak
>     from such different angles that we either don't understand each other's words or,
>     even worse, misinterpret them.
> 
> 
> I understand what you are suggesting. Don't take my disagreement with your
> proposal as willful misinterpretation. You are proposing something that's a
> quick hack.

Very true.

> Maybe a useful one, but it's still problematical because it has the
> upper layers telling the lower layers what to do (don't do your retry), rather
> than what service to provide (I prefer a fast error exit over every effort to
> recover the data).

Also true.

> And it also does it by overloading the meaning of EIO, which
> has real problems which you've not been open to listening to, I assume due to your
> narrow use case apparently blinding you to the bigger picture issues with that
> route.

Quite likely.

> However, there's a way forward which I think will solve these objections.
> First, designate that I/O that fails due to short-circuiting the normal recovery
> process returns ETIMEDOUT. The I/O stack currently doesn't use this at all (it
> was introduced for the network side of things). This is a general catch-all for
> an I/O that we complete before the lower layers have given it the maximum amount
> of effort to recover the data, at the user request. Next, don't use a flag.
> Instead add a 32-bit field that is called bio_qos for quality of service hints and
> another 32-bit field for bio_qos_param. This allows us to pass down specific
> quality of service desires from the filesystem to the lower layers. The
> parameter will be unused in your proposal. BIO_QOS_FAIL_EARLY may be a good name
> for a value to set it to (at the moment, just use 1). We'll assign the other QOS
> values later for other things. It would allow us to implement the other sorts of
> QoS things I talked about as well.

That's a very interesting and workable suggestion.
I will try to work on it.
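
A minimal user-space sketch of how I read the suggestion, to make the
moving parts concrete. Only bio_qos, bio_qos_param, BIO_QOS_FAIL_EARLY
(value 1) and ETIMEDOUT come from the proposal above. The struct mock_bio
type, BIO_QOS_NONE and both helper functions are hypothetical stand-ins
and not existing FreeBSD code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hint values. Only FAIL_EARLY (1) is from the proposal; NONE is assumed. */
#define BIO_QOS_NONE        0u
#define BIO_QOS_FAIL_EARLY  1u

/* Stand-in carrying the two proposed fields; NOT the real struct bio. */
struct mock_bio {
	uint32_t bio_qos;        /* quality-of-service hint */
	uint32_t bio_qos_param;  /* hint parameter, unused by FAIL_EARLY */
	int      bio_error;      /* completion status */
};

/* Clone path: copy the hints verbatim without interpreting them. */
static void
mock_bio_clone_qos(struct mock_bio *dst, const struct mock_bio *src)
{
	dst->bio_qos = src->bio_qos;
	dst->bio_qos_param = src->bio_qos_param;
}

/*
 * Hypothetical driver-side handling of a failed command.
 * hw_error is the error the device reported; retries_left is how many
 * recovery attempts the driver would normally still make.
 * Returns 1 if the driver should retry, 0 if the bio is completed.
 */
static int
mock_handle_failure(struct mock_bio *bp, int hw_error, int retries_left)
{
	assert(hw_error != 0);
	if (retries_left > 0) {
		if (bp->bio_qos == BIO_QOS_FAIL_EARLY) {
			/* Recovery was cut short at the caller's request. */
			bp->bio_error = ETIMEDOUT;
			return (0);
		}
		return (1);
	}
	/* Recovery was exhausted: report the real error. */
	bp->bio_error = hw_error;
	return (0);
}
```

The point of the sketch is that ETIMEDOUT is only returned when recovery
was skipped, so the consumer can tell "gave up early" apart from a
genuine EIO after full recovery.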

> As for B_FAILFAST, it's quite unlike what you're proposing, except in one
> incidental detail. It's a complicated state machine that the sd driver in
> solaris implemented. It's an entire protocol. When the device gets errors, it
> goes into this failfast state machine. The state machine makes a determination
> that the errors are indicators the device is GONE, at least for the moment, and
> it will fail I/Os in various ways from there. Any new I/Os that are submitted
> will be failed (there's conditional behavior here: depending on a global setting
> it's either all I/O or just B_FAILFAST I/O).

Yeah, I realized that B_FAILFAST was quite different from the first impression
that I got from its name.
Thank you for doing and sharing your analysis of how it actually works.

> ZFS appears to set this bit for its
> discovery code only, when a device not being there would significantly delay
> things.

I think that ZFS sets the bit for all 'first-attempt' I/O.
It's the various retries / recovery where this bit is not set.

> Anyway, when the device returns (basically an I/O gets through or maybe
> some other event happens), the driver exits this mode and returns to normal
> operation. It appears to be designed not for the use case that you described,
> but rather for a drive that's failing all over the place so that any pending
> I/Os get out of the way quickly. Your use case is only superficially similar to
> that use case, so the Solaris / Illumos experiences are mildly interesting, but
> due to the differences not a strong argument for doing this. This facility in
> Illumos is interesting, but would require significantly more retooling of the
> lower I/O layers in FreeBSD to implement fully. Plus Illumos (or maybe just
> Solaris) has a daemon that looks at failures to manage them at a higher level, which
> might make for a better user experience for FreeBSD, so that's something that
> needs to be weighed as well.

Okay.
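
For my own understanding, here is a toy model of the failfast behaviour as
described above. It is not derived from the Solaris / Illumos sources: the
struct, the function names and the use of EIO as the immediate failure
code are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy per-device state; all names here are hypothetical. */
struct ff_dev {
	bool in_failfast;  /* device is currently presumed gone */
	bool fail_all_io;  /* models the global setting: all I/O vs. flagged I/O only */
};

/* The driver decided that recent errors mean the device is gone. */
static void
ff_enter(struct ff_dev *d)
{
	d->in_failfast = true;
}

/* An I/O got through again: leave failfast mode. */
static void
ff_io_succeeded(struct ff_dev *d)
{
	d->in_failfast = false;
}

/*
 * Returns 0 if a newly submitted I/O should be queued normally, or an
 * errno if it should be failed at once (EIO is a placeholder here).
 */
static int
ff_submit(const struct ff_dev *d, bool io_has_failfast)
{
	if (d->in_failfast && (d->fail_all_io || io_has_failfast))
		return (EIO);
	return (0);
}
```

Even this toy version shows the difference from my proposal: the decision
is driven by device state that the driver tracks, not by a per-request
"do not retry" instruction.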

> We've known for some time that HDD retry algorithms take a long time. Same is
> true of some SSD or NVMe algorithms, but not all. The other objection I have to
> 'noretry' naming is that it bakes the current observed HDD behavior and
> recovery into the API. This is undesirable as other storage technologies have
> retry mechanisms that happen quite quickly (and sometimes in the drive itself).
> The cutoff between fast and slow recovery is device specific, as are the methods
> used. For example, there are new proposals out in NVMe (and maybe T10/T13 land) to
> have new types of READ commands that specify the quality of service you expect,
> including providing some sort of deadline hint to clip how much effort is
> expended in trying to recover the data. It would be nice to design a mechanism
> that allows us to start using these commands when drives are available with
> them, and possibly using timeouts to allow for a faster abort. Most of your HDD
> I/O will complete within maybe ~150ms, with a long tail out to maybe as long as
> ~400ms. It might be desirable to set a policy that says 'don't let any I/Os
> remain in the device longer than a second' and use this mechanism to enforce
> that. Or don't let any I/Os last more than 20x the most recent median I/O time.
> A single bit is insufficiently expressive to allow these sorts of things, which
> is another reason for my objection to your proposal. With the QOS fields being
> independent, the clone routines just copy them and make no value judgement on
> them.

I now agree with this.
Thank you for the detailed explanation.
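
The two example policies above (a fixed one-second limit, or 20x the most
recent median I/O time) could be computed roughly as follows. This is only
a sketch: the function names, the sample window and the choice to combine
both policies by taking the smaller value are my assumptions. The
resulting number is the kind of thing that could travel in bio_qos_param.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SAMPLES 64  /* assumed size of the recent-latency window */

/* qsort comparator for uint32_t values, ascending. */
static int
cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

	return ((x > y) - (x < y));
}

/* Median of n recent latencies in ms (upper median when n is even). */
static uint32_t
median_ms(const uint32_t *samples, size_t n)
{
	uint32_t tmp[MAX_SAMPLES];

	assert(n > 0 && n <= MAX_SAMPLES);
	memcpy(tmp, samples, n * sizeof(tmp[0]));
	qsort(tmp, n, sizeof(tmp[0]), cmp_u32);
	return (tmp[n / 2]);
}

/* Deadline in ms: mult times the median, clipped to cap_ms. */
static uint32_t
io_deadline_ms(const uint32_t *samples, size_t n, uint32_t mult,
    uint32_t cap_ms)
{
	uint64_t d = (uint64_t)median_ms(samples, n) * mult;

	return (d < cap_ms ? (uint32_t)d : cap_ms);
}
```

With recent latencies of 8, 12, 10, 150 and 9 ms the median is 10 ms, so a
20x multiplier with a 1000 ms cap gives a 200 ms deadline. A single flag
could not express either number.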

> So, those are my problems with your proposal, and also some hopefully useful
> ways to move forward. I've chatted with others for years about introducing QoS
> things into the I/O stack, so I know most of the above won't be too contentious
> (though ETIMEDOUT I haven't socialized, so that may be an area of concern for
> people).

Thank you!

-- 
Andriy Gapon