add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Sat Nov 25 22:17:51 UTC 2017

On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon <avg at freebsd.org> wrote:

>
> Before anything else, I would like to say that I got an impression that we
> speak from so different angles that we either don't understand each other's
> words or, even worse, misinterpret them.

I understand what you are suggesting. Don't take my disagreement with your
proposal as willful misinterpretation. You are proposing something that's a
quick hack. Maybe a useful one, but it's still problematic because it has
the upper layers telling the lower layers what to do (don't do your retry),
rather than what service to provide (I prefer a fast error exit over
every effort to recover the data). It also does so by overloading the
meaning of EIO, which has real problems that you've not been open to
hearing about, I assume because your narrow use case is blinding you to
the bigger-picture issues with that route.

However, there's a way forward which I think will solve these
objections. First, designate that I/O that fails due to short-circuiting
the normal recovery process returns ETIMEDOUT. The I/O stack currently
doesn't use this at all (it was introduced for the network side of things).
This is a general catch-all for an I/O that we complete, at the user's
request, before the lower layers have given it the maximum amount of effort
to recover the data. Next, don't use a flag. Instead add a 32-bit field
called bio_qos for quality of service hints and another 32-bit field called
bio_qos_param. This allows us to pass down specific quality of service
desires from the filesystem to the lower layers. The parameter will be
unused in your proposal. BIO_QOS_FAIL_EARLY may be a good name for a value
to set it to (at the moment, just use 1). We'll assign the other QOS values
later for other things. It would allow us to implement the other sorts of
QoS things I talked about as well.
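
To make the shape of this concrete, here is a rough user-space sketch, not
real FreeBSD code. The struct is a toy stand-in for struct bio, and
BIO_QOS_NONE, toy_bio and toy_complete are names made up for illustration;
only bio_qos, bio_qos_param, BIO_QOS_FAIL_EARLY (value 1) and ETIMEDOUT come
from the suggestion above.

```c
#include <errno.h>
#include <stdint.h>

/* Value 1 follows "at the moment, just use 1"; BIO_QOS_NONE is an assumption. */
#define BIO_QOS_NONE        0u
#define BIO_QOS_FAIL_EARLY  1u

/* Toy stand-in for struct bio: only the fields this sketch needs. */
struct toy_bio {
    uint32_t bio_qos;        /* proposed quality-of-service hint */
    uint32_t bio_qos_param;  /* proposed parameter (unused by FAIL_EARLY) */
    int      bio_error;      /* errno-style completion status */
};

/*
 * Hypothetical lower-layer completion path.
 * first_try_ok:           the initial attempt succeeded.
 * recovery_would_succeed: what the full (slow) recovery would have produced.
 * Returns the errno-style status that is also stored in bp->bio_error.
 */
int
toy_complete(struct toy_bio *bp, int first_try_ok, int recovery_would_succeed)
{
    if (first_try_ok) {
        bp->bio_error = 0;
    } else if (bp->bio_qos == BIO_QOS_FAIL_EARLY) {
        /* Recovery was short-circuited on request: report ETIMEDOUT, not EIO. */
        bp->bio_error = ETIMEDOUT;
    } else {
        /* Normal path: full recovery effort; EIO only if that fails too. */
        bp->bio_error = recovery_would_succeed ? 0 : EIO;
    }
    return (bp->bio_error);
}
```

The point of the distinct error code is that the upper layer can tell "the
lower layers gave up early because I asked" (ETIMEDOUT) apart from "the
lower layers tried everything and the data is gone" (EIO).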

As for B_FAILFAST, it's quite unlike what you're proposing, except in one
incidental detail. It's a complicated state machine that the sd driver in
Solaris implemented. It's an entire protocol. When the device gets errors,
it goes into this failfast state machine. The state machine makes a
determination that the errors are indicators the device is GONE, at least
for the moment, and it will fail I/Os in various ways from there. Any new
I/Os that are submitted will be failed (there's conditional behavior here:
depending on a global setting it's either all I/O or just B_FAILFAST I/O).
ZFS appears to set this bit for its discovery code only, when a device not
being there would significantly delay things. Anyway, when the device
returns (basically an I/O gets through or maybe some other event happens),
the driver exits this mode and returns to normal operation. It appears to
be designed not for the use case that you described, but rather for a drive
that's failing all over the place so that any pending I/Os get out of the
way quickly. Your use case is only superficially similar to that use case,
so the Solaris / Illumos experiences are mildly interesting, but due to the
differences not a strong argument for doing this. This facility in Illumos
is interesting, but would require significantly more retooling of the lower
I/O layers in FreeBSD to implement fully. Plus Illumos (or maybe just
Solaris) has a daemon that looks at failures to manage them at a higher level,
which might make for a better user experience for FreeBSD, so that's
something that needs to be weighed as well.
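
For illustration only, the failfast behavior described above can be modeled
as a tiny state machine. This is a toy built from the description in this
mail, not the Solaris sd driver's code: the ff_* names, the consecutive
error threshold used to decide the device is gone, and the use of EIO for
rejected I/O are all assumptions.

```c
#include <errno.h>
#include <stdbool.h>

enum ff_state { FF_NORMAL, FF_FAILFAST };

struct ff_dev {
    enum ff_state state;
    int  consecutive_errors;
    int  error_threshold;  /* assumption: N errors in a row means "gone" */
    bool fail_all_io;      /* models the global setting: all I/O or flagged only */
};

/*
 * Called when a new I/O is submitted. Returns 0 if the I/O may go to the
 * device, or EIO (an assumed error code) if it is failed immediately.
 */
int
ff_submit(const struct ff_dev *d, bool io_is_failfast)
{
    if (d->state == FF_FAILFAST && (d->fail_all_io || io_is_failfast))
        return (EIO);
    return (0);
}

/* Called when the device completes an I/O, successfully or not. */
void
ff_complete(struct ff_dev *d, bool ok)
{
    if (ok) {
        /* An I/O got through: the device is back, resume normal operation. */
        d->consecutive_errors = 0;
        d->state = FF_NORMAL;
        return;
    }
    if (++d->consecutive_errors >= d->error_threshold)
        d->state = FF_FAILFAST;
}
```

Even in this reduced form the decision is made per device and from the
device's error history, which is the main difference from a per-request
"skip the retries" hint.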

We've known for some time that HDD retry algorithms take a long time. Same
is true of some SSD or NVMe algorithms, but not all. The other objection I
have to the 'noretry' naming is that it bakes the currently observed HDD
behavior and recovery into the API. This is undesirable as other storage
technologies have retry mechanisms that happen quite quickly (and sometimes
in the drive itself). The cutoff between fast and slow recovery is device
specific, as are the methods used. For example, there are new proposals out
in NVMe (and maybe T10/T13 land) to have new types of READ commands that
specify the quality of service you expect, including providing some sort of
deadline hint to clip how much effort is expended in trying to recover the
data. It would be nice to design a mechanism that allows us to start using
these commands when drives are available with them, and possibly using
timeouts to allow for a faster abort. Most of your HDD I/O will complete
within maybe ~150ms, with a long tail out to maybe as long as ~400ms. It
might be desirable to set a policy that says 'don't let any I/Os remain in
the device longer than a second' and use this mechanism to enforce that. Or
don't let any I/Os last more than 20x the most recent median I/O time. A
single bit is insufficiently expressive to allow these sorts of things,
which is another reason for my objection to your proposal. With the QOS
fields being independent, the clone routines just copy them and make no
value judgement on them.
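
As a sketch of the two example policies above (a fixed one-second cap, or
20x the recent median I/O time), something like the following would do.
All names here are hypothetical, latencies are assumed to be tracked in
microseconds, and the fallback to the fixed cap when there is no history
is my assumption.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SAMPLES        64
#define FIXED_LIMIT_US     1000000u  /* "no I/O in the device longer than a second" */
#define MEDIAN_MULTIPLIER  20u       /* "20x the most recent median I/O time" */

static int
cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

    return ((x > y) - (x < y));
}

/*
 * Median of the first min(n, MAX_SAMPLES) latencies, in microseconds.
 * Returns 0 when there are no samples; picks the upper median for even n.
 */
uint32_t
median_latency_us(const uint32_t *samples, size_t n)
{
    uint32_t tmp[MAX_SAMPLES];

    if (n == 0)
        return (0);
    if (n > MAX_SAMPLES)
        n = MAX_SAMPLES;
    memcpy(tmp, samples, n * sizeof(tmp[0]));
    qsort(tmp, n, sizeof(tmp[0]), cmp_u32);
    return (tmp[n / 2]);
}

/* Per-I/O time limit under the median policy; fixed cap if no history yet. */
uint64_t
median_limit_us(const uint32_t *samples, size_t n)
{
    uint32_t m = median_latency_us(samples, n);

    if (m == 0)
        return (FIXED_LIMIT_US);
    return ((uint64_t)m * MEDIAN_MULTIPLIER);
}

/* Nonzero if an I/O has been outstanding longer than the limit allows. */
int
io_overdue(uint64_t elapsed_us, uint64_t limit_us)
{
    return (elapsed_us > limit_us);
}
```

A limit computed this way is the kind of value that could travel in
bio_qos_param, which a single flag bit has no room for.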

So, those are my problems with your proposal, and also some hopefully
useful ways to move forward. I've chatted with others for years about
introducing QoS things into the I/O stack, so I know most of the above
won't be too contentious (though ETIMEDOUT I haven't socialized, so that
may be an area of concern for people).

Warner