add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Andriy Gapon avg at
Sat Nov 25 16:58:38 UTC 2017

On 25/11/2017 18:25, Warner Losh wrote:
> On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon <avg at> wrote:
>     On 24/11/2017 16:57, Scott Long wrote:
>     >
>     >
>     >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon <avg at> wrote:
>     >>
>     >> On 24/11/2017 15:08, Warner Losh wrote:
>     >>>
>     >>>
>     >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at> wrote:
>     >>>
>     >>>
>     >>>    Anyone interested is welcome to join the review.
>     >>>
>     >>>
>     >>> I think it's a really bad idea. It introduces a 'one-size-fits-all'
>     >>> notion of QoS that seems misguided. It conflates a shorter timeout
>     >>> with 'don't retry'. And why is retrying bad? It seems more a notion
>     >>> of 'fail fast' or some other concept. There are so many other ways
>     >>> you'd want to use it. And it uses the same return code (EIO) to mean
>     >>> something new: it generally means 'the lower layers have retried
>     >>> this, and it failed; do not submit it again, as it will not succeed',
>     >>> which this conflates with 'I gave it a half-assed attempt, and that
>     >>> failed, but resubmission might work'. This breaks a number of
>     >>> assumptions in the BUF/BIO layer, as well as in parts of CAM, even
>     >>> more than they are broken now.
>     >>>
>     >>> So let's step back a bit: what problem is it trying to solve?
>     >>
>     >> A simple example: I have a mirror and I issue a read to one of its
>     >> members.  Let's assume there is some trouble with that particular
>     >> block on that particular disk.  The disk may spend a lot of time
>     >> trying to read it and would still fail.  With the current defaults I
>     >> would wait 5x that time to finally get the error back.  Then I go to
>     >> another mirror member and get my data from there.
>     >
>     > There are many RAID stacks that already solve this problem by having
>     > a policy of always reading all disk members for every transaction and
>     > throwing away the sub-transactions that arrive late.  It’s not a
>     > policy that is always desired, but it serves a useful purpose for
>     > low-latency needs.
>     That's another possible and useful strategy.
>     >> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
>     >> read, get the error back sooner, and try the other disk sooner.  Only
>     >> if I know that there are no other copies to try would I use the
>     >> normal read with all the retrying.
>     >>
>     >
>     > I agree with Warner that what you are proposing is not correct.  It weakens the
>     > contract between the disk layer and the upper layers, making it less clear who is
>     > responsible for retries and less clear what “EIO” means.  That contract is already
>     > weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>     > are working on a plan to fix that.
>     Well...  I do realize now that there is some problem in this area, both you and
>     Warner mentioned it.  But knowing that it exists is not the same as knowing what
>     it is :-)
>     I understand that it could be rather complex and not easy to describe in a short
>     email...
>     But then, this flag is optional, it's off by default, and no one is
>     forced to use it.  If it's used only by ZFS, then it would not be
>     horrible.
> Except that it isn't the same flag as what Solaris has (its B_FAILFAST does
> something different: it isn't about limiting retries but about failing ALL the
> queued I/O for a unit, not just trying one retry), and the problems that it
> solves are quite rare. And if you return a different errno, then the EIO
> contract is still fulfilled. 

Yes, it isn't the same.
I think that the illumos flag does even more.
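
To make the intended use a bit more concrete, here is a rough sketch (not
the actual patch in the review) of how a vdev_geom read path could apply
such a flag.  The helper name and its have_other_copies argument are made
up for illustration; only the GEOM calls and the struct bio fields are the
existing ones, and BIO_NORETRY is assumed to be a new bit in bio_flags:

    /*
     * Sketch only: assumes the review adds a BIO_NORETRY bit to
     * bp->bio_flags.  Function name and arguments are illustrative.
     */
    #include <sys/param.h>
    #include <sys/bio.h>
    #include <geom/geom.h>

    static void
    vdev_geom_read_one(struct g_consumer *cp, void *data, off_t offset,
        off_t length, void (*done)(struct bio *), bool have_other_copies)
    {
        struct bio *bp;

        bp = g_alloc_bio();
        bp->bio_cmd = BIO_READ;
        bp->bio_data = data;
        bp->bio_offset = offset;
        bp->bio_length = length;
        bp->bio_done = done;
        /*
         * Ask the lower layers to fail fast only when the read can
         * still be satisfied from another mirror member; a lone copy
         * keeps the full retry behaviour.
         */
        if (have_other_copies)
            bp->bio_flags |= BIO_NORETRY;
        g_io_request(bp, cp);
    }

On a failure the completion callback would immediately go to another
member, or reissue the same request without the flag if there is nowhere
else to go.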

>     Unless it makes things very hard for the infrastructure.
>     But I am circling back to not knowing what problem(s) you and Warner are
>     planning to fix.
> The middle layers of the I/O system are a bit fragile in the face of I/O errors.
> We're fixing that.

What are the middle layers?

> Of course, you still haven't articulated why this approach would be better

Better than what?

> nor
> show any numbers as to how it makes things better.

By now, I have.  See my reply to Scott's email.
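
The driver side mentioned in the subject ("implement support in ata_da")
is also small: the periph driver just maps the flag to the retry count it
already passes to CAM when it builds the command.  A rough sketch, again
assuming the flag lives in bio_flags; start_ccb and rw_op stand in for the
surrounding adastart() code, and the unmapped-bio handling is elided:

    /* Sketch only: inside adastart(), after the bio has been dequeued. */
    uint32_t retries;

    retries = (bp->bio_flags & BIO_NORETRY) ? 0 : ada_retry_count;
    cam_fill_ataio(&start_ccb->ataio,
        retries,            /* no driver-level retries when asked to fail fast */
        adadone,
        rw_op,              /* CAM_DIR_IN for the read case */
        0,
        bp->bio_data,
        bp->bio_bcount,
        ada_default_timeout * 1000);

If that single attempt fails, the error goes back to GEOM right away
instead of after ada_retry_count retries.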

Andriy Gapon
