add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Fri Nov 24 17:21:01 UTC 2017

On 24/11/2017 18:33, Warner Losh wrote:
> 
> 
> On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon <avg at freebsd.org> wrote:
> 
>     On 24/11/2017 15:08, Warner Losh wrote:
>     >
>     >
>     > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at freebsd.org> wrote:
>     >
>     >
>     >     https://reviews.freebsd.org/D13224
>     >
>     >     Anyone interested is welcome to join the review.
>     >
>     >
>     > I think it's a really bad idea. It introduces a 'one-size-fits-all' notion of
>     > QoS that seems misguided. It conflates a shorter timeout with don't retry. And
>     > why is retrying bad? It seems more a notion of 'fail fast' or some other concept.
>     > There are so many other ways you'd want to use it. And it uses the same return
>     > code (EIO) to mean something new. It has generally meant 'The lower layers have
>     > retried this, and it failed, do not submit it again as it will not succeed'; now
>     > it would also mean 'I gave it a half-assed attempt, and that failed, but
>     > resubmission might work'.
>     > This breaks a number of assumptions in the BUF/BIO layer as well as parts of CAM
>     > even more than they are broken now.
>     >
>     > So let's step back a bit: what problem is it trying to solve?
> 
>     A simple example.  I have a mirror, I issue a read to one of its members.  Let's
>     assume there is some trouble with that particular block on that particular disk.
>     The disk may spend a lot of time trying to read it and would still fail.  With
>     the current defaults I would wait 5x that time to finally get the error back.
>     Then I go to another mirror member and get my data from there.
>     IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get
>     the error back sooner and try the other disk sooner.  Only if I know that there
>     are no other copies to try, then I would use the normal read with all the
>     retrying.
> 
> 
> It sounds like you are optimizing the wrong thing and taking an overly
> simplistic view of quality of service.
> First, failing blocks on a disk is fairly rare. Do you really want to optimize
> for that case?

If it can be done without any harm to the sunny day scenario, then why not?
I think that 'robustness' is the word here, not 'optimization'.

> Second, you're really saying 'If you can't read it fast, fail' since we only
> control the software side of read retry.

Am I?
That's not what I wanted to say, really.  I just wanted to say, if this I/O
fails, don't retry it, leave it to me.
This is very simple, simplistic as you say, but I like simple.
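
As a rough illustration of that request (a userland sketch, not the code under
review in D13224): the lower layer normally retries a failed request a fixed
number of times, and a no-retry flag simply makes it report the first failure.
Every name here (sk_bio, sk_disk, SK_BIO_NORETRY, sk_submit) is hypothetical,
and the retry count of 4 is an assumption picked to match the "5x" figure
quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real struct bio or the flag from D13224. */
#define SK_BIO_NORETRY 0x01
#define SK_RETRY_COUNT 4	/* assumed default: 1 attempt + 4 retries */

struct sk_bio {
	int flags;
	int attempts;		/* set by sk_submit(): attempts actually made */
};

/* Simulated device: the next fails_left attempts fail with EIO. */
struct sk_disk {
	int fails_left;
};

static int
sk_disk_try(struct sk_disk *d)
{
	if (d->fails_left > 0) {
		d->fails_left--;
		return (EIO);
	}
	return (0);
}

/*
 * Issue one request.  Without SK_BIO_NORETRY a failure is retried up to
 * SK_RETRY_COUNT times; with it, the first failure goes back to the caller.
 */
int
sk_submit(struct sk_disk *d, struct sk_bio *bp)
{
	int error, retries;

	assert(d != NULL && bp != NULL);
	retries = (bp->flags & SK_BIO_NORETRY) ? 0 : SK_RETRY_COUNT;
	bp->attempts = 0;
	do {
		error = sk_disk_try(d);
		bp->attempts++;
	} while (error != 0 && retries-- > 0);
	return (error);
}
```

Against a persistently bad block the flagged request comes back after one
attempt instead of five; what to do next is left to the caller.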

> There's new op codes being proposed
> that say 'read or fail within Xms' which is really what you want: if it's taking
> too long on disk A you want to move to disk B. The notion here was we'd return
> EAGAIN (or some other error) if it failed after Xms, and maybe do some emulation
> in software for drives that don't support this. You'd tweak this number to
> control performance. You're likely to get a much bigger performance win all the
> time by scheduling I/O to drives that have the best recent latency.

ZFS already makes some latency-based decisions.
The things that you describe are very interesting, but they are for the future.
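
For reference, the kind of latency-driven choice described in the quoted text
could be sketched roughly as below.  This is a userland illustration only, not
what the ZFS mirror code does; the names (sk_lat, sk_lat_update, sk_lat_pick)
and the 1/8 smoothing weight are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-member latency tracker. */
struct sk_lat {
	uint64_t ewma_us;	/* smoothed latency in usec; 0 = no samples yet */
};

/* Exponentially weighted moving average: new = 7/8 * old + 1/8 * sample. */
void
sk_lat_update(struct sk_lat *l, uint64_t sample_us)
{
	assert(l != NULL);
	if (l->ewma_us == 0)
		l->ewma_us = sample_us;
	else
		l->ewma_us = (l->ewma_us * 7 + sample_us) / 8;
}

/*
 * Pick the member with the lowest smoothed latency.  Members that have no
 * samples yet (ewma_us == 0) sort first, so they get tried early.
 */
size_t
sk_lat_pick(const struct sk_lat *l, size_t n)
{
	size_t best = 0;

	assert(l != NULL && n > 0);
	for (size_t i = 1; i < n; i++)
		if (l[i].ewma_us < l[best].ewma_us)
			best = i;
	return (best);
}
```

A single slow read moves the average only part of the way, so one outlier
shifts the preference without permanently blacklisting a disk.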

> Third, do you have numbers that show this is actually a win?

I do not have any numbers right now.
What kind of numbers would you like?  What kind of scenarios?

> This is a terrible
> thing from an architectural view.

You have said this several times, but unfortunately you haven't explained it yet.

> Absent numbers that show it's a big win, I'm
> very hesitant to say OK.
> 
> Fourth, there's a large number of places in the stack today that need to
> communicate their I/O is more urgent, and we don't have any good way to
> communicate even that simple concept down the stack.

That's unfortunate, but my proposal has quite little to do with I/O
scheduling, priorities, etc.

> Finally, the only places that ZFS uses the TRYHARDER flag are for things like
> the super block if I'm reading the code right. It doesn't do it for normal I/O.

Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in the
same way as ZIO_FLAG_TRYHARD.

> There's no code to cope with what would happen if all the copies of a block
> couldn't be read with the NORETRY flag. One of them might contain the data.

ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
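
The fallback can be pictured as two passes over the copies: a fail-fast pass,
then a pass with full retries if no copy could be read.  The following is a
simplified userland sketch under that assumption, not the actual ZFS zio
pipeline; sk_member, sk_member_read and sk_mirror_read are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mirror member: the next fails_left attempts fail. */
struct sk_member {
	int fails_left;
	int attempts;		/* total attempts made against this member */
};

/* Try to read from one member, giving up after max_attempts failures. */
static int
sk_member_read(struct sk_member *m, int max_attempts)
{
	for (int i = 0; i < max_attempts; i++) {
		m->attempts++;
		if (m->fails_left == 0)
			return (0);
		m->fails_left--;
	}
	return (EIO);
}

/*
 * Read one block from an n-way mirror.  Returns the index of the member
 * that supplied the data, or -1 if every copy failed in both passes.
 */
int
sk_mirror_read(struct sk_member *m, size_t n, int full_attempts)
{
	assert(m != NULL && full_attempts >= 1);

	/* Pass 1: a single attempt per member (the no-retry behaviour). */
	for (size_t i = 0; i < n; i++)
		if (sk_member_read(&m[i], 1) == 0)
			return ((int)i);

	/* Pass 2: no copy was readable, so reissue with full retries. */
	for (size_t i = 0; i < n; i++)
		if (sk_member_read(&m[i], full_attempts) == 0)
			return ((int)i);

	return (-1);
}
```

In the common case the second pass never runs; a copy that only needed a
couple of retries is still recovered when every fail-fast read has failed.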

-- 
Andriy Gapon