add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Sat Nov 25 16:36:31 UTC 2017

On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon <avg at freebsd.org> wrote:

> On 24/11/2017 18:33, Warner Losh wrote:
> >
> >
> > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon <avg at freebsd.org> wrote:
> >
> >     On 24/11/2017 15:08, Warner Losh wrote:
> >     >
> >     >
> >     > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at freebsd.org> wrote:
> >     >
> >     >
> >     >     https://reviews.freebsd.org/D13224
> >     >
> >     >     Anyone interested is welcome to join the review.
> >     >
> >     >
> >     > I think it's a really bad idea. It introduces a 'one-size-fits-all'
> >     > notion of QoS that seems misguided. It conflates a shorter timeout
> >     > with don't retry. And why is retrying bad? It seems more a notion of
> >     > 'fail fast' or some other concept. There are so many other ways you'd
> >     > want to use it. And it uses the same return code (EIO) to mean
> >     > something new. EIO has generally meant 'The lower layers have retried
> >     > this, and it failed, do not submit it again as it will not succeed'.
> >     > This change mixes that with 'I gave it a half-assed attempt, and that
> >     > failed, but resubmission might work'. This breaks a number of
> >     > assumptions in the BUF/BIO layer as well as parts of CAM even more
> >     > than they are broken now.
> >     >
> >     > So let's step back a bit: what problem is it trying to solve?
> >
> >     A simple example.  I have a mirror, I issue a read to one of its
> >     members.  Let's assume there is some trouble with that particular
> >     block on that particular disk.  The disk may spend a lot of time
> >     trying to read it and would still fail.  With the current defaults I
> >     would wait 5x that time to finally get the error back.  Then I go to
> >     another mirror member and get my data from there.
> >     IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
> >     read, get the error back sooner and try the other disk sooner.  Only
> >     if I know that there are no other copies to try, then I would use the
> >     normal read with all the retrying.
> >
> >
> > It sounds like you are optimizing the wrong thing and taking an overly
> > simplistic view of quality of service.
> > First, failing blocks on a disk are fairly rare. Do you really want to
> > optimize for that case?
>
> If it can be done without any harm to the sunny day scenario, then why not?
> I think that 'robustness' is the word here, not 'optimization'.

I fail to see how it is a robustness issue. You've not made that case. You
want the I/O to fail fast so you can give another disk a shot sooner.
That's optimization.
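
For readers following along, the read policy from the quoted mirror example can be modelled roughly as below. This is only a user-space sketch under stated assumptions, not code from D13224: `read_mirror` and the `(block, retry)` callable interface are hypothetical names invented here, and `retry=False` merely stands in for a read issued with BIO_NORETRY. The second pass with full retries is also an assumption about how "no other copies to try" would be handled.

```python
import errno


def read_mirror(members, block):
    """Model of the proposed policy: fail-fast reads first, full retries last.

    `members` is a list of hypothetical callables taking (block, retry) that
    return bytes on success or raise OSError on failure.
    """
    # First pass: ask each member for the block without lower-layer retries.
    for read in members:
        try:
            return read(block, retry=False)
        except OSError:
            continue  # move on to the next copy right away

    # Second pass: no copy answered quickly, so allow the normal retries.
    last_error = None
    for read in members:
        try:
            return read(block, retry=True)
        except OSError as err:
            last_error = err

    # Every copy failed even with retries (or there were no members).
    raise last_error if last_error else OSError(errno.EIO, "no mirror members")
```

Whether this ordering actually lowers latency in practice is exactly the open question in this thread; the sketch only makes the proposed control flow concrete.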

> > Second, you're really saying 'If you can't read it fast, fail' since we
> > only control the software side of read retry.
>
> Am I?
> That's not what I wanted to say, really.  I just wanted to say, if this I/O
> fails, don't retry it, leave it to me.
> This is very simple, simplistic as you say, but I like simple.

Right. Simple doesn't make it right. In fact, simple often makes it wrong.
We have big issues with the nvd device today because it mindlessly queues
all the trim requests to the NVMe device w/o collapsing them, resulting in
horrible performance.

> > There are new op codes being proposed that say 'read or fail within Xms',
> > which is really what you want: if it's taking too long on disk A you want
> > to move to disk B. The notion here was we'd return EAGAIN (or some other
> > error) if it failed after Xms, and maybe do some emulation in software for
> > drives that don't support this. You'd tweak this number to control
> > performance. You're likely to get a much bigger performance win all the
> > time by scheduling I/O to drives that have the best recent latency.
>
> ZFS already does some latency based decisions.
> The things that you describe are very interesting, but they are for the
> future.
>
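
The 'read or fail within Xms' idea with software emulation, as described in the quoted text, might look something like the following user-space model. It is a sketch under assumptions: `read_with_deadline`, `attempt`, `max_tries` and `clock` are hypothetical names, and a software emulation like this can only check the deadline between attempts, since it cannot interrupt a drive that is busy with its own internal retries. EAGAIN is used for 'ran out of time, resubmission might work' and EIO is kept for 'retries exhausted, do not resubmit'.

```python
import errno
import time


def read_with_deadline(attempt, deadline_ms, max_tries=5, clock=time.monotonic):
    """Retry `attempt` until it returns data, the deadline passes, or tries run out.

    `attempt` is a hypothetical callable returning bytes on success and
    None on a failed read.
    """
    start = clock()
    for _ in range(max_tries):
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms >= deadline_ms:
            # Out of time: the caller may go to another disk or resubmit.
            raise OSError(errno.EAGAIN, "deadline exceeded")
        data = attempt()
        if data is not None:
            return data
    # All retries were used within the deadline: this is the final answer.
    raise OSError(errno.EIO, "retries exhausted")
```

Keeping the two error codes distinct is the point of the sketch: the upper layer can tell a timing failure apart from a definitive one.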
> > Third, do you have numbers that show this is actually a win?
>
> I do not have any numbers right now.
> What kind of numbers would you like?  What kind of scenarios?

The usual kind. How is latency for I/O improved when you have a disk with a
few failing sectors that take a long time to read (which isn't a given:
some sectors fail fast)? What happens when you have a failed disk? And so
on. How does this compare with the current system?

Basically, how do you know this will really make things better and isn't
some kind of 'feel good' thing about 'doing something clever' about the
problem that may actually make things worse?

> > This is a terrible thing from an architectural view.
>
> You have said this several times, but unfortunately you haven't explained
> it yet.

I have explained it. You weren't listening.

1. It breaks the EIO contract that's currently in place.
2. It presumes to know what kind of retries should be done at the upper
layers where today we have a system that's more black and white. The upper
layers don't have the same info the low layers have to know whether to try
another drive, or just retry this one.
3. It assumes that retries are the source of latency in the system. They
aren't necessarily.
4. It assumes retries are necessarily slow: they may be, they might not be.
It all depends on the drive (on SSDs, repeated I/O is often faster than the
actual I/O).
5. It's just one bit when you really need more complex nuances to get good
QoE out of the I/O system. Retries are an incidental detail that's not that
important, while latency is what you care most about minimizing. You
wouldn't care if I tried to read the data 20 times if it got the result
faster than going to a different drive.
6. It's putting the wrong kind of specific hints into the mix.
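
To make the latency point in item 5 concrete, here is a small user-space model of picking the mirror member with the best recent latency. It is only an illustration under assumptions: `MemberStats`, `pick_member` and the smoothing factor `alpha` are hypothetical names chosen here, and an exponentially weighted moving average is just one plausible way to define 'recent latency'.

```python
class MemberStats:
    """Tracks a smoothed recent latency for one mirror member."""

    def __init__(self, name, alpha=0.2):
        self.name = name
        self.alpha = alpha    # weight given to the newest sample
        self.ewma_ms = None   # no samples recorded yet

    def record(self, latency_ms):
        """Fold one completed I/O's latency into the moving average."""
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = (self.alpha * latency_ms
                            + (1 - self.alpha) * self.ewma_ms)


def pick_member(stats):
    """Return the member with the lowest smoothed latency.

    Members without any samples sort first so that they get measured.
    """
    return min(stats, key=lambda s: -1.0 if s.ewma_ms is None else s.ewma_ms)
```

A scheduler built this way never needs to know how many retries a drive performed; a drive that retries quickly still looks fast, and one that stalls looks slow.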

> > Absent numbers that show it's a big win, I'm very hesitant to say OK.
> >
> > Fourth, there's a large number of places in the stack today that need to
> > communicate their I/O is more urgent, and we don't have any good way to
> > communicate even that simple concept down the stack.
>
> That's unfortunate, but my proposal has quite little to do with I/O
> scheduling, priorities, etc.

Except it does. It dictates error recovery policy which is I/O scheduling.

> > Finally, the only places that ZFS uses the TRYHARDER flag are for things
> > like the super block if I'm reading the code right. It doesn't do it for
> > normal I/O.
>
> Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in
> the same way as ZIO_FLAG_TRYHARD.
>
> > There's no code to cope with what would happen if all the copies of a
> > block couldn't be read with the NORETRY flag. One of them might contain
> > the data.
>
> ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
>

Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when
an I/O fails'. It doesn't control retries at all in Solaris. It's a
different concept entirely, and one badly thought out.

Warner