add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Andriy Gapon
avg at FreeBSD.org
Fri Nov 24 17:21:01 UTC 2017
On 24/11/2017 18:33, Warner Losh wrote:
>
>
> On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon <avg at freebsd.org> wrote:
>
> On 24/11/2017 15:08, Warner Losh wrote:
> >
> >
> > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at freebsd.org> wrote:
> >
> >
> > https://reviews.freebsd.org/D13224
> >
> > Anyone interested is welcome to join the review.
> >
> >
> > I think it's a really bad idea. It introduces a 'one-size-fits-all' notion of
> > QoS that seems misguided. It conflates a shorter timeout with don't retry. And
> > why is retrying bad? It seems more a notion of 'fail fast' or some other concept.
> > There's so many other ways you'd want to use it. And it uses the same return
> > code (EIO) to mean something new. It conflates the usual meaning, 'The lower layers have
> > retried this, and it failed, do not submit it again as it will not succeed', with
> > 'I gave it a half-assed attempt, and that failed, but resubmission might work'.
> > This breaks a number of assumptions in the BUF/BIO layer as well as parts of CAM
> > even more than they are broken now.
> >
> > So let's step back a bit: what problem is it trying to solve?
>
> A simple example. I have a mirror, I issue a read to one of its members. Let's
> assume there is some trouble with that particular block on that particular disk.
> The disk may spend a lot of time trying to read it and would still fail. With
> the current defaults I would wait 5x that time to finally get the error back.
> Then I go to another mirror member and get my data from there.
> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read, get
> the error back sooner and try the other disk sooner. Only if I know that there
> are no other copies to try, then I would use the normal read with all the
> retrying.
>
>
> It sounds like you are optimizing the wrong thing and taking an overly
> simplistic view of quality of service.
> First, failing blocks on a disk are fairly rare. Do you really want to optimize
> for that case?
If it can be done without any harm to the sunny day scenario, then why not?
I think that 'robustness' is the word here, not 'optimization'.
> Second, you're really saying 'If you can't read it fast, fail' since we only
> control the software side of read retry.
Am I?
That's not what I wanted to say, really. I just wanted to say, if this I/O
fails, don't retry it, leave it to me.
This is very simple, simplistic as you say, but I like simple.
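
To make the mirror example above concrete, here is a toy model of the timing, written in Python rather than kernel C. It is only a sketch, not the code from D13224: `Disk`, `read`, `mirror_read` and the constant `ATTEMPTS_WITH_RETRY` are hypothetical names, the value 5 is simply taken from the "5x" figure in the example, and the per-attempt times are made up.

```python
from dataclasses import dataclass

# Assumption: a failing read is attempted 5 times in total with retries
# enabled, matching the "5x" figure in the mirror example.
ATTEMPTS_WITH_RETRY = 5


@dataclass
class Disk:
    name: str
    attempt_time: float  # seconds one read attempt takes on this disk
    block_is_bad: bool   # True if the block can never be read from this disk


def read(disk, noretry):
    """Return (ok, elapsed) for one read request sent to `disk`."""
    if not disk.block_is_bad:
        return True, disk.attempt_time
    attempts = 1 if noretry else ATTEMPTS_WITH_RETRY
    return False, attempts * disk.attempt_time


def mirror_read(disks, fail_fast):
    """Try mirror members in order.

    Returns (name of the disk that served the read, total elapsed time),
    or (None, total elapsed time) if every member failed.
    """
    total = 0.0
    for i, disk in enumerate(disks):
        last_copy = i == len(disks) - 1
        # Only the last remaining copy gets the normal read with retries.
        ok, elapsed = read(disk, noretry=fail_fast and not last_copy)
        total += elapsed
        if ok:
            return disk.name, total
    return None, total
```

With one member that spends 2 seconds per attempt on a bad block and a healthy member that answers in 10 ms, `mirror_read(disks, fail_fast=False)` spends about 10 seconds before the good copy is read, while `fail_fast=True` spends about 2 seconds. The sunny-day path, where the first member is healthy, costs the same in both modes.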
> There are new opcodes being proposed
> that say 'read or fail within Xms' which is really what you want: if it's taking
> too long on disk A you want to move to disk B. The notion here was we'd return
> EAGAIN (or some other error) if it failed after Xms, and maybe do some emulation
> in software for drives that don't support this. You'd tweak this number to
> control performance. You're likely to get a much bigger performance win all the
> time by scheduling I/O to drives that have the best recent latency.
ZFS already makes some latency-based decisions.
The things that you describe are very interesting, but they are for the future.
> Third, do you have numbers that show this is actually a win?
I do not have any numbers right now.
What kind of numbers would you like? What kind of scenarios?
> This is a terrible
> thing from an architectural view.
You have said this several times, but unfortunately you haven't explained it yet.
> Absent numbers that show it's a big win, I'm
> very hesitant to say OK.
>
> Fourth, there's a large number of places in the stack today that need to
> communicate that their I/O is more urgent, and we don't have any good way to
> communicate even that simple concept down the stack.
That's unfortunate, but my proposal has very little to do with I/O
scheduling, priorities, etc.
> Finally, the only places that ZFS uses the TRYHARDER flag are for things like
> the super block if I'm reading the code right. It doesn't do it for normal I/O.
Right. But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in the
same way as ZIO_FLAG_TRYHARD.
> There's no code to cope with what would happen if all the copies of a block
> couldn't be read with the NORETRY flag. One of them might contain the data.
ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
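
The fallback I have in mind can be sketched as two passes over the copies. This is a toy Python model, not ZFS code: `read_with_fallback` and `read_fn` are hypothetical names, and mapping the second pass to ZIO_FLAG_IO_RETRY (with BIO_NORETRY no longer set) is an assumption of the sketch.

```python
def read_with_fallback(copies, read_fn):
    """Read one block from any of `copies`.

    `read_fn(copy, noretry)` is a caller-supplied function (hypothetical)
    that returns the data, or None if the read failed.

    Returns (data, calls), where `calls` lists every (copy, noretry)
    attempt in the order it was made.
    """
    calls = []
    # Pass 1 (noretry=True): ask every copy to fail fast.
    # Pass 2 (noretry=False): nothing answered quickly, so re-issue the
    # reads with the normal retry behaviour before giving up.
    for noretry in (True, False):
        for copy in copies:
            calls.append((copy, noretry))
            data = read_fn(copy, noretry)
            if data is not None:
                return data, calls
    return None, calls
```

So a copy that is readable only after retries is still found on the second pass; the error is reported to the caller only when the full-effort reads have failed on every copy as well.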
--
Andriy Gapon