add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Fri Nov 24 16:33:58 UTC 2017

On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon <avg at freebsd.org> wrote:

> On 24/11/2017 15:08, Warner Losh wrote:
> >
> >
> > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at freebsd.org
> > <mailto:avg at freebsd.org>> wrote:
> >
> >
> >     https://reviews.freebsd.org/D13224 <https://reviews.freebsd.org/
> D13224>
> >
> >     Anyone interested is welcome to join the review.
> >
> >
> > I think it's a really bad idea. It introduces a 'one-size-fits-all'
> notion of
> > QoS that seems misguided. It conflates a shorter timeout with don't
> retry. And
> > why is retrying bad? It seems more a notion of 'fail fast' or so other
> concept.
> > There's so many other ways you'd want to use it. And it uses the same
> return
> > code (EIO) to mean something new. It's generally meant 'The lower layers
> have
> > retried this, and it failed, do not submit it again as it will not
> succeed' with
> > 'I gave it a half-assed attempt, and that failed, but resubmission might
> work'.
> > This breaks a number of assumptions in the BUF/BIO layer as well as
> parts of CAM
> > even more than they are broken now.
> >
> > So let's step back a bit: what problem is it trying to solve?
>
> A simple example.  I have a mirror, I issue a read to one of its members.
> Let's
> assume there is some trouble with that particular block on that particular
> disk.
>  The disk may spend a lot of time trying to read it and would still fail.
> With
> the current defaults I would wait 5x that time to finally get the error
> back.
> Then I go to another mirror member and get my data from there.
> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read,
> get
> the error back sooner and try the other disk sooner.  Only if I know that
> there
> are no other copies to try, then I would use the normal read with all the
> retrying.
>

It sounds like you are optimizing the wrong thing and taking an overly
simplistic view of quality of service.

First, failing blocks on a disk is fairly rare. Do you really want to
optimize for that case?

Second, you're really saying 'If you can't read it fast, fail" since we
only control the software side of read retry. There's new op codes being
proposed that say 'read or fail within Xms' which is really what you want:
if it's taking too long on disk A you want to move to disk B. The notion
here was we'd return EAGAIN (or some other error) if it failed after Xms,
and maybe do some emulation in software for drives that don't support this.
You'd tweak this number to control performance. You're likely to get a much
bigger performance win all the time by scheduling I/O to drives that have
the best recent latency.

Third, do you have numbers that show this is actually a win? This is a
terrible thing from an architectural view. Absent numbers that show it's a
big win, I'm very hesitant to say OK.

Forth, there's a large number of places in the stack today that need to
communicate their I/O is more urgent, and we don't have any good way to
communicate even that simple concept down the stack.

Finally, the only places that ZFS uses the TRYHARDER flag are for things
like the super block if I'm reading the code right. It doesn't do it for
normal I/O. There's no code to cope with what would happen if all the
copies of a block couldn't be read with the NORETRY flag. One of them might
contain the data.

Warner