add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Andriy Gapon avg at
Sat Nov 25 17:41:04 UTC 2017

Before anything else, I would like to say that I got an impression that we speak
from so different angles that we either don't understand each other's words or,
even worse, misinterpret them.

On 25/11/2017 18:36, Warner Losh wrote:
> On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon <avg at
> <mailto:avg at>> wrote:
>     On 24/11/2017 18:33, Warner Losh wrote:
>     >
>     >
>     > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon <avg at <mailto:avg at>
>     > <mailto:avg at <mailto:avg at>>> wrote:
>     >
>     >     On 24/11/2017 15:08, Warner Losh wrote:
>     >     >
>     >     >
>     >     > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at <mailto:avg at> <mailto:avg at
>     <mailto:avg at>>
>     >     > <mailto:avg at <mailto:avg at> <mailto:avg at
>     <mailto:avg at>>>> wrote:
>     >     >
>     >     >
>     >     > <>
>     >     <
>     <>> <
>     <>
>     >     < <>>>
>     >     >
>     >     >     Anyone interested is welcome to join the review.
>     >     >
>     >     >
>     >     > I think it's a really bad idea. It introduces a 'one-size-fits-all'
>     notion of
>     >     > QoS that seems misguided. It conflates a shorter timeout with don't
>     retry. And
>     >     > why is retrying bad? It seems more a notion of 'fail fast' or so
>     other concept.
>     >     > There's so many other ways you'd want to use it. And it uses the
>     same return
>     >     > code (EIO) to mean something new. It's generally meant 'The lower
>     layers have
>     >     > retried this, and it failed, do not submit it again as it will not
>     succeed' with
>     >     > 'I gave it a half-assed attempt, and that failed, but resubmission
>     might work'.
>     >     > This breaks a number of assumptions in the BUF/BIO layer as well as
>     parts of CAM
>     >     > even more than they are broken now.
>     >     >
>     >     > So let's step back a bit: what problem is it trying to solve?
>     >
>     >     A simple example.  I have a mirror, I issue a read to one of its
>     members.  Let's
>     >     assume there is some trouble with that particular block on that
>     particular disk.
>     >      The disk may spend a lot of time trying to read it and would still
>     fail.  With
>     >     the current defaults I would wait 5x that time to finally get the
>     error back.
>     >     Then I go to another mirror member and get my data from there.
>     >     IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
>     read, get
>     >     the error back sooner and try the other disk sooner.  Only if I know
>     that there
>     >     are no other copies to try, then I would use the normal read with all the
>     >     retrying.
>     >
>     >
>     > It sounds like you are optimizing the wrong thing and taking an overly
>     > simplistic view of quality of service.
>     > First, failing blocks on a disk is fairly rare. Do you really want to optimize
>     > for that case?
>     If it can be done without any harm to the sunny day scenario, then why not?
>     I think that 'robustness' is the word here, not 'optimization'.
> I fail to see how it is a robustness issue. You've not made that case. You want
> the I/O to fail fast so you can give another disk a shot sooner. That's
> optimization.

Then you can call a protection against denial-of-service an optimization too.
You want to do things faster, right?

>     > Second, you're really saying 'If you can't read it fast, fail" since we only
>     > control the software side of read retry.
>     Am I?
>     That's not what I wanted to say, really.  I just wanted to say, if this I/O
>     fails, don't retry it, leave it to me.
>     This is very simple, simplistic as you say, but I like simple.
> Right. Simple doesn't make it right. In fact, simple often makes it wrong.

I agree.  The same applies to complex well.  Let's stop at this.

> We
> have big issues with the nvd device today because it's mindlessly queues all the
> trim requests to the NVMe device w/o collapsing them, resulting in horrible
> performance.
>     > There's new op codes being proposed
>     > that say 'read or fail within Xms' which is really what you want: if it's taking
>     > too long on disk A you want to move to disk B. The notion here was we'd return
>     > EAGAIN (or some other error) if it failed after Xms, and maybe do some emulation
>     > in software for drives that don't support this. You'd tweak this number to
>     > control performance. You're likely to get a much bigger performance win all the
>     > time by scheduling I/O to drives that have the best recent latency.
>     ZFS already does some latency based decisions.
>     The things that you describe are very interesting, but they are for the future.
>     > Third, do you have numbers that show this is actually a win?
>     I do not have any numbers right now.
>     What kind of numbers would you like?  What kind of scenarios?
> The usual kind. How is latency for I/O improved when you have a disk with a few
> failing sectors that take a long time to read (which isn't a given: some sectors
> fail fast).

Today I gave an example of how four retries added about 9 seconds of additional
delay.  I think that that is significant.

> What happens when you have a failed disk? etc. How does this compare
> with the current system.

I haven't done such an experiment.  I guess it depends on how exactly the disk
fails.  There is a big difference between a disk dropping a link and a disk
turning into a black hole.

> Basically, how do you know this will really make things better and isn't some
> kind of 'feel good' thing about 'doing something clever' about the problem that
> may actually make things worse.
>     > This is a terrible
>     > thing from an architectural view.
>     You have said this several times, but unfortunately you haven't explained it
>     yet.
> I have explained it. You weren't listening.

This is the first time I see the below list or anything like it.

> 1. It breaks the EIO contract that's currently in place.

This needs further explanation.

> 2. It presumes to know what kind of retries should be done at the upper layers
> where today we have a system that's more black and white.

I don't understand this argument.  If your upper level code does not know how to
do retries, then it should not concern itself with that and should not use the flag.

> You don't know the
> same info the low layers have to know whether to try another drive, or just
> retry this one.

Eh?  Either we have different definitions of upper and lower layers or I don't
understand how lower layers (e.g. CAM) can know about another drive.

> 3. It assumes that retries are the source of latency in the system. they aren't
> necessarily.

I am not assuming that at all for the general case.

> 4. It assumes retries are necessarily slow: they may be, they might not be. All
> depends on the drive (SSDs repeated I/O are often faster than actual I/O).

Of course.  But X plus epsilon is always greater than X.  And we know than in
many practical cases epsilon can be rather large.

> 5. It's just one bit when you really need more complex nuances to get good QoE
> out of the I/O system. Retries is an incidental detail that's not that
> important, while latency is what you care most about minimizing. You wouldn't
> care if I tried to read the data 20 times if it got the result faster than going
> to a different drive.

That's a good point.  But then again, it's the upper layers that have a better
chance of predicting this kind of thing.  That is, if I know that my backup
storage is extremely slow, then I will allow the fast primary storage do all
retries it wants to do.  It's not CAM nor scsi_da nor a specific SIM that can
make those decisions.  It's an issuer of the I/O request [or an intermediate
geom that encapsulates that knowledge and effectively acts as an issuer of I/O-s
to the lower geoms].

> 6. It's putting the wrong kind of specific hints into the mix.

This needs further explanation.

>     > Absent numbers that show it's a big win, I'm
>     > very hesitant to say OK.
>     >
>     > Forth, there's a large number of places in the stack today that need to
>     > communicate their I/O is more urgent, and we don't have any good way to
>     > communicate even that simple concept down the stack.
>     That's unfortunately, but my proposal has quite little to do with I/O
>     scheduling, priorities, etc.
> Except it does. It dictates error recovery policy which is I/O scheduling.
>     > Finally, the only places that ZFS uses the TRYHARDER flag are for things like
>     > the super block if I'm reading the code right. It doesn't do it for normal I/O.
>     Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in the
>     same way as ZIO_FLAG_TRYHARD.
>     > There's no code to cope with what would happen if all the copies of a block
>     > couldn't be read with the NORETRY flag. One of them might contain the data.
>     ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
> Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when an
> I/O fails' It doesn't control retries at all in Solaris. It's a different
> concept entirely, and one badly thought out.

I think that it does control retries.
And it does even more. My understanding is that bio-s with B_FAILFAST can be
failed immediately in the situation roughly equivalent to a CAM devq (or simq)
being frozen.

Andriy Gapon

More information about the freebsd-fs mailing list