add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
avg at FreeBSD.org
Sat Nov 25 17:41:04 UTC 2017
Before anything else, I would like to say that I have the impression that we are
speaking from such different angles that we either don't understand each other's
words or, even worse, misinterpret them.
On 25/11/2017 18:36, Warner Losh wrote:
> On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon <avg at freebsd.org> wrote:
> On 24/11/2017 18:33, Warner Losh wrote:
> > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon <avg at freebsd.org> wrote:
> > On 24/11/2017 15:08, Warner Losh wrote:
> > >
> > >
> > > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at freebsd.org> wrote:
> > >
> > >
> > > https://reviews.freebsd.org/D13224
> > >
> > > Anyone interested is welcome to join the review.
> > >
> > >
> > > I think it's a really bad idea. It introduces a 'one-size-fits-all' notion
> > > of QoS that seems misguided. It conflates a shorter timeout with don't
> > > retry. And why is retrying bad? It seems more a notion of 'fail fast' or
> > > some other concept. There's so many other ways you'd want to use it. And
> > > it uses the same return code (EIO) to mean something new. It's generally
> > > meant 'The lower layers have retried this, and it failed, do not submit it
> > > again as it will not succeed' rather than 'I gave it a half-assed attempt,
> > > and that failed, but resubmission might work'. This breaks a number of
> > > assumptions in the BUF/BIO layer as well as parts of CAM, even more than
> > > they are broken now.
> > >
> > > So let's step back a bit: what problem is it trying to solve?
> > A simple example. I have a mirror, I issue a read to one of its members.
> > Let's assume there is some trouble with that particular block on that
> > particular disk. The disk may spend a lot of time trying to read it and
> > would still fail. With the current defaults I would wait 5x that time to
> > finally get the error back. Then I go to another mirror member and get my
> > data from there.
> > IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read,
> > get the error back sooner and try the other disk sooner. Only if I know
> > that there are no other copies to try, then I would use the normal read
> > with all the retrying.
> > It sounds like you are optimizing the wrong thing and taking an overly
> > simplistic view of quality of service.
> > First, failing blocks on a disk is fairly rare. Do you really want to optimize
> > for that case?
> If it can be done without any harm to the sunny day scenario, then why not?
> I think that 'robustness' is the word here, not 'optimization'.
> I fail to see how it is a robustness issue. You've not made that case. You want
> the I/O to fail fast so you can give another disk a shot sooner. That's
Then you could call protection against denial of service an optimization too.
You want to do things faster, right?
> > Second, you're really saying 'If you can't read it fast, fail" since we only
> > control the software side of read retry.
> Am I?
> That's not what I wanted to say, really. I just wanted to say, if this I/O
> fails, don't retry it, leave it to me.
> This is very simple, simplistic as you say, but I like simple.
> Right. Simple doesn't make it right. In fact, simple often makes it wrong.
I agree. The same applies to complex as well. Let's stop at this.
> have big issues with the nvd device today because it mindlessly queues all the
> trim requests to the NVMe device w/o collapsing them, resulting in horrible
> > There's new op codes being proposed
> > that say 'read or fail within Xms' which is really what you want: if it's taking
> > too long on disk A you want to move to disk B. The notion here was we'd return
> > EAGAIN (or some other error) if it failed after Xms, and maybe do some emulation
> > in software for drives that don't support this. You'd tweak this number to
> > control performance. You're likely to get a much bigger performance win all the
> > time by scheduling I/O to drives that have the best recent latency.
> ZFS already does some latency based decisions.
> The things that you describe are very interesting, but they are for the future.
> > Third, do you have numbers that show this is actually a win?
> I do not have any numbers right now.
> What kind of numbers would you like? What kind of scenarios?
> The usual kind. How is latency for I/O improved when you have a disk with a few
> failing sectors that take a long time to read (which isn't a given: some sectors
> fail fast).
Today I gave an example of how four retries added about 9 seconds of additional
delay. I think that that is significant.
> What happens when you have a failed disk? etc. How does this compare
> with the current system.
I haven't done such an experiment. I guess it depends on how exactly the disk
fails. There is a big difference between a disk dropping a link and a disk
turning into a black hole.
> Basically, how do you know this will really make things better and isn't some
> kind of 'feel good' thing about 'doing something clever' about the problem that
> may actually make things worse.
> > This is a terrible
> > thing from an architectural view.
> You have said this several times, but unfortunately you haven't explained it
> I have explained it. You weren't listening.
This is the first time I have seen the list below, or anything like it.
> 1. It breaks the EIO contract that's currently in place.
This needs further explanation.
> 2. It presumes to know what kind of retries should be done at the upper layers
> where today we have a system that's more black and white.
I don't understand this argument. If your upper level code does not know how to
do retries, then it should not concern itself with that and should not use the flag.
> You don't know the
> same info the low layers have to know whether to try another drive, or just
> retry this one.
Eh? Either we have different definitions of upper and lower layers or I don't
understand how lower layers (e.g. CAM) can know about another drive.
> 3. It assumes that retries are the source of latency in the system. They aren't.
I am not assuming that at all for the general case.
> 4. It assumes retries are necessarily slow: they may be, they might not be. All
> depends on the drive (SSDs repeated I/O are often faster than actual I/O).
Of course. But X plus epsilon is always greater than X. And we know that in
many practical cases epsilon can be rather large.
> 5. It's just one bit when you really need more complex nuances to get good QoE
> out of the I/O system. Retries is an incidental detail that's not that
> important, while latency is what you care most about minimizing. You wouldn't
> care if I tried to read the data 20 times if it got the result faster than going
> to a different drive.
That's a good point. But then again, it's the upper layers that have a better
chance of predicting this kind of thing. That is, if I know that my backup
storage is extremely slow, then I will allow the fast primary storage to do
all the retries it wants. It's not CAM nor scsi_da nor a specific SIM that
can make those decisions. It's the issuer of the I/O request [or an
intermediate geom that encapsulates that knowledge and effectively acts as an
issuer of I/Os to the lower geoms].
> 6. It's putting the wrong kind of specific hints into the mix.
This needs further explanation.
> > Absent numbers that show it's a big win, I'm
> > very hesitant to say OK.
> > Fourth, there's a large number of places in the stack today that need to
> > communicate their I/O is more urgent, and we don't have any good way to
> > communicate even that simple concept down the stack.
> That's unfortunate, but my proposal has quite little to do with I/O
> scheduling, priorities, etc.
> Except it does. It dictates error recovery policy which is I/O scheduling.
> > Finally, the only places that ZFS uses the TRYHARDER flag are for things like
> > the super block if I'm reading the code right. It doesn't do it for normal I/O.
> Right. But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in the
> same way as ZIO_FLAG_TRYHARD.
> > There's no code to cope with what would happen if all the copies of a block
> > couldn't be read with the NORETRY flag. One of them might contain the data.
> ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
> Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when an
> I/O fails' It doesn't control retries at all in Solaris. It's a different
> concept entirely, and one badly thought out.
I think that it does control retries.
And it does even more. My understanding is that bio-s with B_FAILFAST can be
failed immediately in the situation roughly equivalent to a CAM devq (or simq)