add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Sat Nov 25 17:57:40 UTC 2017

On Sat, Nov 25, 2017 at 10:36 AM, Andriy Gapon <avg at freebsd.org> wrote:

>
> Timestamp of the first error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took additional 9 seconds to finally produce EIO.
> That disk is a part of a ZFS mirror.  If the request was failed after the
> first
> attempt, then ZFS would be able to get the data from a good disk much
> sooner.
>
> And don't take me wrong, I do NOT want CAM or GEOM to make that decision by
> itself.  I want ZFS to be able to tell the lower layers when they should
> try as
> hard as they normally do and when they should report an I/O error as soon
> as it
> happens without any retries.

Let's walk through this. You see that it takes a long time to fail an I/O.
Perfectly reasonable observation. There's two reasons for this. One is that
the disks take a while to make an attempt to get the data. The second is
that the system has a global policy that's biased towards 'recover the
data' over 'fail fast'. These can be fixed by reducing the timeouts, or
lowing the read-retry count for a given drive or globally as a policy
decision made by the system administrator.

It may be perfectly reasonable to ask the lower layers to 'fail fast' and
have either a hard or a soft deadline on the I/O for a subset of I/O. A
hard deadline would return ETIMEDOUT or something when it's passed and
cancel the I/O. This gives better determinism in the system, but some
systems can't cancel just 1 I/O (like SATA drives), so we have to flush the
whole queue. If we get a lot of these, performance suffers. However, for
some class of drives, you know that if it doesn't succeed in 1s after you
submit it to the drive, it's unlikely to complete successfully and it's
worth the performance hit on a drive that's already acting up.

You could have a soft timeout, which says 'don't do any additional action
after X time has elapsed and you get word about this I/O. This is similar
to the hard timeout, but just stops retrying after the deadline has passed.
This scenario is better on the other users of the drive, assuming that the
read-recovery operations aren't starving them. It's also easier to
implement, but has worse worst case performance characteristics.

You aren't asking to limit retries. You're really asking to the I/O
subsystem to limit, where it can, the amount of time on an I/O so you can
try another one. You're means to doing this is to tell it not to retry.
That's the wrong means. It shouldn't be listed in the API that it's a 'NO
RETRY' request. It should be a QoS request flag: fail fast.

Part of why I'm being so difficult is that you don't understand this and
are proposing a horrible API. It should have a different name. The other
reason is that I  absolutely do not want to overload EIO. You must return a
different error back up the stack. You've show no interest in this past,
which is also a needless argument. We've given good reasons, and you've
poopooed them with bad arguments.

Also, this isn't the data I asked for. I know things can fail slowly. I was
asking for how it would improve systems running like this. As in "I
implemented it, and was able to fail over to this other drive faster" or
something like that. Actual drive failure scenarios vary widely, and
optimizing for this one failure is unwise. It may be the right
optimization, but it may not. There's lots of tricky edges in this space.

Warner