add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Mon Nov 27 15:29:22 UTC 2017

> On Nov 25, 2017, at 10:36 AM, Andriy Gapon <avg at FreeBSD.org> wrote:
> 
> On 25/11/2017 12:54, Scott Long wrote:
>> 
>> 
>>> On Nov 24, 2017, at 10:17 AM, Andriy Gapon <avg at FreeBSD.org> wrote:
>>> 
>>> 
>>>>> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get
>>>>> the error back sooner and try the other disk sooner.  Only if I know that there
>>>>> are no other copies to try, then I would use the normal read with all the retrying.
>>>>> 
>>>> 
>>>> I agree with Warner that what you are proposing is not correct.  It weakens the
>>>> contract between the disk layer and the upper layers, making it less clear who is
>>>> responsible for retries and less clear what “EIO” means.  That contract is already
>>>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>>>> are working on a plan to fix that.
>>> 
>>> Well...  I do realize now that there is some problem in this area, both you and
>>> Warner mentioned it.  But knowing that it exists is not the same as knowing what
>>> it is :-)
>>> I understand that it could be rather complex and not easy to describe in a short
>>> email…
>>> 
>> 
>> There are too many questions to ask, I will do my best to keep the conversation
>> logical.  First, how do you propose to distinguish between EIO due to a lengthy
>> set of timeouts, vs EIO due to an immediate error returned by the disk hardware?
> 
> At what layer / component?
> If I am the issuer of the request then I know how I issued that request and what
> kind of request it was.  If I am an intermediate layer, then what do I care.
> 

I don’t understand your response, I think we are talking about 2 different things.

>> CAM has an extensive table-driven error recovery protocol who’s purpose is to
>> decide whether or not to do retries based on hardware state information that is
>> not made available to the upper layers.  Do you have a test case that demonstrates
>> the problem that you’re trying to solve?  Maybe the error recovery table is wrong
>> and you’re encountering a case that should not be retried.  If that’s what’s going on,
>> we should fix CAM instead of inventing a new work-around.
> 
> Let's assume that I am talking about the case of not being able to read an HDD
> sector that is gone bad.
> Here is a real world example:
> 
> Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
> 00 00 58 62 40 2c 00 00 08 00 00
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
> error: 40 (UNC )
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
> 00 00 00
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
> Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1
...
> I do not see anything wrong in what CAM / ahci /ata_da did here.
> They did what I would expect them to do.  They tried very hard to get data that
> I told them I need.

Two things I see here.  The first is that the drive is trying for 2 seconds to get good
data off of the media.  The second is that it’s failing and reporting the error as
uncorrectable.  I think that retries at the OS/driver
layer are therefore useless; the drive is already doing a bunch of its own retries and
is failing, and is telling you that it’s failing.  In the past, the conventional wisdom has
been to do retries, because 30 years ago drives had minimal firmware and didn’t do
a good job at managing data integrity.  Now they do an extensive amount of
analysis-driven error correction and retries, so I think it’s time to change the 
conventional wisdom.  I’d propose that for direct-attach SSDs and HDDs we treat this
error as non-retriable.  Normally this would be a one-line change, but I think it needs
to be nuanced to distinguish between optical drives, simple flash media drives, and
regular HDDs and SSDs.

An interim solution would be to just set the kern.cam.ada.retry_count to 0.

> 
> Timestamp of the first error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took additional 9 seconds to finally produce EIO.
> That disk is a part of a ZFS mirror.  If the request was failed after the first
> attempt, then ZFS would be able to get the data from a good disk much sooner.

It’s the fault of ZFS for not reading all members in parallel.  I ask people to
consider and possibly accept that ZFS is not perfect.

> 
> And don't take me wrong, I do NOT want CAM or GEOM to make that decision by
> itself.  I want ZFS to be able to tell the lower layers when they should try as
> hard as they normally do and when they should report an I/O error as soon as it
> happens without any retries.
> 

Again, I’m not sure I understand your answer.  It is the job of the disk drivers to
reliably complete storage requests, not to just try once and then assume that an
upper layer will figure things out.  As I said in my previous email, the upper layers,
and in this case I mean everything above the drivers and CAM, do not have the
error information to know why something failed, and thus have only limited
knowledge of what to do to remediate a failure.  The error recovery knowledge is
in CAM and the drivers.

It sounds like you’re proposing to make these simple, dumb layers and move the
error recovery up to ZFS. I understand what you’re saying about wanting this
being opt-in and not affecting
other filesystems, but I think that you are missing the point of what CAM does and
provides.  By definition the ‘da’ and ‘ada’ drivers provide reliable I/O transactions to
the drive.  That means that they handle error recovery, retries, etc.  If you do not
want reliable I/O transactions, then you should not use the da/ada drivers, and
instead should either write your own periph driver (let’s call it “zfsda”), or you
should have ZFS communicate with the disks via the pass driver.  That is what
userland tools like cdparanoia do when they want to bypass the traditional
operational characteristics and constraints of the standard periphs.

Think of it this way, would you propose a flag to a TCP socket that tells TCP to
not do retransmits?  No!  If you wanted that semantic then you’d use UDP
and invent your own retry/reliability layer.  I think it’s a big job to re-invent error
recovery into ZFS, while it would be relatively easy to
adapt the CAM retry policy to the behavior of modern disks, i.e. believe them
when they say that there’s an uncorrectable read error and don’t blindly attempt
a retry.

>> Second, what about disk subsystems that do retries internally, out of the control
>> of the FreeBSD driver?  This would include most hardware RAID controllers.
>> Should what you are proposing only work for a subset of the kinds of storage
>> systems that are available and in common use?
> 
> Yes.  I do not worry about things that are beyond my control.
> Those subsystems would behave as they do now.  So, nothing would get worse.

It gets worse because it builds inconsistency into the system and makes expectations
for behavior unclear.  That turns into more hacky and incorrect code over time.

> 
>> Third, let’s say that you run out of alternate copies to try, and as you stated
>> originally, that will force you to retry the copies that had returned EIO.  How
>> will you know when you can retry?  How will you know how many times you
>> will retry?  How will you know that a retry is even possible?
> 
> I am continuing to use ZFS as an example.  It already has all the logic built
> in.  If all vdev zio-s (requests to mirror members as an example) fail, then
> their parent zio (a logical read from the mirror) will be retried (by ZFS) and
> when ZFS retries it sets a special flag (ZIO_FLAG_IO_RETRY) on that zio and on
> its future child zio-s.
> 
> Essentially, my answer is you have to program it correctly, there is no magic.

Again, I think that you’re missing the point of what I’m asking.  All that ZFS
can see is “EIO”.  Maybe the EIO was generated because of an uncorrectable
media error, and all attempts to retry will be futile.  Maybe it was generated
because of a correctable error, and a retry might be useful.  There are many,
many, many error cases that are tested for an handled in CAM.  You are collapsing
them down to 1 case, and I don’t agree with that.

> 
>> Should the retries
>> be able to be canceled?
> 
> I think that this is an orthogonal question.
> I do not have any answer and I am not ready to discuss this at all.
> 

Please be ready to discuss this.

>> Why is overloading EIO so bad?  brelse() will call bdirty() when a BIO_WRITE
>> command has failed with EIO.  Calling bdirty() has the effect of retrying the I/O.
>> This disregards the fact that disk drivers only return EIO when they’ve decided
>> that the I/O cannot be retried.  It has no termination condition for the retries, and
>> will endlessly retry I/O in vain; I’ve seen this quite frequently.  It also disregards
>> the fact that I/O marked as B_PAGING can’t be retried in this fashion, and will
>> trigger a panic.  Because we pretend that EIO can be retried, we are left with
>> a system that is very fragile when I/O actually does fail.  Instead of adding
>> more special cases and blurred lines, I want to go back to enforcing strict
>> contracts between the layers and force the core parts of the system to respect
>> those contracts and handle errors properly, instead of just retrying and
>> hoping for the best.
> 
> So, I suggest that the buffer layer (all the b* functions) does not use the
> proposed flag.  Any problems that exist in it should be resolved first.
> ZFS does not use that layer.
> 

I think it’s foolish to propose a change in the disk API and then say “only
ZFS should use this”

>>> But then, this flag is optional, it's off by default and no one is forced to
>>> used it.  If it's used only by ZFS, then it would not be horrible.
>>> Unless it makes things very hard for the infrastructure.
>>> But I am circling back to not knowing what problem(s) you and Warner are
>>> planning to fix.
>>> 
>> 
>> Saying that a feature is optional means nothing; while consumers of the API
>> might be able to ignore it, the producers of the API cannot ignore it.  It is
>> these producers who are sick right now and should be fixed, instead of
>> creating new ways to get even more sick.
> 
> I completely agree.
> But which producers of the API do you mean specifically?
> So far, you mentioned only the consumer level problems with the b-layer.
> 
> Having said all of the above, I must admit one thing.
> When I proposed BIO_NORETRY I had only the simplest GEOM topology in mind:
> ZFS -> [partition] -> disk.
> Now that I start to think about more complex topologies I am not really sure how
> the flag should be handled by geom-s with complex internal behavior.  If that
> can be implemented reasonably and clearly, if the flag will create a big mess.
> E.g., things like gmirrors on top of gmirrors and so on.
> Maybe the flag, if it ever accepted, should never be propagated automatically.
> Only geom-s that are aware of it should propagate or request it.  That should be
> safer.
> 

Yes, it gets chaotic and unreliable once you have multiple layers each thinking
that they know better than the previous layer.  That is the situation that FreeBSD
is currently in with CAM, GEOM, and VFS-BIO.  I do not propose to make it even
more chaotic.  Through just basic rules of abstraction, basic computer science
principles, it follows that the layers of software closest to the operation (in this
case, the hardware) have the least abstraction and the most amount of
information on how to act.  Saying that ZFS knows best how the hardware works,
when it has no access to the actual control channel information of the hardware,
is confusing and illogical to me.

I think that there’s a legitimate problem with CAM error recovery trying too hard
to do retries when the disk is clearly telling it that it’s futile.  I don’t have a quick
patch for that because I would like to think how to make it dependent on the
media type.  However, it’s something that I’m willing to discuss further and work
on.

Scott