add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Thu Dec 14 09:41:01 UTC 2017

This email is rather large, with a lot of context.
I am replying to just a single piece here.

On 27/11/2017 17:29, Scott Long wrote:
> 
> 
>> On Nov 25, 2017, at 10:36 AM, Andriy Gapon <avg at FreeBSD.org> wrote:
>> Let's assume that I am talking about the case of not being able to read an HDD
>> sector that is gone bad.
>> Here is a real world example:
>>
>> Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1
>> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
>> 00 00 58 62 40 2c 00 00 08 00 00
>> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
>> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
>> error: 40 (UNC )
>> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
>> 00 00 00
>> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
>> Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1
> ...
>> I do not see anything wrong in what CAM / ahci /ata_da did here.
>> They did what I would expect them to do.  They tried very hard to get data that
>> I told them I need.
> 
> Two things I see here.  The first is that the drive is trying for 2 seconds to get good
> data off of the media.  The second is that it’s failing and reporting the error as
> uncorrectable.  I think that retries at the OS/driver
> layer are therefore useless; the drive is already doing a bunch of its own retries and
> is failing, and is telling you that it’s failing.  In the past, the conventional wisdom has
> been to do retries, because 30 years ago drives had minimal firmware and didn’t do
> a good job at managing data integrity.  Now they do an extensive amount of
> analysis-driven error correction and retries, so I think it’s time to change the 
> conventional wisdom.  I’d propose that for direct-attach SSDs and HDDs we treat this
> error as non-retriable.  Normally this would be a one-line change, but I think it needs
> to be nuanced to distinguish between optical drives, simple flash media drives, and
> regular HDDs and SSDs.
> 
> An interim solution would be to just set the kern.cam.ada.retry_count to 0.

I went through some ada errors in my logs, and I think that there can
be a difference between HDDs and SSDs too.  I thought that the HDD internal
retry mechanism would be thorough enough but, to my surprise, I see that the
majority of read failures are recovered by the first retry.  Sometimes it is the
second retry that succeeds; in all other cases the standard four retries do
not help.
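
For anyone who wants to check their own logs in a similar way, here is a
minimal sketch of a tally.  It is not the method used for the numbers above.
It assumes that the log contains "Retrying command" lines like the one quoted
earlier, and that exhausted retries are reported with a line of the form
"Error 5, Retries exhausted"; that second wording is an assumption and should
be checked against the actual logs.  The function name tally_retries is
hypothetical.

```python
import re
from collections import Counter

# Matches a CAM peripheral prefix such as "(ada0:ahcich0:0:0:0): " followed by
# one of the two events of interest. The "Retries exhausted" wording is an
# assumption about the log format; adjust it to match the real messages.
EVENT = re.compile(
    r"\((ada\d+):[^)]*\): (Retrying command|Error \d+, Retries exhausted)"
)

def tally_retries(lines):
    """Return two Counters keyed by device name: logged retries, and
    commands whose retries were reported as exhausted."""
    retries = Counter()
    exhausted = Counter()
    for line in lines:
        m = EVENT.search(line)
        if m is None:
            continue  # not a retry-related line
        dev, event = m.groups()
        if event == "Retrying command":
            retries[dev] += 1
        else:
            exhausted[dev] += 1
    return retries, exhausted

if __name__ == "__main__":
    import sys
    retries, exhausted = tally_retries(sys.stdin)
    for dev in sorted(set(retries) | set(exhausted)):
        print(dev, "retries:", retries[dev], "exhausted:", exhausted[dev])
```

It could be fed with something like "python3 tally.py < /var/log/messages".
Note that this only counts events per device.  Telling a first-retry recovery
from a second-retry recovery would also require grouping the errors by
command, which this sketch does not attempt.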

So, it may be too early to set ada.retry_count to 0 for all types of supported
disks.

-- 
Andriy Gapon