Re: git: c01af41c3c8f - main - ata_da: add quirk to disable NCQ TRIM for Samsung 860/870 SSDs

From: Warner Losh <imp_at_bsdimp.com>
Date: Mon, 19 Feb 2024 17:00:33 UTC
On Mon, Feb 19, 2024 at 9:40 AM Andriy Gapon <avg@freebsd.org> wrote:

> On 19/02/2024 18:06, cglogic wrote:
> > Hello Warner,
> >
> > Thank you for detailed explanation. Looks like it has no sense for me to
> worry
> > about disabling NCQ TRIM on that old PC.
> >
> > However it would be great to have possibility to apply such hardware bug
> > workarounds for affected controller/drive pairs only.
> > Or, at least, have possibility to disable such quirks on per drive basis
> via
> > sysctl. I didn't find such a possibility.
>
> Yeah, we do not have a way to control ata_da (and scsi_da)  quirks via
> configuration.  It would be nice to have it, indeed.
>
> Also, perhaps what ADA_Q_NCQ_TRIM_BROKEN could do is prefer a different
> delete
> method by default.  That way, NCQ_DSM_TRIM option would not disappear
> completely
> and it would be possible to set with kern.cam.ada.N.delete_method sysctl.
>

Yea. In this case, though, the method is known to be bad. This badness
might not be apparent
until data is corrupted. Things seem to work. This is very dangerous to
allow, even with a
I know what I'm doing flag. We could do it, but I'm very uneasy given the
data corruption
that causes us to disable it in the first place...

Given what I read https://bugzilla.kernel.org/show_bug.cgi?id=203475 we
want NCQ Trim
disabled entirely for these drives. Sometimes they work, until they don't,
then people report
similar errors to what Andriy reported.

Warner


> > On Monday, February 19th, 2024 at 5:18 PM, Warner Losh <imp@bsdimp.com>
> wrote:
> >>
> >>
> >> On Mon, Feb 19, 2024, 4:29 AM cglogic <cglogic@protonmail.com
> >> <mailto:cglogic@protonmail.com>> wrote:
> >>
> >>     On Monday, February 19th, 2024 at 1:03 PM, Andriy Gapon
> <avg@FreeBSD.org>
> >>     wrote:
> >>
> >>     > On 19/02/2024 12:47, cglogic wrote:
> >>     >
> >>     > > Hello Andriy,
> >>     > >
> >>     > > I use ZFS with autotrim enabled on Samsung 860 PRO connected to
> Intel
> >>     AHCI SATA controller for 9 years without any issue.
> >>     >
> >>     >
> >>     > I think that it was released in 2018?
> >>
> >>     You are right, it's 9 years old computer, SSD was added later.
> >>
> >>     >
> >>     > > Can I disable this quirk locally or have I revert the patch to
> >>     continue use NCQ TRIM with this SSD?
> >>     >
> >>     >
> >>     > I don't think that there is a way to disable the quirk, so you'll
> have
> >>     to revert.
> >>     > Note that ATA TRIM is still enabled, it's only NCQ TRIM that got
> disabled.
> >>
> >>     From my understanding ATA TRIM is non-queued TRIM as opposed to NCQ
> TRIM.
> >>     With only ATA TRIM enabled, drive will have increased latency when
> large
> >>     amount of files deleted.
> >>
> >>     However I'm not sure how ZFS handles TRIMs, maybe it groups several
> TRIMs
> >>     and then sends when drive is idle.
> >>     Another question is how this affects TRIM enabled swap partition.
> >>
> >>     In general, if I'm correct here, disabling NCQ TRIM should not be
> an issue
> >>     for desktop usage, except for large git repository updates, etc.
> >>
> >>
> >> In general, it won't matter a ton. Unless you are seeing stalls because
> of
> >> read/write I/O waiting behind trims that have to go to the drive one at
> a
> >> time, the effects are in the noise. If you are seeing that, though, the
> >> effects can be quite helpful in terms of latency... assuming the drive
> is
> >> cooperative. The benefits of it vary a lot from drive to drive as well.
> Some
> >> early drives still serialize the trims, so the latency benefit is not
> so
> >> great. Some enterprise drives do trims instantly and keep some internal
> state
> >> that is helped a lot by them. But if NCQ trims are unstable, we likely
> should
> >> err on the side of disabling.
> >>
> >> The errors that Andriy was seeing typically are cable or controller
> issues. In
> >> his case, it was a controller issue (apparently, and surprisingly,
> paired with
> >> a specific type of drive). Linux has a specific workaround for the
> pair, which
> >> is unusual. But there's two issues at play.
> >>
> >> The first issue is that these drives should have NCQ Trim turned off,
> >> according to Linux's commit log. This may be firmware revision based or
> may
> >> just be some people are luckier with newer firmware (it's hard to say
> from the
> >> mixed reports, but they read more like luck than version for the
> sampling I
> >> did). So Andriy's change is good. The second issue, and one I was
> confused on
> >> earlier, is that Linux goes a step further and disables NCQ entirely
> for these
> >> old controllers for 860 and 870 drives. This isn't too surprising. We
> have a
> >> workaround for PMP for these controllers already.... One can disable
> NCQ
> >> entirely by setting the tags to 0... Given the specificity of the
> problem, and
> >> the age of the controllers, I'm inclined to not implement the Linux
> workaround
> >> for this specific pair.
>
>
> --
> Andriy Gapon
>
>