[Bug 281528] SCSI tag-queue error with Samsung 870 EVO SSD [incl. workaround]

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 16 Sep 2024 06:46:39 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281528

            Bug ID: 281528
           Summary: SCSI tag-queue error with Samsung 870 EVO SSD [incl.
                    workaround]
           Product: Base System
           Version: Unspecified
          Hardware: Any
                OS: Any
            Status: New
          Keywords: cam, performance
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: wbe@psr.com

SUMMARY:

Problem: A brand-new (4/2024) Samsung 870 EVO SSD connected via AHCI and
SATA II to a Supermicro motherboard produces parity/CRC, ATA Status, and
other errors.  The SSD itself, via smartctl, reports no read/write or
bad-sector errors, only interface/CRC errors.

Solution/workaround: "camcontrol negotiate $theSSD -T disable" to disable
tagged command queueing (NCQ).


DESCRIPTION:

[This is an edited version of articles I posted to comp.unix.bsd.freebsd.misc.]

On a system running FreeBSD 14.1-RELEASE (though I don't think that matters),
I connected a Samsung 870 EVO SSD via AMD-AHCI and SATA II (3.0Gb/s).
The SSD is rated for SATA III (6.0Gb/s).  Temperature is fine (~29C).

Many errors occurred (see log extracts below).  ZFS, for example, got about
180 write errors while resilvering ~80GB to the new/empty drive.  Most
errors seemed to be retryable and succeeded on the second try.  My reading
of the error messages and the output from smartctl -x is that this is
some kind of interface problem.

[Ignore the ada0/ada1 difference: that's my doing.]
----------
[sample log entries for some read errors:] [edited]

Aug 21 03:01:24: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 78 ff 64 40 13 00 00 00 00 00
Aug 21 03:01:24: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 03:01:24: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Aug 21 03:01:25: ahcich0: Timeout on slot 9 port 0
Aug 21 03:01:25: ahcich0: is 04000000 cs 00000200 ss 00000000 rs 00000200 tfd 451 serr 00400000 cmd 0000e917
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 00 65 40 13 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 c0 36 65 40 13 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): ATA status: 00 ()
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 c8 ff 64 40 13 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): ATA status: 00 ()
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:25 ZFS[1332]: vdev I/O failure, path=/dev/ada0p3 offset=149417648128 size=4096 error=5
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 38 88 e4 80 40 13 00 00 00 00 00
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 10 48 29 40 05 00 00 00 00 00
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 c0 e4 80 40 13 00 00 00 00 00
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
----------
[sample log entries for the write errors during resilvering:] [edited]

Aug 21 00:33:01: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 b2 e7 40 02 00 00 00 00 00
Aug 21 00:33:01: (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 00:33:01: (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Aug 21 00:33:02: ahcich1: Timeout on slot 19 port 0
Aug 21 00:33:02: ahcich1: is 04000000 cs 00080000 ss 00000000 rs 00080000 tfd 451 serr 00400000 cmd 0000f317
Aug 21 00:33:02: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 48 f0 b2 e7 40 02 00 00 00 00 00
Aug 21 00:33:02: (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 00:33:02: (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Aug 21 00:33:02 ZFS[1322]: vdev I/O failure, path=/dev/ada1p3 offset=7774244864 size=36864 error=5
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 78 2e f4 40 02 00 00 00 00 00
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 58 2e f4 40 02 00 00 00 00 00
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain

[end of log entries]
----------
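
[Editorial sketch, not part of the original logs.]  The ACB bytes in the
FPDMA entries above can be turned back into an LBA and a transfer length,
which helps match a failed command to a ZFS "vdev I/O failure" line.  The
byte order used here is an assumption about how CAM prints ATA command
blocks; the helper name decode_fpdma_acb and the 512-byte sector size are
also assumptions.

```python
def decode_fpdma_acb(acb_hex, sector_size=512):
    """Decode the 12 ACB bytes logged for READ/WRITE_FPDMA_QUEUED.

    Assumed byte order (not stated in the report): command, features,
    lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp,
    lba_high_exp, features_exp, sector_count, sector_count_exp.
    """
    b = [int(x, 16) for x in acb_hex.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes, got %d" % len(b))
    names = {0x60: "READ_FPDMA_QUEUED", 0x61: "WRITE_FPDMA_QUEUED"}
    # NCQ commands carry the transfer length in the features fields.
    sectors = b[1] | (b[9] << 8)
    # 48-bit LBA: the three low bytes, then the three "exp" bytes.
    lba = b[2] | b[3] << 8 | b[4] << 16 | b[6] << 24 | b[7] << 32 | b[8] << 40
    return {
        "command": names.get(b[0], hex(b[0])),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


if __name__ == "__main__":
    for acb in ("60 08 58 00 65 40 13 00 00 00 00 00",
                "61 48 f0 b2 e7 40 02 00 00 00 00 00"):
        print(decode_fpdma_acb(acb))
```

Under these assumptions the two commands decode to 4096 and 36864 bytes,
the same sizes as the two ZFS failures logged in the same second, which
suggests the assumed layout is at least plausible.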

Here's smartctl -x output, keeping only what looked "interesting"/relevant:

SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)

199 CRC_Error_Count         -OSRCK   099   099   000    -    64
235 POR_Recovery_Count      -O--C-   099   099   000    -    7
241 Total_LBAs_Written      -O--CK   099   099   000    -    213396105

0x06  0x018  4              64  ---  Number of Interface CRC Errors

[WBE note: the 65535+ numbers below may be the result of my not knowing
about the "-F samsung2" option to smartctl at the time.  Currently (as I
submit this), those numbers are 0s.]

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            2  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2        65535+ R_ERR response for non-data FIS
0x0006  2        65535+ R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            5  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2        65535+ Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2        65535+ R_ERR response for host-to-device non-data FIS, non-CRC

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled
----------
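
[Editorial sketch, not part of the smartctl output.]  In the Phy Event
Counters table above, smartctl appends "+" to a value that has reached the
maximum for its counter size (65535 for a 2-byte counter), so those rows
show "at least this many", not an exact count.  A minimal parser that
flags such rows could look like this; the function name parse_phy_counters
is made up for the example, and the reporter's caveat about "-F samsung2"
still applies to the values themselves.

```python
import re

# One table row: ID, size in bytes, value with optional "+", description.
ROW = re.compile(r"^(0x[0-9a-f]{4})\s+(\d+)\s+(\d+)(\+?)\s+(.*)$")


def parse_phy_counters(text):
    """Return one dict per counter row found in smartctl -x output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # header lines and unrelated output are skipped
        ident, size, value, plus, desc = m.groups()
        rows.append({
            "id": int(ident, 16),
            "size": int(size),
            "value": int(value),
            "saturated": plus == "+",
            "description": desc,
        })
    return rows


if __name__ == "__main__":
    sample = """\
0x0001  2            2  Command failed due to ICRC error
0x0005  2        65535+ R_ERR response for non-data FIS
0x0009  2            5  Transition from drive PhyRdy to drive PhyNRdy
"""
    for row in parse_phy_counters(sample):
        if row["value"]:
            print(row)
```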

The significant lines from the errors above were:

> Aug 21 03:01:24: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
> Aug 21 03:01:24: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
 ...
> Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
 ...
> Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
 ...
> 0x06  0x018  4              64  ---  Number of Interface CRC Errors
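
[Editorial sketch, not part of the original report.]  The ahcich register
dumps that follow each "Timeout on slot N" line in the logs can also be
decoded.  The sketch below treats "cs" as a bitmask of command slots and
splits "tfd" into ATA status (low byte) and ATA error (next byte).  The
bit names are an assumed reading of the AHCI port interrupt status and
SATA SError registers, not something the driver prints; the function name
decode_ahci_dump is made up for the example.

```python
import re

# Assumed bit meanings (AHCI PxIS and SATA SError); only a few are listed.
IS_BITS = {26: "interface non-fatal error", 27: "interface fatal error",
           30: "task file error"}
SERR_BITS = {19: "10b/8b decode error", 20: "disparity error",
             21: "CRC error", 22: "handshake error",
             23: "link sequence error"}


def set_bits(value, width=32):
    """List the positions of the bits that are set in value."""
    return [i for i in range(width) if (value >> i) & 1]


def decode_ahci_dump(line):
    """Decode an 'is ... cs ... tfd ... serr ...' dump (wrapped or not)."""
    pairs = re.findall(r"\b(is|cs|ss|rs|tfd|serr|cmd)\s+([0-9a-f]+)", line)
    regs = {name: int(value, 16) for name, value in pairs}
    return {
        "slots": set_bits(regs["cs"]),
        "is_flags": [IS_BITS.get(i, "bit %d" % i)
                     for i in set_bits(regs["is"])],
        "serr_flags": [SERR_BITS.get(i, "bit %d" % i)
                       for i in set_bits(regs["serr"])],
        "ata_status": regs["tfd"] & 0xFF,
        "ata_error": (regs["tfd"] >> 8) & 0xFF,
    }


if __name__ == "__main__":
    dump = ("ahcich0: is 04000000 cs 00000200 ss 00000000 rs 00000200 "
            "tfd 451 serr 00400000 cmd 0000e917")
    print(decode_ahci_dump(dump))
```

For both dumps in the logs, the single bit set in "cs" matches the slot
number in the preceding timeout line (9 and 19), and under the assumed bit
names the status points at the link rather than at the media, which would
be consistent with the interface CRC counts from smartctl.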

Results:

* Test 1: Replace the data cable.
   Some people suggested the problem might be a bad cable, so I ordered
   two new SATA III cables and tried both.
   Result: no improvement, so it's not a bad cable.  It was cheap to try.

* Test 2: Leave queueing enabled and reduce the number of tags from 32 to 2.
   Result: no improvement; the errors continued to happen.

* Fix 1: Disable command queueing ("camcontrol negotiate $theSSD -T disable").

* Fix 2: Connect the SSD with a USB-to-SATA adapter cable.
   Perhaps this works because there's no command queueing over USB?
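
[Editorial sketch, not part of the original report.]  For anyone scripting
the steps above, the two camcontrol invocations (the tag reduction from
Test 2 and the workaround from Fix 1) can be wrapped as below.  The device
name "ada0" and the helper names are placeholders, and the sketch assumes
the setting has to be reapplied after a reboot, which the report does not
say either way.  It defaults to a dry run that only prints the command.

```python
import subprocess


def camcontrol_argv(action, device, tags=None):
    """Build the camcontrol command line for one of the steps above."""
    if action == "disable-queueing":      # Fix 1 in the report
        return ["camcontrol", "negotiate", device, "-T", "disable"]
    if action == "enable-queueing":       # undo Fix 1
        return ["camcontrol", "negotiate", device, "-T", "enable"]
    if action == "set-tags":              # Test 2 in the report
        if tags is None:
            raise ValueError("set-tags needs a tag count")
        return ["camcontrol", "tags", device, "-N", str(tags)]
    raise ValueError("unknown action: %s" % action)


def run_camcontrol(action, device, tags=None, dry_run=True):
    """Print the command (dry run) or execute it; needs root on FreeBSD."""
    argv = camcontrol_argv(action, device, tags)
    if dry_run:
        print(" ".join(argv))
    else:
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    run_camcontrol("set-tags", "ada0", tags=2)
    run_camcontrol("disable-queueing", "ada0")
```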

It was suggested that I post this here, as perhaps FreeBSD can add a quirk for
these drives.  Even if that's not appropriate, anyone else having this problem
can now find this workaround here on Bugzilla (current Usenet articles are no
longer archived by Google).

Of course, Samsung may some day come out with new firmware that fixes this
problem, in which case the quirk test might need to become "with firmware older
than ____".

HTH,
 -WBE
