[Bug 281528] SCSI tag-queue error with Samsung 870 EVO SSD [incl. workaround]
Date: Mon, 16 Sep 2024 06:46:39 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281528

            Bug ID: 281528
           Summary: SCSI tag-queue error with Samsung 870 EVO SSD [incl. workaround]
           Product: Base System
           Version: Unspecified
          Hardware: Any
                OS: Any
            Status: New
          Keywords: cam, performance
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: wbe@psr.com

SUMMARY:

Problem: A brand new, 4/2024 Samsung 870 EVO SSD connected via AHCI and SATA II to a Supermicro motherboard gets parity/CRC, ATA Status, and other errors. The SSD itself, via smartctl, reports no read/write or bad sector errors, just interface/CRC errors.

Solution/workaround: "camcontrol negotiate $theSSD -T disable" to disable command queueing/tagging.

DESCRIPTION:

[This is an edited version of articles I posted to comp.unix.bsd.freebsd.misc.]

On a system running FreeBSD 14.1-RELEASE (though I don't think that matters), I connected a Samsung 870 EVO SSD via AMD-AHCI and SATA II (3.0Gb/s). The SSD is rated for SATA III (6.0Gb/s). Temperature is fine (~29C).

Lots of errors occurred (see log extracts below). ZFS, for example, got about 180 write errors while resilvering ~80GB to the new/empty drive. Most errors seemed to be retryable and succeeded on the second try. My reading of the error messages and the output from smartctl -x indicated some kind of interface problem.

[Ignore the ada0/ada1 difference: that's my doing.]

----------
[sample log entries for some read errors:] [edited]

Aug 21 03:01:24: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 78 ff 64 40 13 00 00 00 00 00
Aug 21 03:01:24: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 03:01:24: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Aug 21 03:01:25: ahcich0: Timeout on slot 9 port 0
Aug 21 03:01:25: ahcich0: is 04000000 cs 00000200 ss 00000000 rs 00000200 tfd 451 serr 00400000 cmd 0000e917
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 58 00 65 40 13 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 c0 36 65 40 13 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): ATA status: 00 ()
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 c8 ff 64 40 13 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): ATA status: 00 ()
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): RES: 00 00 00 00 00 00 00 00 00 00 00
Aug 21 03:01:25: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:25 ZFS[1332]: vdev I/O failure, path=/dev/ada0p3 offset=149417648128 size=4096 error=5
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 38 88 e4 80 40 13 00 00 00 00 00
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 10 48 29 40 05 00 00 00 00 00
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 c0 e4 80 40 13 00 00 00 00 00
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 03:01:26: (ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain.

----------
[sample log entries for the write errors during resilvering:] [edited]

Aug 21 00:33:01: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 e0 b2 e7 40 02 00 00 00 00 00
Aug 21 00:33:01: (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 00:33:01: (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Aug 21 00:33:02: ahcich1: Timeout on slot 19 port 0
Aug 21 00:33:02: ahcich1: is 04000000 cs 00080000 ss 00000000 rs 00080000 tfd 451 serr 00400000 cmd 0000f317
Aug 21 00:33:02: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 48 f0 b2 e7 40 02 00 00 00 00 00
Aug 21 00:33:02: (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Aug 21 00:33:02: (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Aug 21 00:33:02 ZFS[1322]: vdev I/O failure, path=/dev/ada1p3 offset=7774244864 size=36864 error=5
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 78 2e f4 40 02 00 00 00 00 00
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 58 2e f4 40 02 00 00 00 00 00
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
Aug 21 00:33:05: (ada1:ahcich1:0:0:0): Retrying command, 3 more tries remain

[end of log entries]

----------
Here's smartctl -x output, keeping only what looked "interesting"/relevant:

SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)

199 CRC_Error_Count      -OSRCK   099   099   000    -    64
235 POR_Recovery_Count   -O--C-   099   099   000    -    7
241 Total_LBAs_Written   -O--CK   099   099   000    -    213396105

0x06  0x018  4              64  ---  Number of Interface CRC Errors

[WBE note: the 65535+ numbers below may be the result of my not knowing about the "-F samsung2" option to smartctl at the time. Currently (as I submit this), those numbers are 0s.]

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            2  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2       65535+  R_ERR response for non-data FIS
0x0006  2       65535+  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            5  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2       65535+  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2       65535+  R_ERR response for host-to-device non-data FIS, non-CRC

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

----------
The significant lines from the errors above were:

> Aug 21 03:01:24: (ada0:ahcich0:0:0:0): CAM status: Auto-Sense Retrieval Failed
> Aug 21 03:01:24: (ada0:ahcich0:0:0:0): Error 5, Unretryable error
...
> Aug 21 03:01:25: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
...
> Aug 21 03:01:26: (ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
...
> 0x06  0x018  4              64  ---  Number of Interface CRC Errors

Results:

* Test 1: It's not a bad data cable: Some people suggested the problem might be a bad cable. I ordered two new ones (SATA III). Tried both. Result: no improvement. It was cheap to try.

* Test 2: leave queueing enabled and reduce the number of tags from 32 to 2. Didn't help: the errors continued to happen.

* Fix 1: Disable command queueing ("camcontrol negotiate $theSSD -T disable").

* Fix 2: Connect the SSD with a USB-to-SATA adapter cable. Perhaps this works because there's no command queueing over USB?

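For anyone who wants to script Fix 1 or repeat Test 2, here is a minimal Python sketch that only builds and (optionally) runs the camcontrol command lines. The "negotiate ... -T disable" form is the one quoted in Fix 1. The "camcontrol tags <dev> -N <n>" form for Test 2 is an assumption: the report does not show the exact command used. The helper names and the "ada0" device are illustrative only.

```python
import shlex
import subprocess
from typing import List


def disable_queueing_argv(device: str) -> List[str]:
    """Fix 1: the command quoted in the report, with the device filled in."""
    return ["camcontrol", "negotiate", device, "-T", "disable"]


def limit_tags_argv(device: str, tags: int) -> List[str]:
    """Test 2: cap the number of tags.

    Assumed command form; the report does not show what was typed.
    """
    if tags < 1:
        raise ValueError("tags must be at least 1")
    return ["camcontrol", "tags", device, "-N", str(tags)]


def run_or_print(argv: List[str], dry_run: bool = True) -> str:
    """Print the command line; execute it only when dry_run is False."""
    line = shlex.join(argv)
    if dry_run:
        print(line)
    else:
        # Needs root on the FreeBSD host that owns the device.
        subprocess.run(argv, check=True)
    return line


if __name__ == "__main__":
    # Dry run by default, so nothing on the system is changed.
    run_or_print(disable_queueing_argv("ada0"))
    run_or_print(limit_tags_argv("ada0", 2))
```

The default is a dry run so the commands can be reviewed before anything touches the drive; pass dry_run=False to actually apply them.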
It was suggested that I post this here, as perhaps FreeBSD can add a quirk for these drives. Even if that's not appropriate, anyone else having this problem can now find this workaround here on bugzilla (current USENET articles are no longer archived by Google).

Of course, Samsung may some day come out with new firmware that fixes this problem, in which case the quirk test might need to become "with firmware older than ____".

HTH,
-WBE
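
Addendum for anyone cross-checking the log extracts above: the ACB bytes of the failed READ/WRITE_FPDMA_QUEUED commands can be decoded into an LBA and a transfer length. The sketch below assumes that the 12 bytes are printed in the order command, features, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count, sector_count_exp, that the FPDMA sector count is carried in the features bytes, and that sectors are 512 bytes. None of this is stated in the report. The decoded lengths do match the sizes in the two ZFS "vdev I/O failure" lines, which suggests the layout is right. The function name is illustrative.

```python
from typing import Dict

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def decode_fpdma_acb(acb: str) -> Dict[str, int]:
    """Decode the 'ACB:' hex bytes of a READ/WRITE_FPDMA_QUEUED log line.

    The byte order used here is an assumption about how the kernel
    prints the ATA command block; see the lead-in above.
    """
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes")
    (command, features, lba_low, lba_mid, lba_high, _device,
     lba_low_exp, lba_mid_exp, lba_high_exp, features_exp,
     _sector_count, _sector_count_exp) = b
    if command not in (0x60, 0x61):  # READ / WRITE FPDMA QUEUED opcodes
        raise ValueError("not an FPDMA QUEUED command")
    # Assemble the 48-bit LBA from its six bytes, low byte first.
    lba = (lba_low | lba_mid << 8 | lba_high << 16 |
           lba_low_exp << 24 | lba_mid_exp << 32 | lba_high_exp << 40)
    # For FPDMA commands the sector count is in the features bytes;
    # a count of 0 means 65536 sectors.
    sectors = (features | features_exp << 8) or 65536
    return {"lba": lba, "sectors": sectors, "bytes": sectors * SECTOR_BYTES}


if __name__ == "__main__":
    # The two commands logged just before the ZFS failure lines,
    # paired with the offsets ZFS reported.
    samples = [
        ("60 08 58 00 65 40 13 00 00 00 00 00", 149417648128),
        ("61 48 f0 b2 e7 40 02 00 00 00 00 00", 7774244864),
    ]
    for acb, zfs_offset in samples:
        d = decode_fpdma_acb(acb)
        delta = d["lba"] - zfs_offset // SECTOR_BYTES
        print(d, "LBA minus ZFS offset (sectors):", delta)
```

Under these assumptions both commands decode to the sizes ZFS reported (4096 and 36864 bytes), and the difference between the disk LBA and the ZFS offset is the same 33554984 sectors in both cases. That would be consistent with a fixed partition start plus the ZFS label area, and it ties each ZFS failure to a specific failed command in the log.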