[Bug 277992] mpr and possible trim issues

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 27 Mar 2024 16:00:06 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277992

            Bug ID: 277992
           Summary: mpr and possible trim issues
           Product: Base System
           Version: 14.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: mike@sentex.net

The thread
https://lists.freebsd.org/archives/freebsd-hardware/2024-March/000094.html has
most of the details. 

In summary, a set of WD Blue SA510 SSDs with the latest firmware as of Mar 2024
will eventually start throwing errors and detach from the controller when I
copy and then destroy a zfs dataset with several million files.  It feels like
a TRIM issue, but I am not sure.  With the same disks on the onboard SATA
controller, I cannot reproduce the issue.

If I start with a low-level trim (trim -f /dev/daX), create a raidz1 zfs pool
with four 1TB WD disks, import a dataset of about 280GB (compressed) that has
many files (20+ million), do a zfs send original pool | zfs recv copy-of-pool,
then zfs destroy copy-of-pool, and repeat about 4 or 5 times, the drives in the
pool will start throwing errors.

If I do a hard trim of the disks, I can start from scratch and again get 4 or 5
cycles before the errors.  Hence, it feels like a broken TRIM issue.
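
The cycle described above could be scripted roughly as follows. This is only a
sketch under stated assumptions, not the exact commands used for this report:
the pool name, dataset name, and device names are hypothetical placeholders,
the copy is assumed to live in the same pool, and the per-cycle snapshot is
added only because zfs send needs one. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the copy/destroy cycle. All names are placeholders.
POOL=${POOL:-testpool}              # hypothetical pool name
SRC=${SRC:-$POOL/orig}              # hypothetical dataset holding the imported files
DISKS=${DISKS:-"da3 da4 da5 da6"}   # hypothetical device names for the four SSDs
CYCLES=${CYCLES:-5}                 # errors showed up after about 4 or 5 cycles
DRY_RUN=${DRY_RUN:-1}               # 1 = only print the commands, do not run them

# Print a command line and, unless DRY_RUN=1, execute it.
run() {
    echo "+ $1"
    [ "$DRY_RUN" = 1 ] || sh -c "$1"
}

# Low-level trim of every disk, then build the raidz1 pool.
prepare() {
    for d in $DISKS; do
        run "trim -f /dev/$d"
    done
    run "zpool create $POOL raidz1 $DISKS"
}

# Copy the dataset inside the pool, destroy the copy, repeat.
cycle() {
    i=1
    while [ "$i" -le "$CYCLES" ]; do
        # zfs send needs a snapshot; taking one per cycle is an assumption.
        run "zfs snapshot -r $SRC@cycle$i"
        run "zfs send -R $SRC@cycle$i | zfs recv $POOL/copy"
        run "zfs destroy -r $POOL/copy"
        run "zfs destroy -r $SRC@cycle$i"
        i=$((i + 1))
    done
}
```

Calling prepare, importing the data into the source dataset, and then calling
cycle prints the sequence. With DRY_RUN=0 the commands are actually executed,
and trim -f and zpool create will destroy whatever is on the listed disks.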

I tried with autotrim on and off, and with a manual zpool trim <pool> between
the zfs send | zfs recv tests, to no avail.  When the disks are on the mpr
controller, I get errors such as:
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ae 28 00 00 08 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 0c cb 3f 00 00 00 e8 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ad 28 00 01 00 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ac 28 00 00 f8 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 40 07 df 88 00 01 00 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 3f 48 72 08 00 01 00 00
(da6:mpr0:0:16:0): CAM status: SCSI Status Error
(da6:mpr0:0:16:0): SCSI status: Check Condition
(da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mpr0:0:16:0): Retrying command (per sense data)
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2036 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 637 loginfo 31110f00
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 42 00 00 01 00 00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1242 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 979 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1243 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2091 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1612 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2093 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 152 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2132 loginfo 31110f00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 43 17 dc 88 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 43 00 00 00 50 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 0c d4 f6 80 00 00 68 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 0c d4 f5 80 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 05 dc 12 28 00 00 f8 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 05 dc 0f b0 00 00 88 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 02 96 7e 80 00 00 10 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 6f 5b 8d 68 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 42 00 00 01 00 00
(da5:mpr0:0:15:0): CAM status: SCSI Status Error
(da5:mpr0:0:15:0): SCSI status: Check Condition
(da5:mpr0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da5:mpr0:0:15:0): Retrying command (per sense data)
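
To see where on the disk the failing commands land, the CDB bytes in the log
can be decoded by hand. The helper below is a small sketch (decode_rw10 is a
name made up for illustration) that relies on the standard 10-byte READ/WRITE
layout: opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in
blocks in bytes 7-8.

```shell
#!/bin/sh
# Decode a 10-byte READ(10)/WRITE(10) CDB as printed by CAM.
decode_rw10() {
    # Split the space-separated hex bytes into positional parameters.
    set -- $1
    if [ "$#" -ne 10 ]; then
        echo "expected 10 CDB bytes, got $#" >&2
        return 1
    fi
    case "$1" in
        28)    op="READ(10)" ;;
        2a|2A) op="WRITE(10)" ;;
        *)     echo "unsupported opcode $1" >&2; return 1 ;;
    esac
    # $3..$6 are CDB bytes 2-5 (LBA); $8 and $9 are bytes 7-8 (length).
    lba=$((0x$3$4$5$6))
    blocks=$((0x$8$9))
    echo "$op lba=$lba blocks=$blocks"
}

# First failing command from the log above.
decode_rw10 "28 00 6d e0 ae 28 00 00 08 00"
# prints: READ(10) lba=1843441192 blocks=8
```

The same call on the other CDBs in the log shows that both reads and writes
fail at widely separated LBAs, not in one region of the disk.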

The same tests with Samsung disks work without issue, or at least I was not
able to reproduce the error.

# mprutil show adapter
mpr0 Adapter:
       Board Name: INSPUR 3008IT
   Board Assembly: INSPUR
        Chip Name: LSISAS3008
    Chip Revision: ALL
    BIOS Revision: 18.00.00.00
Firmware Revision: 16.00.12.00
  Integrated RAID: no
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (8.0 GB/sec)
        IOC Speed: Full
      Temperature: 56 C


I originally ran into this problem with the same series of LSI adapter, which
was not in IT mode and was instead using the mrsas driver.

When on the onboard SATA controller, the disks use the DSM_TRIM delete method.
When on mpr, they use ATA_TRIM.
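
For anyone comparing setups, the delete method in use can be read from the CAM
sysctl tree. The sketch below assumes the usual kern.cam.da.N.delete_method and
kern.cam.ada.N.delete_method OIDs; the function names are made up for
illustration and the device names in the usage note are examples.

```shell
#!/bin/sh
# Map a CAM peripheral name such as da6 or ada0 to its delete_method sysctl.
delete_method_oid() {
    case "$1" in
        ada[0-9]*) echo "kern.cam.ada.${1#ada}.delete_method" ;;
        da[0-9]*)  echo "kern.cam.da.${1#da}.delete_method" ;;
        *)         echo "unsupported device: $1" >&2; return 1 ;;
    esac
}

# Print the delete method for each device name given as an argument.
show_delete_methods() {
    for dev in "$@"; do
        oid=$(delete_method_oid "$dev") || continue
        sysctl "$oid"
    done
}
```

For example, show_delete_methods da5 da6 would query the two disks that appear
in the log above while they are attached to mpr0.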
