10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout
Steven Hartland
killing at multiplay.co.uk
Wed Nov 5 23:45:42 UTC 2014
Looks like a HW issue. How is it connected to the controller, and what is
the controller?
On 05/11/2014 23:32, Kai Gallasch wrote:
> Hi.
>
> Not sure if this is 10.1 related or rather a problem with the SSD
> model and/or the AHCI controller.
>
> I am currently running 10.1 RC4 r273903 on a zfs on root server with two
> mirror pools. One of the pools is a mirror consisting of two Samsung
> SSD 850 PRO 512GB SSDs.
>
> When I start a zfs scrub on this pool the result of the scrub is:
>
> # zpool status -v ssdpool
>   pool: ssdpool
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub repaired 73K in 0h8m with 0 errors on Thu Nov  6 00:00:16 2014
> config:
>
>         NAME              STATE     READ WRITE CKSUM
>         ssdpool           ONLINE       0     0     0
>           mirror-0        ONLINE       0     0     0
>             gpt/ssdpool0  ONLINE       0     0    17
>             gpt/ssdpool1  ONLINE       0     0    29
>
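
The per-device counters in the status output above can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the device rows have been copied into a string, and the helper name error_counts is hypothetical. It only handles plain integer counters; zpool can abbreviate large counts (for example with a K suffix), and such rows would be skipped here.

```python
import re

# Device rows copied from the 'zpool status' output in the quoted message.
CONFIG = """\
NAME STATE READ WRITE CKSUM
ssdpool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
gpt/ssdpool0 ONLINE 0 0 17
gpt/ssdpool1 ONLINE 0 0 29
"""

# name, state, then three plain integer counters (READ, WRITE, CKSUM).
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def error_counts(config_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    bad = {}
    for line in config_text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header line, blank line, or abbreviated counter
        name, _state, rd, wr, ck = m.groups()
        counts = (int(rd), int(wr), int(ck))
        if any(counts):
            bad[name] = counts
    return bad


if __name__ == "__main__":
    for name, (rd, wr, ck) in error_counts(CONFIG).items():
        print(f"{name}: read={rd} write={wr} cksum={ck}")
```

For the output quoted above this reports both mirror members, each with checksum errors only and no read or write errors.
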
> After a 'zpool clear' the pool status looks OK again, but the next
> zpool scrub produces the same result and the status "One or more
> devices has experienced an unrecoverable error" comes back.
>
>
> After running zpool scrub twice, I find the following kernel messages
> in the output of 'dmesg':
>
>
> ahcich2: Timeout on slot 15 port 0
> ahcich2: is 00000000 cs 000f0000 ss 000f8000 rs 000f8000 tfd 40 serr 00000000 cmd 0024cf17
> (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 8b a6 1d 56 40 0d 00 00 00 00 00
> (ada2:ahcich2:0:0:0): CAM status: Command timeout
> (ada2:ahcich2:0:0:0): Retrying command
> ahcich2: Timeout on slot 23 port 0
> ahcich2: is 00000000 cs 0f000000 ss 0f800000 rs 0f800000 tfd 40 serr 00000000 cmd 0024d817
> (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 1b 23 81 bc 40 06 00 00 00 00 00
> (ada2:ahcich2:0:0:0): CAM status: Command timeout
> (ada2:ahcich2:0:0:0): Retrying command
> ahcich2: Timeout on slot 3 port 0
> ahcich2: is 00000000 cs 00000030 ss 00000038 rs 00000038 tfd 40 serr 00000000 cmd 0024c317
> (ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 26 bd 18 8e 40 12 00 00 00 00 00
> (ada2:ahcich2:0:0:0): CAM status: Command timeout
> (ada2:ahcich2:0:0:0): Retrying command
>
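
The hex fields in the three timeout reports above are easier to compare once expanded into slot numbers. The sketch below is illustrative only and rests on assumptions that the post itself does not state: that cs, ss and rs are 32-bit per-slot bitmasks (command issue, SATA active, and the driver's running slots) and that the low byte of tfd is the ATA status register. The helper names slots, status_flags and summarize are hypothetical.

```python
# ATA status register bits (bit number -> flag name).
ATA_STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}

# (timed-out slot, cs, ss, rs, tfd) copied from the three dmesg reports.
REPORTS = [
    (15, "000f0000", "000f8000", "000f8000", "40"),
    (23, "0f000000", "0f800000", "0f800000", "40"),
    (3, "00000030", "00000038", "00000038", "40"),
]


def slots(mask_hex):
    """Return the bit positions set in a 32-bit hex mask, lowest first."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


def status_flags(tfd_hex):
    """Return the names of the ATA status flags set in the low byte of tfd."""
    status = int(tfd_hex, 16) & 0xFF
    return [name for bit, name in ATA_STATUS_BITS.items() if status & (1 << bit)]


def summarize(report):
    """Expand one report into slot lists and status flags."""
    slot, cs, ss, rs, tfd = report
    return {
        "slot": slot,
        "cs": slots(cs),
        "ss": slots(ss),
        "rs": slots(rs),
        "status": status_flags(tfd),
        # True when the timed-out slot is still set in ss but clear in cs.
        "in_ss_not_cs": slot in slots(ss) and slot not in slots(cs),
    }


if __name__ == "__main__":
    for report in REPORTS:
        print(summarize(report))
```

Under those assumptions, in all three reports the timed-out slot is the lowest one still set in ss and rs but no longer set in cs, and the status byte shows only DRDY with neither BSY nor ERR.
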
>
> Also, smartctl shows no errors on ada2.
> Here is the output:
>
> # smartctl -a -q noserial /dev/ada2
> smartctl 6.3 2014-07-26 r3976 [FreeBSD 10.1-RC4 amd64] (local build)
> Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Device Model: Samsung SSD 850 PRO 512GB
> Firmware Version: EXM01B6Q
> User Capacity: 512,110,190,592 bytes [512 GB]
> Sector Size: 512 bytes logical/physical
> Rotation Rate: Solid State Device
> Device is: Not in smartctl database [for details use: -P showall]
> ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Thu Nov 6 00:02:04 2014 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x00) Offline data collection activity
>                                         was never started.
>                                         Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 (    0) seconds.
> Offline data collection
> capabilities:                    (0x53) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         No Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        (  33) minutes.
> SCT capabilities:              (0x003d) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 1
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       154
>  12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       5
> 177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
> 179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
> 181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
> 182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
> 183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0032   070   068   000    Old_age   Always       -       30
> 195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 235 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
> 241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       400466433
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%       147         -
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
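
Two values in the smartctl output above lend themselves to a quick calculation: the negotiated SATA link speed and the amount of data written so far. The sketch below is a rough illustration, not part of the original report. It assumes the raw value of attribute 241 counts 512-byte LBAs (matching the reported sector size, but not confirmed for this drive, which smartctl says is not in its database), and the function names are hypothetical.

```python
import re

# Lines and values copied from the smartctl output in the quoted message.
SATA_LINE = "SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)"
SECTOR_BYTES = 512              # "Sector Size: 512 bytes logical/physical"
TOTAL_LBAS_WRITTEN = 400466433  # raw value of attribute 241


def link_speeds(line):
    """Return (max_gbps, current_gbps) from smartctl's 'SATA Version is' line."""
    m = re.search(r"([\d.]+) Gb/s \(current: ([\d.]+) Gb/s\)", line)
    if m is None:
        raise ValueError("no link speed found in line")
    return float(m.group(1)), float(m.group(2))


def bytes_written(lbas, sector_bytes=SECTOR_BYTES):
    """Convert an LBA count to bytes, ASSUMING one LBA is sector_bytes long."""
    return lbas * sector_bytes


if __name__ == "__main__":
    max_gbps, cur_gbps = link_speeds(SATA_LINE)
    print(f"link: {cur_gbps} of {max_gbps} Gb/s")
    print(f"written: {bytes_written(TOTAL_LBAS_WRITTEN) / 1e9:.1f} GB")
```

With the values above, the drive supports 6.0 Gb/s but the link is currently running at 3.0 Gb/s, and the write total comes to roughly 205 GB if the 512-byte assumption holds.
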
> I wonder what the possible reason for this is. Both SSDs are new.
> Is this a common problem with ZFS and SSDs (for example, AHCI timeouts
> because of high data rates on the bus)?
>
> K.
>