10.1 RC4 r273903 - zpool scrub on ssd mirror - ahci command timeout

Kai Gallasch k at free.de
Wed Nov 5 23:39:29 UTC 2014


Hi.

I am not sure whether this is related to 10.1 or is rather a problem
with the SSD model and/or the AHCI controller.

I am currently running 10.1 RC4 r273903 on a zfs on root server with two
mirror pools. One of the pools is a mirror consisting of two Samsung
SSD 850 PRO 512GB SSDs.

When I start a zpool scrub on this pool, the result is:

# zpool status -v ssdpool
  pool: ssdpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 73K in 0h8m with 0 errors on Thu Nov  6 00:00:16 2014
config:

	NAME              STATE     READ WRITE CKSUM
	ssdpool           ONLINE       0     0     0
	  mirror-0        ONLINE       0     0     0
	    gpt/ssdpool0  ONLINE       0     0    17
	    gpt/ssdpool1  ONLINE       0     0    29
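
To compare the per-device counters from one scrub to the next, the
config table above can be picked apart with a few lines of Python.
This is only a sketch: the function name cksum_errors is made up for
illustration, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM
layout shown above with plain integer counters.

```python
def cksum_errors(status_text):
    """Return {device_name: count} for config rows with a non-zero CKSUM column."""
    errors = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:1] == ["NAME"]:
            in_config = True  # header row of the config table
            continue
        if not in_config or len(fields) < 5:
            continue
        name, cksum = fields[0], fields[4]
        # Abbreviated counters such as "1.2K" are skipped by this sketch.
        if cksum.isdigit() and int(cksum) > 0:
            errors[name] = int(cksum)
    return errors


sample = """\
        NAME              STATE     READ WRITE CKSUM
        ssdpool           ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            gpt/ssdpool0  ONLINE       0     0    17
            gpt/ssdpool1  ONLINE       0     0    29
"""
print(cksum_errors(sample))
```

For the table above this reports checksum errors on both mirror members
(17 and 29) and none at the mirror or pool level.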

When I do a 'zpool clear' the pool status looks ok again. But when I
start another zpool scrub, the same thing happens and the status above
("One or more devices has experienced an unrecoverable error") is
shown again.


I find the following kernel messages in the output of 'dmesg' (after
running zpool scrub two times):


ahcich2: Timeout on slot 15 port 0
ahcich2: is 00000000 cs 000f0000 ss 000f8000 rs 000f8000 tfd 40 serr 00000000 cmd 0024cf17
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 8b a6 1d 56 40 0d 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 23 port 0
ahcich2: is 00000000 cs 0f000000 ss 0f800000 rs 0f800000 tfd 40 serr 00000000 cmd 0024d817
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 1b 23 81 bc 40 06 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
ahcich2: Timeout on slot 3 port 0
ahcich2: is 00000000 cs 00000030 ss 00000038 rs 00000038 tfd 40 serr 00000000 cmd 0024c317
(ada2:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 26 bd 18 8e 40 12 00 00 00 00 00
(ada2:ahcich2:0:0:0): CAM status: Command timeout
(ada2:ahcich2:0:0:0): Retrying command
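
For anyone who wants to see which command slots and which disk blocks
are involved, the hex fields can be decoded roughly as follows. This is
a sketch under assumptions, not something taken from the driver
documentation: it assumes cs, ss and rs are 32-bit masks with one bit
per command slot, and that the 12 ACB bytes are printed in the order
command, features, lba_low, lba_mid, lba_high, device, lba_low_exp,
lba_mid_exp, lba_high_exp, features_exp, sector_count,
sector_count_exp, with the transfer length of an NCQ command carried in
the features fields. The helper names are made up.

```python
def slots(mask_hex):
    """List the slot numbers whose bits are set in a 32-bit hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


def decode_ncq_acb(acb):
    """Return (lba, sector_count) from a 12-byte ACB dump of an NCQ command.

    Assumed byte order: command, features, lba_low, lba_mid, lba_high,
    device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp,
    sector_count, sector_count_exp.
    """
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 ACB bytes")
    # 48-bit LBA: low three bytes, then the three "exp" bytes.
    lba = b[2] | b[3] << 8 | b[4] << 16 | b[6] << 24 | b[7] << 32 | b[8] << 40
    # For FPDMA (NCQ) commands the sector count is in features/features_exp.
    count = b[1] | b[9] << 8
    return lba, count


print(slots("000f8000"))
print(decode_ncq_acb("60 8b a6 1d 56 40 0d 00 00 00 00 00"))
```

Under those assumptions the first timeout is a read of 139 sectors
starting at LBA 223747494, and slot 15 (the one reported as timed out)
is still set in ss and rs together with slots 16 to 19.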


Besides, smartctl shows no errors on ada2.
Here is the output:

# smartctl -a -q noserial /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 10.1-RC4 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 PRO 512GB
Firmware Version: EXM01B6Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov  6 00:02:04 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  33) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       154
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       5
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   068   000    Old_age   Always       -       30
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       400466433

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       147         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
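
The "no error" reading of the attribute table can be double-checked
mechanically. The following is a rough sketch only: the function name
and the set of attribute IDs treated as error counters are my own
illustrative choice (not a smartmontools convention), and the parser
assumes the ten-column attribute layout printed by 'smartctl -a'.

```python
# IDs treated as error counters here; an illustrative, non-exhaustive choice.
ERROR_ATTRS = {5, 181, 182, 183, 187, 199}


def nonzero_error_attrs(table_text, ids=ERROR_ATTRS):
    """Return {attribute_name: raw_value} for selected attributes with raw != 0."""
    found = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in ids and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       154
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
"""
print(nonzero_error_attrs(sample))
```

With the values reported for ada2 the result is empty, which matches
the "No Errors Logged" line in the SMART error log.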

I wonder what the possible reason for this is. Both SSDs are new.
Is this a common problem with ZFS and SSDs (for example, AHCI timeouts
because of data rates that are too high for the bus)?

K.

-- 
PGP-KeyID = 0xE401B671927D4A5C




More information about the freebsd-stable mailing list