Read / write timeouts on SATA disks connected to ICH9

Pieter de Boer pieter at os3.nl
Fri May 14 21:09:32 UTC 2010


>> My question: does anyone have experience with FreeBSD on a Dell R300
>> or can anyone give me some help in trying to fix the timeouts?
> 
> Could you please do the following:
> 
> - Provide output from "vmstat -i"
> 
> - Provide output from "dmesg | grep -i ata"
> 
> - Install ports/sysutils/smartmontools (5.40 or later) and provide
>   full output from commands "smartctl -a /dev/ad4" and "smartctl -a
>   /dev/ad6"

The ad4 SMART output shows errors, as this disk is indeed broken now.
It wasn't broken before, and it is itself a replacement for another disk
that wasn't broken either. Grmbl, I now see reallocated sectors on ad6 as
well in the smartctl output. So both disks look wonky, although afaik
that's not the main issue here.

I've attached the smartctl output as separate files. smartmontools 5.40 
does not appear to exist; I used 5.39.1, the latest port version.

The vmstat -i and dmesg output are attached as well.

-- 
Pieter
-------------- next part --------------
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.0-RELEASE-p1 i386] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Black family
Device Model:     WDC WD5001AALS-00L3B2
Serial Number:    WD-WCASYA964063
Firmware Version: 01.03B01
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 14 23:01:49 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
					was aborted by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241)	Self-test routine in progress...
					10% of test remaining.
Total time to complete Offline 
data collection: 		 (11160) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 131) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3037)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       78
  3 Spin_Up_Time            0x0027   184   168   021    Pre-fail  Always       -       3791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       992
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       827
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       990
192 Power-Off_Retract_Count 0x0032   199   199   000    Old_age   Always       -       989
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       992
194 Temperature_Celsius     0x0022   125   109   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   198   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
Warning: ATA error count 48 inconsistent with error log pointer 1

ATA Error Count: 48 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 48 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9d 84 0e e0  Error: UNC at LBA = 0x000e849d = 951453

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 80 84 0e 00 00      00:45:18.204  READ DMA
  c8 00 20 60 84 0e 00 00      00:45:18.204  READ DMA
  c8 00 20 40 84 0e 00 00      00:45:18.204  READ DMA
  c8 00 20 20 84 0e 00 00      00:45:18.204  READ DMA
  c8 00 20 00 84 0e 00 00      00:45:18.204  READ DMA

Error 47 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 0c 9d 0e e0  Error: UNC at LBA = 0x000e9d0c = 957708

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 00 9d 0e 00 00      00:03:08.614  READ DMA
  c8 00 80 80 9c 0e 00 00      00:03:08.611  READ DMA
  c8 00 80 00 9c 0e 00 00      00:03:08.610  READ DMA
  c8 00 80 80 9b 0e 00 00      00:03:08.606  READ DMA
  c8 00 80 00 9b 0e 00 00      00:03:08.605  READ DMA

Error 46 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 9d 84 0e e0  Error: UNC at LBA = 0x000e849d = 951453

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 80 84 0e 00 00      00:03:05.179  READ DMA
  c8 00 80 00 84 0e 00 00      00:03:05.178  READ DMA
  c8 00 80 80 83 0e 00 00      00:03:05.177  READ DMA
  c8 00 80 00 83 0e 00 00      00:03:05.177  READ DMA
  c8 00 80 80 82 0e 00 00      00:03:05.176  READ DMA

Error 45 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 20 47 6c e0  Error: UNC at LBA = 0x006c4720 = 7096096

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c4 ff 08 1f 47 6c 00 00      00:01:09.467  READ MULTIPLE
  c4 ff 08 17 47 6c 00 00      00:01:09.465  READ MULTIPLE
  c4 ff 08 0f 47 6c 00 00      00:01:09.463  READ MULTIPLE
  c4 ff 08 07 47 6c 00 00      00:01:09.461  READ MULTIPLE
  c4 ff 08 ff 46 6c 00 00      00:01:09.459  READ MULTIPLE

Error 44 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 21 8e 67 e0  Error: UNC at LBA = 0x00678e21 = 6786593

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c4 ff 08 1f 8e 67 00 00      00:01:00.748  READ MULTIPLE
  c4 ff 08 67 5f 67 00 00      00:01:00.746  READ MULTIPLE
  c4 ff 04 9f 5e 67 00 00      00:01:00.743  READ MULTIPLE
  c4 ff 08 5f 5f 67 00 00      00:01:00.728  READ MULTIPLE
  c4 ff 04 3f 2f 00 00 00      00:01:00.724  READ MULTIPLE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         2         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
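
For readers following the error log above: each "Error: UNC at LBA = ..." value can be reproduced from the register dump on the same line. The sketch below is only an illustration, assuming 28-bit LBA addressing (low nibble of DH, then CH, CL, SN) and that bit 6 of the error register is the UNC flag. The helper names are made up for this example and are not part of smartmontools.

```python
# Hypothetical helpers for illustration; not part of smartmontools.
REGISTER_NAMES = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]


def parse_register_line(line):
    """Parse the first seven hex bytes of an 'ER ST SC SN CL CH DH' line."""
    fields = line.split()[:7]
    return {name: int(value, 16) for name, value in zip(REGISTER_NAMES, fields)}


def lba28_from_registers(regs):
    """Combine DH (low nibble), CH, CL and SN into a 28-bit LBA."""
    return ((regs["DH"] & 0x0F) << 24) | (regs["CH"] << 16) | (regs["CL"] << 8) | regs["SN"]


def is_uncorrectable(regs):
    """Bit 6 (0x40) of the error register is assumed to be the UNC flag."""
    return bool(regs["ER"] & 0x40)


if __name__ == "__main__":
    # Register line copied from Error 48 in the ad4 log above.
    regs = parse_register_line("40 51 00 9d 84 0e e0")
    print(hex(lba28_from_registers(regs)), lba28_from_registers(regs))
    print("UNC:", is_uncorrectable(regs))
```

For the Error 48 line this yields 951453, the same LBA that smartctl prints.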
-------------- next part --------------
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.0-RELEASE-p1 i386] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda ES.2
Device Model:     ST3500320NS
Serial Number:    9QMC8GS0
Firmware Version: SN06
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri May 14 23:01:52 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 650) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 120) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103d)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   044    Pre-fail  Always       -       108618290
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       13
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       50
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       18136475
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       826
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       13
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       131074
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   079   065   045    Old_age   Always       -       21 (Lifetime Min/Max 21/22)
194 Temperature_Celsius     0x0022   021   040   000    Old_age   Always       -       21 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   053   035   000    Old_age   Always       -       108618290
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         2         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
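
A quick way to compare the two attribute tables above is to pull out the raw values and list the sector-health counters that are non-zero. This is a rough sketch, not a diagnostic tool: the choice of watched attributes is an assumption made for this example, the function names are hypothetical, and the parser assumes the raw value starts with a plain integer, as it does in both tables here.

```python
# Attributes worth a second look when non-zero (an assumption for this sketch).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(text):
    """Map attribute name -> leading integer of RAW_VALUE from 'smartctl -a' rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = int(fields[9].split()[0])
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is greater than zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


if __name__ == "__main__":
    # A few rows copied from the ad6 (Seagate) table above.
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       50
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   079   065   045    Old_age   Always       -       21 (Lifetime Min/Max 21/22)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
    print(nonzero_watched(parse_attributes(sample)))
```

Run against the full tables, this would flag Offline_Uncorrectable on ad4 and Reallocated_Sector_Ct and Reported_Uncorrect on ad6, which matches the "both disks look wonky" reading in the message.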
-------------- next part --------------
# vmstat -i
interrupt                          total       rate
irq4: uart0                         1325          0
irq21: uhci0 uhci+                 17806          0
irq23: atapci0                 371021299      10423
cpu0: timer                     71159899       1999
irq256: bge0                     1471004         41
cpu1: timer                     71165128       1999
Total                          514836461      14463
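
To put the irq23 figure in perspective, the counters above can be turned into a share of all interrupts and a rough uptime estimate. This is a sketch under the assumption that the "rate" column is the total divided by seconds since boot, so total / rate approximates uptime. The function names are hypothetical.

```python
def parse_vmstat_i(text):
    """Map interrupt source name -> (total, rate) from 'vmstat -i' output."""
    rows = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # names such as 'irq21: uhci0 uhci+' contain spaces
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # skips the prompt line and the column header
        rows[parts[0].strip()] = (int(parts[1]), int(parts[2]))
    return rows


def share_of_total(rows, name):
    """Fraction of all interrupts attributed to one source."""
    return rows[name][0] / rows["Total"][0]


def estimated_uptime_seconds(rows, name):
    """Approximate uptime as total / rate, assuming rate is an average since boot."""
    total, rate = rows[name]
    return total / rate


if __name__ == "__main__":
    # Lines copied from the vmstat -i attachment above.
    sample = """\
interrupt                          total       rate
irq23: atapci0                 371021299      10423
cpu0: timer                     71159899       1999
Total                          514836461      14463
"""
    rows = parse_vmstat_i(sample)
    print("atapci0 share: %.1f%%" % (100 * share_of_total(rows, "irq23: atapci0")))
    print("uptime estimate: %.1f hours" % (estimated_uptime_seconds(rows, "cpu0: timer") / 3600))
```

On these numbers atapci0 accounts for roughly 72% of all interrupts, averaged over slightly under ten hours of uptime.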
-------------- next part --------------
atapci0: <Intel ICH9 SATA300 controller> port 0xdc20-0xdc27,0xdc10-0xdc13,0xdc28-0xdc2f,0xdc14-0xdc17,0xdc40-0xdc4f,0xdc50-0xdc5f irq 23 at device 31.2 on pci0
atapci0: [ITHREAD]
ata2: <ATA channel 0> on atapci0
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci0
ata3: [ITHREAD]
atapci1: <Intel ICH9 SATA300 controller> port 0xdc30-0xdc37,0xdc18-0xdc1b,0xdc38-0xdc3f,0xdc1c-0xdc1f,0xdc60-0xdc6f,0xdc70-0xdc7f irq 22 at device 31.5 on pci0
atapci1: [ITHREAD]
ata4: <ATA channel 0> on atapci1
ata4: [ITHREAD]
ata5: <ATA channel 1> on atapci1
ata5: [ITHREAD]
ata0 at port 0x1f0-0x1f7,0x3f6 irq 14 on isa0
ata0: [ITHREAD]
ata1 at port 0x170-0x177,0x376 irq 15 on isa0
ata1: [ITHREAD]
ad4: 476940MB <WDC WD5001AALS-00L3B2 01.03B01> at ata2-master SATA300
acd0: DVDROM <TEAC DVD-ROM DV28SV/D.0J> at ata2-slave SATA150
ad6: 476940MB <Seagate ST3500320NS SN06> at ata3-master SATA300


More information about the freebsd-stable mailing list