"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Joe Peterson
joe at boulder.swri.edu
Fri Jan 25 12:05:51 PST 2008
Jeremy Chadwick wrote:
> What you've shown is usually the sign of a disk-related problem. It's
> very obvious when it's just one disk reporting DMA errors. You use ZFS,
> so chances are you have more than one disk in a pool/volume -- there's
> no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
> something specific to ad0.
Jeremy, thanks for the response - I have tried to answer all of your
questions below...
In my case, I am using only one disk (ad0) for FreeBSD, and only one
partition on this disk is in my ZFS pool. So, unfortunately, the fact
that only ad0 is listed doesn't tell us whether the problem is specific
to this drive.
> Manufacturers pick very passive (non-aggressive) thresholds for error
> conditions on disks, so disks which are failing very commonly show
> "PASSED" during SMART analysis. To make matters worse, most users I
> know read SMART stats incorrectly (they're easy to misinterpret).
Yep, I am also always skeptical of SMART reports. That's one reason I
am very interested in ZFS. I don't trust the drive to be completely
reliable, and the fact that ZFS does end-to-end data integrity is very
intriguing.
> Can you please provide output of the following:
>
> * smartctl -a /dev/ad0
OK, I've attached this to the end of this email.
> * atacontrol cap ad0
Protocol              ATA/ATAPI revision 7
device model          ST3500630A
serial number         9QG0DG03
firmware revision     3.AAE
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       976773168 sectors
dma supported
overlap not supported
Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Tagged Command Queuing (TCQ)   no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00          208/0xD0
> * atacontrol info <ata0, ata1, etc. -- any controller used by ZFS>
Master: ad0 <ST3500630A/3.AAE> ATA/ATAPI revision 7
Slave: ad1 <ST3160812A/3.AAH> ATA/ATAPI revision 7
(but note that ad1 is not used by FreeBSD)
> * Relevant dmesg output that indicates what kind of ATA controller
> these disks are attached to. Start with output from 'ad0:' and
> work backwards. For example, ad0 on this machine is using an Intel
> ICH6 controller:
> atapci0: <Intel ICH6 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
> ata0: <ATA channel 0> on atapci0
> ad0: 238475MB <WDC WD2500KS-00MJB0 02.01C03> at ata0-master SATA150
atapci0: <Intel ICH4 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ad0: 476940MB <Seagate ST3500630A 3.AAE> at ata0-master UDMA100
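The sizes reported by the different tools can be cross-checked with a little arithmetic. The following is only a sketch: it assumes 512-byte logical sectors (not stated in the output above), and the variable names are made up for illustration.

```python
# Cross-check the sizes reported by atacontrol, dmesg and smartctl.
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

lba28_sectors = 268435455   # "lba supported" from atacontrol cap
lba48_sectors = 976773168   # "lba48 supported" from atacontrol cap

# Total capacity implied by the 48-bit LBA sector count.
capacity_bytes = lba48_sectors * SECTOR_BYTES
capacity_mib = capacity_bytes // (1024 * 1024)

# The 28-bit LBA figure is simply the largest value that mode can address.
lba28_is_limit = lba28_sectors == 2**28 - 1

print(f"{capacity_bytes:,} bytes")
print(f"{capacity_mib} MiB")
print(lba28_is_limit)
```

Under that assumption the byte count agrees with the "User Capacity" line in the attached smartctl output, and the MiB figure agrees with the size in the ad0 dmesg line, so the drive geometry is at least being detected consistently.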
> SMART stats which are labelled "Offline" are only updated when a short
> or long offline test is performed. Have you tried using "smartctl -t
> short /dev/ad0" and "smartctl -t long /dev/ad0" to see if any of the raw
> values on the far right column increment?
I just tried one:
# 1  Short offline       Completed without error       00%      5252         -
# 2  Short offline       Completed without error       00%      5252         -
Also, none of the numbers that were zero incremented, esp:
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
Also, no more errors were reported in the system log during the self-tests.
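Comparing the attribute table by eye before and after a self-test is easy to get wrong, so here is a minimal Python sketch of how the comparison could be automated. It is not part of smartmontools; the names `SmartAttr`, `parse_attributes`, `changed_raw` and `at_or_below_threshold` are hypothetical, and the parser assumes the ten-column layout of the attached `smartctl -a` output. The sample rows are copied from that attachment.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalized value (higher is better)
    worst: int
    thresh: int
    raw: str      # raw value kept as text; its meaning is vendor specific


def parse_attributes(text):
    """Return {id: SmartAttr} for every attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = SmartAttr(
            int(fields[0]), fields[1], int(fields[3]), int(fields[4]),
            int(fields[5]), " ".join(fields[9:]))
    return attrs


def changed_raw(before, after):
    """Map attribute ID to (old_raw, new_raw) where the raw value differs."""
    return {i: (before[i].raw, after[i].raw)
            for i in before if i in after and before[i].raw != after[i].raw}


def at_or_below_threshold(attrs):
    """Attributes whose normalized VALUE has reached a non-zero THRESH."""
    return [a for a in attrs.values() if a.thresh > 0 and a.value <= a.thresh]


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
194 Temperature_Celsius     0x0022   035   044   000    Old_age   Always       -       35 (Lifetime Min/Max 0/15)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

snapshot = parse_attributes(SAMPLE)
print(at_or_below_threshold(snapshot))   # attributes SMART would flag
print(changed_raw(snapshot, snapshot))   # identical snapshots: no changes
```

In practice one would save the `smartctl -a` output to a file before and after each self-test and feed both files to `parse_attributes`. Note that the threshold check compares the normalized VALUE column, not the raw column, which is one of the places where the table is commonly misread.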
> Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
> to see if READ/WRITE/CHKSUM counters increment or if the "scrub" line
> states there were errors?
OK, I started a scrub, and it will take some more time to complete...
But I get the following with status. Could this be due to the timeouts
and failures? I suspect so, so maybe this is not surprising. I'd also
guess that this doesn't necessarily point to the drive, but to anything in
the chain of events... I do not have a mirror or RAID-Z, so I guess the
reason there was "no data loss" (yet) is because the checksum passed,
and maybe it just had to retry...? Anyway, here's the output so far:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     0
          ad0s1d    ONLINE       1     3     0
errors: No known data errors
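To watch whether the READ/WRITE/CKSUM counters keep climbing while the scrub runs, the config table could be pulled out of the status text with a few lines of Python. This is only a sketch: `parse_zpool_counters` and `devices_with_errors` are hypothetical helper names, and the parser assumes the counters are printed as plain integers, as in the output above.

```python
def parse_zpool_counters(status_text):
    """Return {device_name: (read, write, cksum)} from 'zpool status' text."""
    counters = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts right after the NAME/STATE header row.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if not in_config:
            continue
        # Device rows look like: NAME STATE READ WRITE CKSUM
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counters[fields[0]] = tuple(int(f) for f in fields[2:5])
    return counters


def devices_with_errors(counters):
    """Names of pool members with at least one non-zero counter."""
    return [name for name, counts in counters.items() if any(counts)]


STATUS = """\
  pool: tank
 state: ONLINE
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     0
          ad0s1d    ONLINE       1     3     0
errors: No known data errors
"""

counters = parse_zpool_counters(STATUS)
print(counters)
print(devices_with_errors(counters))
```

Running this against saved `zpool status` output at intervals and comparing the dictionaries would show whether new read or write errors appear during the scrub, or whether the counts stay at the values left by the earlier timeouts.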
> Other things which have fixed problems in the past for others:
>
> * BIOS updates
> * Change of motherboards (sometimes replacing board with same model,
> other times going with a completely different vendor (implies weird
> implementation issues or BIOS problems))
I've been using this same motherboard/BIOS for a long time (as well as
this drive), so no changes have happened to the HW recently. The BIOS
is the newest available, I believe (it's a Tyan Trinity S2099, so it's
a few years old).
> * Changing SATA cables
I'm using regular 80-conductor ATA cables. Also, these seem to have been
working fine for quite a while now. But, yes, I have also witnessed bad
cable issues on older systems in the past. I certainly could try a new
cable and see if it helps.
> * Getting a larger power supply (usually when lots of disk are involved)
I only have two drives, so I think the PS has enough capacity in my case.
Anyway, thanks for the reply and further questions. Let me know if
anything I've sent back is helpful!
Thanks, Joe
-------------- next part --------------
smartctl version 5.37 [i386-portbld-freebsd7.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10 family
Device Model: ST3500630A
Serial Number: 9QG0DG03
Firmware Version: 3.AAE
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Fri Jan 25 09:55:13 2008 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 163) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       56
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       5250
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   065   056   045    Old_age   Always       -       605749283
194 Temperature_Celsius     0x0022   035   044   000    Old_age   Always       -       35 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.