ad0 READ_DMA TIMEOUT errors on install of 7.0-RELEASE

Wed Feb 27 19:05:09 UTC 2008

On Wed, Feb 27, 2008 at 10:32:48AM -0800, Stephen Hurd wrote:
> Booting the 6.3-RELEASE CD seems to make the problem go away... possibly 
> 7.0 stresses the HD more?

We don't know.  The author of the ATA subsystem is somewhat MIA, likely
busy with real-life things (jobs, etc.).  My main point was that you're
not alone with DMA timeouts and other oddities, but the reallocated
sector count being non-zero doesn't permit me to say "Yeah, you're
experiencing what others are".

>>> SMART Attributes Data Structure revision number: 16
>>> Vendor Specific SMART Attributes with Thresholds:
>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED 
>>>  WHEN_FAILED RAW_VALUE
>>>   5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always  
>>>      -       4
>>
>> This shows you've had 4 reallocated sectors, meaning your disk does in
>> fact have bad blocks.  In 90% of the cases out there, bad blocks
>> continue to "grow" over time, due to whatever reason (I remember reading
>> an article explaining it, but I can't for the life of me find the URL).
>
> This is unusual now?  I've always "known" that a small number of bad blocks 
> is normal.  Time to readjust my knowledge again?

This isn't normal.  The realloc sector count in SMART, when a disk comes
out of the factory, is zero.  That number increases only when new
defects are found, and when those sectors are remapped to spares which
are available (there is a limited number of spares).  This is also
called a "grown defect list".
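
As a rough illustration of reading that counter, here is a minimal Python
sketch that splits the attribute row quoted above (rejoined onto one line)
and checks whether the raw value is non-zero.  The helper name
parse_smart_attribute and the field positions are my own assumptions based
on the column header shown; this is not how smartctl itself parses anything.

```python
def parse_smart_attribute(line):
    """Split one smartctl attribute row into named fields.

    Assumes the whitespace-separated column order from the header:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": fields[9],
    }


# The row quoted earlier in this thread, rejoined onto a single line.
row = ("  5 Reallocated_Sector_Ct   0x0033   253   253   063"
       "    Pre-fail  Always       -       4")

attr = parse_smart_attribute(row)
reallocated = int(attr["raw"])

# A factory-fresh disk reports zero here; anything above zero means the
# drive has remapped sectors to spares (the grown defect list).
has_grown_defects = reallocated > 0
print(attr["name"], reallocated, has_grown_defects)
```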

This isn't to be confused with what's called a "physical defect list",
which is the list of sectors/LBAs known to be bad straight out of the
factory.  On ATA disks, the manufacturer stores the list in the drive
and it's not modifiable via formatting or even a BIOS-based format (e.g.
a SATA RAID controller); some vendors do implement "low level
formatting" via undocumented ATA commands, which can erase that list,
but that's beside the point.  On SCSI disks, the physical defect list is
readable and also erasable via a low-level format, but SCSI disks also
have a grown defect list which is separate.

What I'm trying to say is that your disk already has 4 bad blocks that
the disk firmware itself is aware of, which means chances are there are
others which it hasn't figured out.  A high count of ECC-corrected
errors could indicate that as well.

>>> 194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always  
>>>      -       48
>>
>> This is excessive, and may be contributing to problems.  A hard disk
>> running at 48C is not a good sign.  This should really be somewhere
>> between high 20s and mid 30s.
>
> Yeah, this is a known problem with this drive... it's been running hot for 
> years.  I always figured it was due to the rotational speed increase in 
> commodity drives.

7200rpm disks shouldn't be running at 48C.  None of my 7200rpm disks, in
my barely-cooled FreeBSD box at home (i.e. two 1100rpm fans and that's
it), get anywhere near that.  36C is the highest they've seen -- and
there are 4 stacked right on top of one another.

Heck, on my disks, the SMART warning threshold (set by the manufacturer,
which is Western Digital) is 45C.

10krpm disks probably run hotter, but are not commodity.

>>> Error 2 occurred at disk power-on lifetime: 5171 hours (215 days + 11 
>>> hours)
>>>   When the command that caused the error occurred, the device was in an 
>>> unknown state.
>>> Error 1 occurred at disk power-on lifetime: 5171 hours (215 days + 11 
>>> hours)
>>>   When the command that caused the error occurred, the device was in an 
>>> unknown state.
>>
>> These are automated SMART log entries confirming the DMA failures.  The
>> fact that SMART saw them means that the disk is also aware of said
>> issues.  These may have been caused by the reallocated sectors.  It's
>> also interesting that the LBAs are different than the ones FreeBSD
>> reported issues with.
>
> If that power on lifetime is accurate, that was at least a year ago... but 
> I can't find any documentation as to when the power-on lifetime wraps or 
> what it actually indicates.  I'm assuming that it is total power on time 
> since the drive was manufactured.

Correct: it indicates how many hours the drive itself has been powered
on as an aggregate total.  E.g. if powered on for 48 hours, then shut
off for 3 hours, then powered on for another 7, the stat would read 55
hours.
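
The same arithmetic as a small Python sketch, using the numbers from this
thread.  The function names are made up for illustration; the only inputs
are the 48/3/7 hour example above and the 5171-hour figure from the SMART
error log.

```python
def total_power_on_hours(powered_sessions):
    """Sum only the hours the drive was powered; off time is not counted."""
    return sum(powered_sessions)


def hours_to_days_hours(hours):
    """Convert an hour count into (whole days, leftover hours)."""
    return divmod(hours, 24)


# On for 48 hours, off for 3 (ignored), on for another 7.
total = total_power_on_hours([48, 7])

# The lifetime reported alongside the SMART error log entries.
days, leftover = hours_to_days_hours(5171)

print(total)
print(f"{days} days + {leftover} hours")
```

That matches the "215 days + 11 hours" smartctl printed next to the 5171
hour figure.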

> If it's total hours as a 16-bit integer, it shouldn't wrap.  Is there a way
> of getting the "current" power-on lifetime value that you're aware of?

I would have to go look at the SMART extension to ATA/SATA and find out
how large the counter is.  It probably varies from vendor to vendor too,
as SMART, despite being a standard, has a lot of "loose ends" in the
specification which vendors take advantage of.
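
Purely as back-of-the-envelope arithmetic, and assuming widths I have not
verified against any spec or vendor documentation, this Python sketch
shows how long an unsigned counter of a given size would last before
wrapping.  A 16-bit counter of hours tops out at 65535, about 7.5 years of
powered-on time; a 16-bit counter of minutes would top out at roughly 1092
hours.  None of this says what your drive actually does.

```python
HOURS_PER_YEAR = 24 * 365


def max_count(bits):
    """Largest value an unsigned counter of the given width can hold."""
    return 2 ** bits - 1


# Assumed widths only; the real width may well be vendor-specific.
hours_16 = max_count(16)                  # counter stored in hours
years_16 = hours_16 / HOURS_PER_YEAR
minutes_16_as_hours = max_count(16) / 60  # counter stored in minutes

print(f"16-bit hour counter: {hours_16} hours, about {years_16:.1f} years")
print(f"16-bit minute counter: about {minutes_16_as_hours:.0f} hours")
print(f"5171 hours fits in 16 bits: {5171 <= hours_16}")
```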

> That power on minutes is interesting, but its current value is lower
> than the value at the error (but higher than the power uptime of the
> system):
>  9 Power_On_Minutes        0x0032   219   219   000    Old_age   Always     
>   -       1061h+40m

smartctl contains an internal database of what attributes map to what
drive model (that's what the "In smartctl database" message is about).
smartctl believes that your Maxtor disk stores the number of powered on
*minutes* in attribute 9, while other vendors store the number of
*hours* in attribute 9.  The smartctl(8) manpage outlines some of the
"one-offs" that are required to make smartctl show such counters
correctly, as they vary from vendor to vendor.  Look at the -v N,OPTION
flag.  You might consider trying '9,raw48' for that attribute, to get it
to print the raw values.  Interpreting these values should really be
punted to the smartmontools-users list, though.  Bruce can probably help.
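
To make the hours-versus-minutes difference concrete, here is a short
Python sketch that converts between a raw minute count and the "h+m" form
shown in your output.  It assumes the raw value of attribute 9 really is a
plain count of minutes on this drive, which is exactly the part that needs
confirming; the function names and the zero-padded minutes format are my
own choices, not taken from smartctl's source.

```python
def parse_hours_minutes(text):
    """Turn a string like '1061h+40m' into a total number of minutes."""
    hours, minutes = text.rstrip("m").split("h+")
    return int(hours) * 60 + int(minutes)


def format_minutes(raw_minutes):
    """Turn a raw minute count back into the 'NNNNh+MMm' form."""
    hours, minutes = divmod(raw_minutes, 60)
    return f"{hours}h+{minutes:02d}m"


# The value reported for attribute 9 in this thread.
raw = parse_hours_minutes("1061h+40m")
print(raw, format_minutes(raw))

# Compare against the 5171 hours recorded with the SMART error log entries.
print(raw // 60 < 5171)
```

If the attribute is read as minutes, the current value works out to far
fewer hours than the 5171 recorded at the time of the errors, which is the
discrepancy you noticed.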

> Also interesting is that after getting more errors from FreeBSD, I did not 
> get more errors in smartctl.

Right, which goes back to what I said, re: this could indeed be a
FreeBSD issue, since others are reporting DMA timeouts with drives and
controllers that are guaranteed to be functional/working.

>> My advice to you is: replace the disk ASAP.  This problem will only get
>> worse.  Try another hard disk brand too (I don't have anything "against"
>> Maxtor, but usually it's recommended to avoid a brand you have problems
>> with until the next time you have issues, then switch brands, etc.
>> etc...).  I'm very fond of Western Digital's SE16, RE, and RE2 series
>> currently.  But avoid Fujitsu and Samsung (both have a long track record
>> of having buggy drive firmwares, forcing vendors to make custom
>> workarounds for issues); stick with Seagate, Western Digital, or Maxtor.
>
> Yeah, that's my plan... but I wanted to stake out some whining rights in 
> advance so I can do the "But you said it was a bad HD or cable!  Now I'm 
> out $x00 and my system still doesn't work!  Help me or I switch to 
> DragonFly BSD/Desktop BSD/Linux which is perfect and has no problems!" 
> thing.  Then go on Slashdot and post long rambling messages about how 
> FreeBSD is dead and it doesn't matter that the manpages on any given Linux 
> box are useless.

Heh.  :-)  Well, it's all about troubleshooting I suppose.  There's no
guaranteed way to pinpoint what piece is responsible; that depressing
fact applies to most technology these days.  I can't even trust the term
"transport error" with SCSI mediums in this day and age; is it the
cable, the controller, a controller BIOS bug, bad terminator, or a buggy
OS?  Lots of time and money is required to track it all down.

If you replace the disk and you still continue to see DMA errors, then
my vote would be that you're experiencing the same thing others (and
myself, on one occasion) are.  I've done my best to bring this issue to
the attention of proper people in recent days, and that's all I can say
on the matter.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |