"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

Fri Jan 25 13:30:55 PST 2008

On Fri, Jan 25, 2008 at 12:24:20PM -0700, Joe Peterson wrote:
> In my case, I am using only one disk (ad0) for FreeBSD, and I am only
> using one partition on this disk in my ZFS pool.  So, in this case,
> unfortunately, it's not possible to tell from the fact that only ad0 is
> listed that it is specific to this drive.

Ah ha.  Well, in your example below, you may only be using one drive for
FreeBSD (ad0), but you do have a 2nd drive (ad1) which is installed.
I would try doing some I/O on /dev/ad1 to see if you can get the
timeouts to occur on that drive as well.  You don't have to do anything
risky with ad1 either: dd if=/dev/ad1 of=/dev/null bs=64k would probably
suffice.
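
For anyone who would rather script the read test, here is a minimal
Python sketch of the same idea as the dd command above: read the
device from start to end in 64k blocks and throw the data away.  The
function name read_test is made up for illustration, and I'm assuming
a failed read shows up as OSError in the script as well as in the
kernel log.

```python
BLOCK_SIZE = 64 * 1024  # same 64k block size as the dd example


def read_test(path, block_size=BLOCK_SIZE):
    """Read `path` sequentially to the end, discarding the data.

    Returns the number of bytes read.  An I/O error reported by the
    kernel surfaces as OSError and is left to propagate to the caller.
    """
    total = 0
    # buffering=0 gives a raw, unbuffered file object, so each read()
    # call asks the OS for at most block_size bytes at a time.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:  # empty result means end of device/file
                break
            total += len(chunk)
    return total
```

Run as root it would be called as read_test("/dev/ad1").  It only ever
reads, so it is no riskier than the dd invocation.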

> Yep, I am also always skeptical of smart reports.  That's one reason I
> am very interested in ZFS.  I don't trust the drive to be completely
> reliable, and the fact that ZFS does end-to-end data integrity is very
> intriguing.

I agree entirely -- and I also use ZFS myself (across two drives in a
RAID0-like fashion, with a completely separate drive which is used for
nightly backups of the ZFS pool).  I'm absolutely thrilled with it;
finally something clean, reliable, and simple -- something I've always
wanted in an LVM or LVM-like implementation.

> > * smartctl -a /dev/ad0
> 
> OK, I've attached this to the end of this email.
>
> atapci0: <Intel ICH4 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata0: [ITHREAD]
> ad0: 476940MB <Seagate ST3500630A 3.AAE> at ata0-master UDMA100

The smartctl output for /dev/ad0 looks good, minus the one uncorrected
sector.  I'm ignoring that since it's proof that the drive knew of it
and remapped it.  If that number starts incrementing over time, though,
replace the drive ASAP, of course.

The atacontrol cap output looks fine too; nothing wonky, and the LBA
capabilities look fine.

The controller is nothing out-of-the-ordinary; it's reliable under
FreeBSD (I've had many a motherboard which used it).  Of course I
haven't used an ICH4 since FreeBSD 3.x, and the ATA layer has changed
substantially, numerous times.

> {regarding -t short and -t long}
> Also, none of the numbers that were zero incremented, esp:
> 
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 
> Also, no more errors were reported in the system log during the self-tests.

That seems to indicate that the drive considers itself healthy.
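
If you want to watch a counter like attribute 198 over time rather
than eyeballing it, something along these lines could compare two
saved samples.  This is only a sketch: it assumes the ten-column
attribute row layout seen in the quoted smartctl line, and the names
SmartAttr, parse_attr_line and raw_increased are made up for
illustration.

```python
from collections import namedtuple

# Field names follow the column order of the quoted attribute row.
SmartAttr = namedtuple(
    "SmartAttr",
    "attr_id name flag value worst thresh attr_type updated when_failed raw",
)


def parse_attr_line(line):
    """Split one SMART attribute row into its ten columns.

    The raw value is kept as a string, since some attributes report
    raw values that are not plain integers.
    """
    fields = line.split(None, 9)
    if len(fields) != 10 or not fields[0].isdigit():
        raise ValueError("not an attribute row: %r" % line)
    fields[9] = fields[9].strip()
    return SmartAttr(int(fields[0]), *fields[1:])


def raw_increased(before_line, after_line):
    """True if the raw counter grew between two samples of one attribute.

    Only meaningful for attributes whose raw value is a plain integer.
    """
    before = parse_attr_line(before_line)
    after = parse_attr_line(after_line)
    if before.attr_id != after.attr_id:
        raise ValueError("lines describe different attributes")
    return int(after.raw) > int(before.raw)


# The attribute row quoted in this thread, joined onto one line.
sample = ("198 Offline_Uncorrectable   0x0010   100   100   000    "
          "Old_age   Offline      -       0")
attr = parse_attr_line(sample)
print(attr.name, attr.raw)
```

Feeding it the same row from yesterday's and today's smartctl output
tells you whether the count moved.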

Another test I could recommend at this point would be one that would
require a few hours of downtime: download Seagate's SeaTools (will
require a CD burner or floppies) and consider doing both "quick" and
"long" scans.  "Quick" checks some of the stuff we've looked at here,
but it also looks at some vendor-specific stuff within the drive.
"Long" will scan every block on the disk for errors (and will not
destroy data).

> OK, I started a scrub, and it will take some more time to complete...
> But I get the following with status.  Could this be due to the timeouts
> and failures?  I suspect so, so maybe this is not surprising.

It depends on whether or not you saw more timeouts and cache errors spit
out by the kernel while "zpool scrub" ran.  If so, then yes, I would
definitely say they're related.

> I'd also guess that this doesn't necessarily point to the drive, but
> anything in the chain of events...  I do not have a mirror or RAID-Z,
> so I guess the reason there was "no data loss" (yet) is because the
> checksum passed, and maybe it just had to retry...?

I'm still new to ZFS myself, so I don't have an answer for you.  Your
conclusion is the same thing I'd conclude, though.

> I've been using this same motherboard/BIOS for a long time (as well as
> this drive), so no changes have happened to the HW recently.  The BIOS
> is the newest available, I believe (It's a Tyan Trinity S2099, so it's
> a few years old)

I'd say the BIOS is probably not responsible at this point; I'd expect
other weird things to be going on with the system if the BIOS was broken
in some way (or possibly bit rot in the flash).

It's going to be difficult to determine if maybe something on the
mainboard has decided to start failing (some transistor within the ICH4,
etc...) though.  :-(

> I'm using regular 80-conductor ATA cables.  Also, these seem to have
> been working fine for quite a while now.  But, yes, I have also
> witnessed bad cable issues on older systems in the past.  I certainly
> could try a new cable and see if it helps.

I'd try that for sure.  It's just one more thing to rule out.

> > * Getting a larger power supply (usually when lots of disks are involved)
> 
> I only have two drives, so I think the PS has enough capacity in my case.

Agreed; even a 350W PSU should handle 2 disks without a problem.

Here's something to ponder:

The LBAs being reported as having errors are scattered all over.  They
aren't lumped together (usually the sign of part of a platter going
bad); instead, they're all over the drive.

This would indicate either cable problems, motherboard/southbridge
problems, or possibly something on the drive PCB itself going bad.  A
drive PCB going bad is a sad reality -- but sometimes you can replace
it with the PCB from a spare drive that's known to be good; in most
cases all you need is a Torx screwdriver.  I've seen a lot of old
Seagate SCSI drives which started exhibiting random I/O errors that
were fixed simply by the PCB being replaced.  Bad cache/RAM on the PCB
is my guess.
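
On the "scattered vs. lumped together" point, here is a rough way to
check a list of failing LBAs pulled from the kernel log.  Treat it as
a crude heuristic sketch, not a diagnostic tool: the function names,
the 2048-sector gap and the sample LBA values are all made up for
illustration and are not taken from your logs.

```python
def cluster_lbas(lbas, max_gap=2048):
    """Group LBAs into runs whose sorted neighbours are <= max_gap apart."""
    clusters = []
    for lba in sorted(lbas):
        if clusters and lba - clusters[-1][-1] <= max_gap:
            clusters[-1].append(lba)   # close to the previous error
        else:
            clusters.append([lba])     # starts a new, separate region
    return clusters


def looks_scattered(lbas, max_gap=2048):
    """Crude test: scattered if more than half the errors stand alone."""
    clusters = cluster_lbas(lbas, max_gap)
    return len(clusters) > len(lbas) / 2


# Hypothetical values only: errors spread across the whole disk...
scattered = [1200, 88000000, 310500000, 702000123, 941000000]
# ...versus errors packed into one small region.
clustered = [500000000, 500000128, 500000900, 500001500, 500002000]

print(looks_scattered(scattered), looks_scattered(clustered))
```

Many small, widely separated clusters fit the cable/controller/PCB
theory; one or two dense clusters would point more toward a bad area
of a platter.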

There's no sign of your drive actually spinning down or powering down in
any way, so that's ruled out too.  (As you probably know, some drives
will actually reset themselves and re-spin up when encountering errors
where the drive gets "stuck" or is wedged in some way.  I don't know if
this is a watchdog on the drive, or if an error condition just causes
the drive to reset.)

My recommendation would be to, in this order:

* Replace the 80-conductor ATA cable and see if the errors continue.

* Download SeaTools and let it do both quick and long scans.
  If the problem happens during either scan, then it's safe to say it's
  either a drive or MB/controller problem and FreeBSD isn't the problem.

* Worst-case scenario: purchase an identical drive and see if the
  problem continues with the new drive.  That would rule out the disk
  being the problem.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |