Help debugging DMA_READ errors

Tue Sep 16 18:54:03 UTC 2008

On Tue, Sep 16, 2008 at 11:19:03AM -0700, Clint Olsen wrote:
> On Sep 16, Jeremy Chadwick wrote:
> > acd0 is a CD/DVD drive.  ad4 is a hard disk.  What exactly were you
> > doing with the system at the time these errors appeared?  Were you using
> > the CD/DVD drive?  Was there a disc in the drive that was mounted?
> > If none of these things, I'm baffled as to what would read acd0 and
> > cause what you see here.
> 
> I was not at the system at the time.  I never have had a disk in the drive
> nor is /cdrom mounted currently.  I have dump backups that run in the
> middle of the night on the various filesystems.

That's very strange then.  Something definitely tried to utilise acd0 at
that hour of the night.  What is acd0 connected to, ATA-wise?  Again, I
assume it's PATA, but I'd like to know the primary/secondary and
master/slave organisation, since you are using a PATA disk too.

> > Can you please provide full details of what these disks are connected
> > to?  I'd like to see dmesg output for ata0, ata2, and ata3, as well as
> > the atapci devices those ataX devices are attached to, ditto with
> > vmstat -i output.  Are there any other errors in your logs around
> > that time (e.g. watchdog timeouts of any kind on network devices, etc.?)
>  
> # dmesg | grep -i ata
> atapci0: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel ICH5 SATA150 controller> port 0xeff0-0xeff7,0xefe4-0xefe7,0xefa8-0xefaf,0xefe0-0xefe3,0xef60-0xef6f irq 18 at device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1

Looks fine, although I swore ATA controllers listed their IRQs.  atapci0
doesn't appear to have an IRQ associated with it (should be 14 or 15),
so that's a little odd to me.  vmstat -i would help here.

> See attached.

Okay, there are some problems with your disks, but it's going to be
impossible for me to determine if the below problems caused what you
saw.  First, ad0:

> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       7
> 194 Temperature_Celsius     0x0022   111   253   000    Old_age   Always       -       39
> 196 Reallocated_Event_Count 0x0032   193   193   000    Old_age   Always       -       7

You have 7 bad sectors on ad0, and all 7 of those sectors have been
successfully remapped (meaning spare sectors on the disk (there is a
limited/reserved number of them) have been used).

Sector reallocation on ATA and SATA disks happens transparently, meaning
the OS is never told by the disk/controller what's going on behind the
scenes.  However, reallocation can take time -- consider the situation
where a bad sector/block cannot be read or written to.  The disk may
attempt to re-read/write that sector/block a few times before giving up,
marking it bad, and using a reallocated spare.

FreeBSD will begin spitting out DMA errors after 5 full seconds of not
receiving responses to ATA commands.  So, if the above reallocations
caused what you saw, then the reallocations took >5 seconds.

Since you don't have SMART stats for your disks **before** this issue
began, we're not going to be able to determine if the reallocations are
what caused it.  This is why some people run smartd.

That said: I do not see any sign of DMA read/write errors in the SMART
log.  That's good.  The temperature of this drive is also acceptable
(39C), which is also okay.

On to ad4:

> 194 Temperature_Celsius     0x0022   104   253   000    Old_age   Always       -       46
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2

There are two CRC errors reported here, detected during DMA.  These
could have resulted in data corruption on your drive; CRC errors cannot
correct data, it's just a simple way of determining if what data was
sent or received was intact.

However, there's no sign of DMA errors in the SMART log.  I'm not sure
what to make of that; I really would expect there to be some.

Additionally, this disk is running at 46C, which borders on the high end
of warm.  You may want to check cooling or general airflow around the
drive.  I strongly doubt temperature is the problem here though.

Regarding both disks:

Some statistics are "offline", which means those attributes/stats will
not be updated until you run a short/long/offline SMART test.  Do not
let the word "offline" make you think the drive is actually taken
offline -- it isn't.

I would recommend two things:

1) Run "smartctl -t short" on /dev/ad0 and /dev/ad4.  You can safely use
the disks during this time.  After a few minutes (depends on how much
disk I/O is happening; the more I/O, the longer the test takes to
complete), you should see an entry in the SMART self-test log saying
Completed.  Once you see that, you should run smartctl -a on the disk
again, and see if the attributes labelled "Offline" are different than
they were before.

2) Consider running smartd.  I do not normally advocate this, but in
your case, it may be the only way to see which attribute values are
actually changing on you if/when the issue happens again.  Any time a
value changes, it'll be logged via syslog.  You can set up smartd.conf
to ignore certain attributes (e.g. temperature, since that has a
tendency to fluctuate up and down a degree).

If/when this happens again, you should be able to look at your logs and
see what counters have changed.  For example if you see something like
Power_Cycle_Count or Stop_Start_Count increase, you have disks which are
losing power.

Welcome to the pain of debugging disk problems.  :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |