Help debugging DMA_READ errors
koitsu at FreeBSD.org
Tue Sep 16 18:54:03 UTC 2008
On Tue, Sep 16, 2008 at 11:19:03AM -0700, Clint Olsen wrote:
> On Sep 16, Jeremy Chadwick wrote:
> > acd0 is a CD/DVD drive. ad4 is a hard disk. What exactly were you
> > doing with the system at the time these errors appeared? Were you using
> > the CD/DVD drive? Was there a disc in the drive that was mounted?
> > If none of these things, I'm baffled as to what would read acd0 and
> > cause what you see here.
> I was not at the system at the time. I never have had a disk in the drive
> nor is /cdrom mounted currently. I have dump backups that run in the
> middle of the night on the various filesystems.
That's very strange then. Something definitely tried to utilise acd0 at
that hour of the night. What is acd0 connected to, ATA-wise? Again, I
assume it's PATA, but I'd like to know the primary/secondary and
master/slave organisation, since you are using a PATA disk too.
> > Can you please provide full details of what these disks are connected
> > to? I'd like to see dmesg output for ata0, ata2, and ata3, as well as
> > the atapci devices those ataX devices are attached to, ditto with
> > vmstat -i output. Are there any other errors in your logs around
> > that time (e.g. watchdog timeouts of any kind on network devices, etc.?)
> # dmesg | grep -i ata
> atapci0: <Intel ICH5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
> ata0: <ATA channel 0> on atapci0
> ata1: <ATA channel 1> on atapci0
> atapci1: <Intel ICH5 SATA150 controller> port 0xeff0-0xeff7,0xefe4-0xefe7,0xefa8-0xefaf,0xefe0-0xefe3,0xef60-0xef6f irq 18 at device 31.2 on pci0
> ata2: <ATA channel 0> on atapci1
> ata3: <ATA channel 1> on atapci1
Looks fine, although I swore ATA controllers listed their IRQs. atapci0
doesn't appear to have an IRQ associated with it (should be 14 or 15),
so that's a little odd to me. vmstat -i would help here.
> See attached.
Okay, there are some problems with your disks, but it's going to be
impossible for me to determine if the below problems caused what you
saw. First, ad0:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 7
> 194 Temperature_Celsius 0x0022 111 253 000 Old_age Always - 39
> 196 Reallocated_Event_Count 0x0032 193 193 000 Old_age Always - 7
You have 7 bad sectors on ad0, and all 7 of those sectors have been
successfully remapped (meaning spare sectors on the disk (there is a
limited/reserved number of them) have been used).
Sector reallocation on ATA and SATA disks happens transparently, meaning
the OS is never told by the disk/controller what's going on behind the
scenes. However, reallocation can take time -- consider the situation
where a bad sector/block cannot be read or written to. The disk may
attempt to re-read/write that sector/block a few times before giving up,
marking it bad, and using a reallocated spare.
FreeBSD will begin spitting out DMA errors after 5 full seconds of not
receiving responses to ATA commands. So, if the above reallocations
caused what you saw, then the reallocations took >5 seconds.
Since you don't have SMART stats for your disks **before** this issue
began, we're not going to be able to determine if the reallocations are
what caused it. This is why some people run smartd.
That said: I do not see any sign of DMA read/write errors in the SMART
log. That's good. The temperature of this drive is also acceptable
(39C), which is also okay.
On to ad4:
> 194 Temperature_Celsius 0x0022 104 253 000 Old_age Always - 46
> 199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 2
There are two CRC errors reported here, detected during DMA. These
could have resulted in data corruption on your drive; CRC errors cannot
correct data, it's just a simple way of determining if what data was
sent or received was intact.
However, there's no sign of DMA errors in the SMART log. I'm not sure
what to make of that; I really would expect there to be some.
Additionally, this disk is running at 46C, which borders on the high end
of warm. You may want to check cooling or general airflow around the
drive. I strongly doubt temperature is the problem here though.
Regarding both disks:
Some statistics are "offline", which means those attributes/stats will
not be updated until you run a short/long/offline SMART test. Do not
let the word "offline" make you think the drive is actually taken
offline -- it isn't.
I would recommend two things:
1) Run "smartctl -t short" on /dev/ad0 and /dev/ad4. You can safely use
the disks during this time. After a few minutes (depends on how much
disk I/O is happening; the more I/O, the longer the test takes to
complete), you should see an entry in the SMART self-test log saying
Completed. Once you see that, you should run smartctl -a on the disk
again, and see if the attributes labelled "Offline" are different than
they were before.
2) Consider running smartd. I do not normally advocate this, but in
your case, it may be the only way to see which attribute values are
actually changing on you if/when the issue happens again. Any time a
value changes, it'll be logged via syslog. You can set up smartd.conf
to ignore certain attributes (e.g. temperature, since that has a
tendency to fluctuate up and down a degree).
If/when this happens again, you should be able to look at your logs and
see what counters have changed. For example if you see something like
Power_Cycle_Count or Stop_Start_Count increase, you have disks which are
Welcome to the pain of debugging disk problems. :-)
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
More information about the freebsd-stable