Help debugging DMA_READ errors

Wed Sep 17 05:43:33 UTC 2008

On Tue, Sep 16, 2008 at 04:16:55PM -0700, Clint Olsen wrote:
> On Sep 16, Jeremy Chadwick wrote:
> > That's very strange then.  Something definitely tried to utilise acd0 at
> > that hour of the night.  What is acd0 connected to, ATA-wise?  Again, I
> > assume it's PATA, but I'd like to know the primary/secondary and
> > master/slave organisation, since you are using a PATA disk too.
> 
> What's the best way to give you this?  Generally with disks I try to
> separate them from DVD/CD drives, so I don't think they are on the same
> chain.  Is the question whether or not the DVD/CD is a slave to the PATA
> disk?

Correct.  I wanted to see if it was on the same primary or secondary
controller as the ad0 disk which emitted errors.

> acd0: CDRW <Hewlett-Packard DVD Writer 100/1.37> at ata1-master UDMA33

...and it doesn't appear to be.  Taken from your previous mails:

 ad0: 114473MB <WDC WD1200JB-32EVA0 15.05R15> at ata0-master UDMA100
acd0: CDRW <Hewlett-Packard DVD Writer 100/1.37> at ata1-master UDMA33

What this confirms is that there are two separate PATA cables (one for
the ad0 disk, sitting on primary-master on IRQ 14, and one for the acd0
DVD drive, sitting on secondary-master on IRQ 15).
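
For anyone wanting to pull the same device-to-channel layout out of their
own boot messages, here is a rough sketch.  The function name ata_layout
is made up for illustration, and it assumes the probe lines look like the
two quoted above and are still present in the kernel message buffer:

```shell
# Read ATA probe lines on stdin and print "device channel-position" pairs.
# Expected input format (as in the boot messages quoted above):
#   ad0: 114473MB <WDC ...> at ata0-master UDMA100
# ata_layout is a hypothetical helper name, not part of any FreeBSD tool.
ata_layout() {
    sed -n 's/^ *\(ac\{0,1\}d[0-9][0-9]*\):.* at \(ata[0-9][0-9]*-[a-z]*\) .*/\1 \2/p'
}

# Typical use on a live system:
#   dmesg | ata_layout
```

Devices that report different ataN channels are on different cables,
which is the only thing being checked here.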

So that would mean, in the case of "bad cables", you would have *three*
separate cables (2xPATA, 1xSATA) which would all have gone bad at the
same time.  This is highly, highly unlikely.

> > Looks fine, although I swore ATA controllers listed their IRQs.  atapci0
> > doesn't appear to have an IRQ associated with it (should be 14 or 15),
> > so that's a little odd to me.  vmstat -i would help here.
>  
> interrupt                          total       rate
> irq1: atkbd0                          14          0
> irq6: fdc0                             1          0
> irq12: psm0                         1624          0
> irq14: ata0                       410187         14
> irq15: ata1                       225418          7
> irq18: uhci2+                     111881          3
> irq22: skc0                       260062          9
> cpu0: timer                     56551841       1999
> Total                           57561028       2035

IRQ sharing is in effect, despite an APIC being used, but I doubt this
is an interrupt problem.  IRQ 18 is definitely used by the USB controller
(uhci2), and the "+" indicates that at least one more device is
associated with that IRQ.  Piecing together things from previous mails:

 ad0 is on ata0 (which is atapci0, Intel ICH5 UDMA100 controller; IRQ 14)
acd0 is on ata1 (which is atapci0, Intel ICH5 UDMA100 controller; IRQ 15)
 ad4 is on ata2 (which is atapci1, Intel ICH5 SATA150 controller; IRQ 18)
 ad6 is on ata3 (which is atapci1, Intel ICH5 SATA150 controller; IRQ 18)
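
A quick way to list only the shared interrupts from "vmstat -i" output is
sketched below.  It relies on the trailing "+" convention visible in the
output quoted earlier; shared_irqs is an invented helper name, and the
column layout is assumed to match that output:

```shell
# Print the IRQ and first handler name for every interrupt line whose
# handler name ends in "+", i.e. an IRQ with more than one device on it.
# shared_irqs is a hypothetical helper name used only for illustration.
shared_irqs() {
    awk '$1 ~ /^irq[0-9]+:$/ && $2 ~ /\+$/ { sub(/:$/, "", $1); print $1, $2 }'
}

# Typical use on a live system:
#   vmstat -i | shared_irqs
```

This only tells you that an IRQ is shared, not which other devices are on
it; the boot messages are still needed for that.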

> > Okay, there are some problems with your disks, but it's going to be
> > impossible for me to determine if the below problems caused what you saw.
> > First, ad0:
> 
> I just freed up a 300G SATA disk, so I can swap out the PATA drive if you
> think it's worth the effort.

With regards to ad0, it's entirely your call.  I'm pedantic about bad
blocks, even if they've been remapped successfully, but that's just me.
Others are more relaxed about it all.

> > 1) Run "smartctl -t short" on /dev/ad0 and /dev/ad4.  You can safely use
> > the disks during this time.  After a few minutes (depends on how much
> > disk I/O is happening; the more I/O, the longer the test takes to
> > complete), you should see an entry in the SMART self-test log saying
> > Completed.  Once you see that, you should run smartctl -a on the disk
> > again, and see if the attributes labelled "Offline" are different than
> > they were before.
> > 
> > 2) Consider running smartd.  I do not normally advocate this, but in
> > your case, it may be the only way to see which attribute values are
> > actually changing on you if/when the issue happens again.  Any time a
> > value changes, it'll be logged via syslog.  You can set up smartd.conf
> > to ignore certain attributes (e.g. temperature, since that has a
> > tendency to fluctuate up and down a degree).
>  
> I'm looking at that.  The sample conf file that comes with it isn't the
> easiest on the eyes, so I haven't figured out what configuration I want or
> how to set it up yet.

The example configuration is overzealous with comments and badly
formatted, making it difficult to read.  The simple version:

If smartd sees the string DEVICESCAN (before any disk definitions),
it'll simply probe SMART stats periodically for all disks attached at
the time smartd was started.  (If disk definitions are seen first, then
it ignores DEVICESCAN from that point forward.)  The problem with
DEVICESCAN is that you can't give each device its own flags (see below).

Each disk is configured on its own line in the config.  The flags you
can pass do many different things: ignore certain changing attributes
(-I), send a warning mail to an address when smartd detects a problem
(-m), and many other things.  See smartd.conf(5) for the full list.
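
As a starting point, a minimal smartd.conf along those lines might look
like the sketch below.  The device names come from this thread; the mail
address and the choice of which attribute to ignore are assumptions you
should adjust for your own system:

```conf
# Minimal smartd.conf sketch.  No DEVICESCAN line, so each disk can carry
# its own flags.
#
# -a      monitor everything (health, attributes, error and self-test logs)
# -I 194  ignore attribute 194 (Temperature_Celsius) when tracking changes
# -m      address for warning mail (root@localhost is an assumed example)
/dev/ad0 -a -I 194 -m root@localhost
/dev/ad4 -a -I 194 -m root@localhost
```

If a drive also reports temperature under attribute 190, a second
"-I 190" can be added on that drive's line.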

> My external hard drive is running around 50 in that small external
> enclosure.  That sounds bad.
> 
> 190 Airflow_Temperature_Cel 0x0022   050   043   045    Old_age   Always In_the_past 50 (Lifetime Min/Max 32/53)
> 194 Temperature_Celsius     0x0022   050   057   000    Old_age   Always -       50 (0 21 0 0)

I covered this in another mail; yes, the temperature is of concern, but
it's not causing the DMA errors you're seeing on other disks.  :-)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |