SETFEATURES SET TRANSFER MODE taskqueue timeout.. Error occurring constantly.. Please help!!

Sat Oct 18 14:25:45 PDT 2008

On Sun, Oct 19, 2008 at 03:32:29AM +1100, Kristian Rooke wrote:
> Thanks for the quick response!
> 
> Please see requested output below:

Cool, thanks.  One thing I forgot to ask for was "vmstat -i" output.

For now, let's break it down for ease of understanding:

FreeBSD 7.0-RELEASE i386, built February 2008.

atapci0: nVidia nForce MCP73 ATA133 controller -- IRQ 14
atapci1: Silicon Image 0680 ATA133 controller  -- IRQ 16

ata0: attached to atapci0
ata1: attached to atapci0
ata2: attached to atapci1
ata3: attached to atapci1

ad0: <Seagate ST380011A 3.06>   at ata0-master PIO4
ad4: <Seagate ST3320620A 3.AAF> at ata2-master PIO4
ad5: <Seagate ST3320620A 3.AAF> at ata2-slave  PIO4
ad6: <Seagate ST3750640A 3.AAE> at ata3-master PIO4
ad7: <Seagate ST3320620A 3.AAD> at ata3-slave  PIO4

ATA errors are reported for disks ad4, ad5, ad6, and ad7.  ad0 appears
to be error-free.

First and foremost: there are known problems with Silicon Image
controllers on all operating systems (Windows, Linux, and FreeBSD in
particular), including data loss and other sporadic issues.  This is at
least confirmed on their SATA controllers, and I've become quite the
"pick something else" advocate when it comes to their stuff.  However:
I've no idea about their PATA controllers.

Secondly, so far there isn't any evidence that the ad0 disk, which uses
the nVidia controller, has any problem -- all the disks having problems
are on the Silicon Image controller.  That is a very key piece of
information here.

If, while you're writing data to, say, the ad4 disk, you start to see
errors on all disks (ad4 through ad7), then this probably means the
controller has locked up or is behaving badly.  That would be further
evidence that the Silicon Image controller may be at fault here.

Thirdly, you said the system requires a hard reset to get things back in
working order.  Sometimes this can be induced by a power supply that
isn't providing decent/proper voltages, or is being overloaded,
particularly during heavy disk I/O (drawing more power in some cases).
It might be good to check your voltages inside of your system BIOS,
write them down, and type them in here.  FreeBSD does not provide a
decent set of tools for monitoring this stuff inside the OS (yet; I'm
working on it, mainly for server boards.  I do what I can...)

But keep in mind that a controller locking up hard could also require a
hard reset (pressing reset on the front of the PC) -- a soft reset
(Ctrl-Alt-Del) would probably work, except much of the running kernel is
spinning hard trying to deal with ATA problems.

Fourthly, I see a "<some output omitted>" line in your original dmesg.
Can you provide that output?  It's important -- sometimes people have
seen issues where their ATA controller shows problems, but it turns out
to be an IRQ sharing or device compatibility problem with another device
(e.g. their board was showing ATA errors, but at the exact same time,
also showing NIC watchdog timeouts or other anomalies).  They omitted
the dmesg data thinking it had nothing to do with the problem, when in
fact it helps determine if the issue is truly with one piece or the
entire system.
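
If the full dmesg is long, something like the following Python sketch can
help separate ATA complaints from complaints by other devices.  Treat it
as a rough idea only: the regular expressions are my guesses at what is
worth flagging, and the sample text is made up for illustration (it is
NOT from your dmesg; the re0 line in particular is hypothetical).

```python
import re

# Illustrative patterns only; not an exhaustive list of what can go wrong.
ATA_DEVICE = re.compile(r"^(ad|ata|atapci)\d+:")
TROUBLE = re.compile(r"timeout|error|interrupt storm", re.IGNORECASE)


def split_trouble_lines(dmesg_text):
    """Split suspicious dmesg lines into (ata_lines, other_lines)."""
    ata_lines, other_lines = [], []
    for line in dmesg_text.splitlines():
        if not TROUBLE.search(line):
            continue  # nothing alarming on this line
        if ATA_DEVICE.match(line):
            ata_lines.append(line)
        else:
            other_lines.append(line)
    return ata_lines, other_lines


# Made-up sample for illustration only.
SAMPLE = """\
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad0: <Seagate ST380011A 3.06> at ata0-master PIO4
"""

if __name__ == "__main__":
    ata, other = split_trouble_lines(SAMPLE)
    print("ATA:", ata)
    print("non-ATA:", other)
```

If the "non-ATA" list is not empty, the problem is probably bigger than
the disk controller.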

Next, let's take a look at your SMART output, which tells a tale of
something very very bad:

Disk ad4 has a good temperature, and no sign of bad blocks/sectors.  The
disk has been powered on for a total of 7799 hours.

There was a CRC error detected when attempting to set specific
capabilities on the device.  The error occurred at LBA 0 on the disk,
which is completely bizarre, but the SMART error log might just say LBA
0 to indicate "no LBA was being accessed" (e.g. the error was purely
during the mode setting attempts).  However, the SMART error log "wraps"
its timestamps at 49.710 days (every 1193.046 hours), so it's going to
be difficult to determine if the SMART error log entry was from long
ago, or was fairly recent.  Looking at other disks might help, so let's
continue.
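
For the curious, 49.710 days is exactly what you get from a 32-bit
millisecond counter rolling over.  The Python sketch below assumes that
is what is going on (the 32-bit counter is my assumption, and the
function names are made up); it also shows why two very different
timestamps end up indistinguishable in the log.

```python
WRAP_MS = 2 ** 32  # assumption: the timestamp is a 32-bit millisecond counter


def ms_to_hours(ms):
    return ms / (1000 * 60 * 60)


def ms_to_days(ms):
    return ms_to_hours(ms) / 24


def logged_timestamp(elapsed_ms):
    """What a wrapping 32-bit millisecond counter would record."""
    return elapsed_ms % WRAP_MS


if __name__ == "__main__":
    print(f"wrap period: {ms_to_days(WRAP_MS):.3f} days")    # 49.710 days
    print(f"wrap period: {ms_to_hours(WRAP_MS):.3f} hours")  # 1193.046 hours
    # One hour in, and one hour plus a full wrap in, log the same value.
    one_hour = 60 * 60 * 1000
    print(logged_timestamp(one_hour) == logged_timestamp(one_hour + WRAP_MS))
```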

Disk ad5 has an excellent temperature, and no sign of bad blocks/sectors
either.  The disk has been powered on for a total of 11956 hours.  No
errors were found in the SMART log.

Disk ad6 has a good temperature, and no sign of bad blocks/sectors.  No
errors were found in the SMART log.

Disk ad7 has an excellent temperature, and no sign of bad blocks/sectors
either.  The disk has been powered on for a total of 12512 hours.

However, much like disk ad4, this disk also witnessed a CRC error when
attempting to either do a DMA read operation or when setting
capabilities on the device.  I'm prone to believe it's when setting
capabilities, because LBA 0 is also seen here, which isn't a likely LBA.
This error happened at the 6310 hour mark, which was about half of its
lifetime ago.
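
The "about half" figure is just subtraction on the two numbers from the
SMART output (12512 hours powered on, error at the 6310 hour mark).  A
tiny Python sketch, with a made-up function name, in case you want to
repeat it for other disks:

```python
def hours_since_error(power_on_hours, error_at_hours):
    """Powered-on hours that have passed since a logged SMART error."""
    return power_on_hours - error_at_hours


if __name__ == "__main__":
    age = hours_since_error(12512, 6310)  # ad7 figures from the SMART output
    print(age, "powered-on hours ago")
    print(f"{age / 12512:.0%} of the disk's powered-on lifetime")
```

Keep in mind these are powered-on hours, not calendar time, so the
calendar age of the error could be considerably larger.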

All of this is somewhat of a mystery.  Disk ad4 is on a completely
different physical cable than disk ad7, so that *could* rule out cabling
problems.  The errors seen are only when setting device capabilities
(making an educated guess, but I'm not 100% positive), not when actually
accessing data on the disks.  Heck, I'm not even sure the errors in the
SMART log are accurate, as the disks have been powered on for quite some
time after the supposed errors occurred.

Excessive power draw could also explain this, as could the bad-voltage
possibility mentioned earlier.

I would start by doing 3 easy things:

1) Re-enable DMA mode; it's obviously not the cause of your problems,
since PIO mode shows the same problem for you.

2) Replace both PATA cables with brand new ones.  There's no evidence
this is the problem, but changing these is easy and cheap.  If it
doesn't solve the problem, then you're one step closer to tracking it
down.

3) Get the voltages from the BIOS and provide them here.  Again, this
won't be an accurate representation of the system under load, but it's
the best we've got right now.

Assuming the problem continues after #2, and the voltages shown in #3
look good, this is what I'd do for the next step:

Buy a PCI, PCI-X (if you go with PCI-X, make sure the card is
backwards-compatible with 32-bit 33MHz PCI slots, unless you actually
have a PCI-X slot!), or PCI Express PATA controller -- specifically, one
that does not use a Silicon Image chip.  This may be hard to accomplish
since PATA is a dying interface (and good riddance!).

I will also stress this in capitals, just to make it clear: DO NOT BUY A
SATA CONTROLLER AND THEN USE PATA-TO-SATA ADAPTERS.  Those adapters will
cause you even more problems.  If you go the SATA route, buy actual SATA
disks and recycle or sell your old PATA ones.

That said, Highpoint and Promise both make PATA controllers -- not to
mention, I even see that you've tried to load the hptrr(4) driver on
that system!  :-) Additionally, DO NOT use the "RAID" features of these
cards (if you end up buying one that has such); just plug the disks in
and use them in a JBOD fashion.

You might find that the disk numbers (e.g. ad4) change on you when
doing this; that's to be expected.
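
The numbers aren't random, by the way.  As far as I understand it, with
static device numbering (the ATA_STATIC_ID kernel option; I'm assuming
your kernel has it, as the list above is consistent with it), the ad
unit is derived from the channel number and the master/slave position.
A small Python sketch of that rule, with a function name of my own
invention:

```python
def ad_unit(channel, is_slave):
    """ad unit number under static numbering: two units per ATA channel."""
    return 2 * channel + (1 if is_slave else 0)


if __name__ == "__main__":
    # Reproduce the layout from the dmesg summary.
    for channel, is_slave in [(0, False), (2, False), (2, True), (3, False), (3, True)]:
        role = "slave" if is_slave else "master"
        print(f"ata{channel}-{role} -> ad{ad_unit(channel, is_slave)}")
```

So if a new controller's channels show up under different ataN numbers,
the disks behind it get different adN names to match, and /etc/fstab
will need adjusting accordingly.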

Others might recommend that you try replacing the PSU before buying a
new PATA controller, but I have doubts the problem is with the PSU; I
would expect more odd/awkward problems if the PSU were to blame.
If you do try a different PSU, go with one that does 450W or more.  You
DO NOT need a l33t-g4m3-d00dz-omgwtfbbq!! 850-1000W PSU; most of the
power draw for hard disks happens during power-on, when the disks have
to spin up, not once they're already spinning.

Hope this helps, and good luck!

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |