twa: Passthru request timed out! Resetting controller...

Wed Nov 15 18:07:37 PST 2006

Mark Dotson said the following on 11/14/06 1:18 PM:
> I've had continued problems with the 3ware series SATA cards and the 
> Tyan boards.  Specifically, I have a "Tyan S5360-1U" and both a 
> 9500S-4LP and an 8506 series 3ware card.
> 
> In my case the first error is different, but the 'resetting' over and 
> over is VERY familiar.  This could be triggered by a simple file copy 
> from one part of a container to another; degrading the unit and 
> triggering the resetting crap.  Note that the drives are fine, I tested 
> that first thing.
> 
> Sep  8 11:59:23 localhost kernel: 3w-9xxx: scsi0: WARNING: 
> (0x06:0x002C): Unit #1: Command (0x2a) timed out, resetting card.
> Sep  8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E):
> Cache synchronized after power fail:unit=0.
> Sep  8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E):
> Cache synchronized after power fail:unit=1.
> 
> I also found this problem to exist across platforms, not just FreeBSD. 
> For example, the excerpt above is from a CentOS box.
> 
> All tests were done with newest firmware for both card and mobo, and 
> using the newest drivers provided by 3ware.
> 
> Once I removed the card and drives from the Tyan system and stuck them 
> in pretty much ANY other system, they worked fantastically.
> 
> I don't have an answer for the "resetting problem" as of yet... 3ware 
> and Tyan (And my system vendor "Appro") are still trying to find my 
> specific problem and solve it.  I believe they are currently doing the 
> "replace everything" method of troubleshooting.
> 
Mark, thank you.

It's good to know that the resetting problem exists on other platforms too.

We already found out that replacing the entire box with an identical one 
doesn't help, so unfortunately we'll have to start swapping in components 
of different brands or models.

I wouldn't like to touch the I/O subsystem (these are already loaded 
production machines), so like you said, the safest bet would be to try 
another motherboard.

However, I don't see many dual Opteron based boards on 3ware's 
compatibility list. The next one that comes to mind from that 
list is the Supermicro H8DC8, but it looks more like a gamer's dream 
(High-End PCI-e Graphics, SLI, etc. but no on-board VGA) than a server 
board.

I'm quite surprised that the top Opteron based motherboard manufacturer 
listed in the 3ware web site motherboard compatibility docs:
http://3ware.com/products/pdf/Motherboard_compatibility_list_9550SX_2006_06.pdf 

makes 2 out of 5 boards that are marked as compatible, yet they perform so 
badly with 3ware cards.

I know what happens on this mailing list when somebody asks about 
good SATA cards (Re: 3ware, 3ware, ...); I have replied that way myself.

So are there any success stories with the 3ware 9550SX (SATA II) and dual 
AMD Opteron server boards, or is it time to go back to Intel?

Regards,
Atanas

> Atanas wrote:
>> Has anyone experienced this:
>>
>> twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 
>> 0xca839d20
>> twa0: INFO: (0x16: 0x1108): Resetting controller...:
>> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
>> ...
>> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=7
>> twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
>> twa0: INFO: (0x16: 0x1107): Controller reset done!:
>>
>> This happens on 6.2-PRERELEASE i386 (and on 6.1 since its release) on 
>> a number of machines with the following hardware configuration:
>>
>> - Tyan K8SE 2892, 2 AMD Opteron 270 CPUs, 4GB RAM
>> - 3ware 9550SX-8LP, 8 500GB Seagate ST3500641AS SATA drives
>>   (configured as 8 SINGLE DISK units, aka JBOD)
>>
>> All hardware components, including the server chassis, are listed in 
>> the 3ware hardware compatibility lists. It doesn't seem to be a 
>> cabling or power issue. The controller and hard drives are already 
>> flashed to the latest firmware revisions. I tried turning off NCQ, but 
>> it didn't make any difference. I tried also switching the kernel from 
>> PAE to non-PAE (reducing the usable memory to 3GB), but it didn't help 
>> either.
>>
>> I have other machines with similar I/O configurations (3ware), but 
>> with Intel motherboards and running FreeBSD-5.5, and these have run 
>> fine for about a year already. Now I'm thinking about swapping the 
>> drives between a working Intel box and an AMD based box, to see which 
>> machine the controller timeouts follow.
>>
>> The problem happens sporadically, about once a month, and is very 
>> hard to reproduce. Sometimes it takes several weeks until the next 
>> crash happens, sometimes it crashes again in just a few hours.
>>
>> When the thing happens, the kernel sometimes panics (most likely due 
>> to the inconsistent filesystem state caused by the controller reset), 
>> sometimes just hangs. It can be interrupted (I have a serial console), 
>> but the only usable thing after that seems to be "call cpu_reset()", 
>> followed by full (and sometimes painfully long) filesystem check.
>>
>> Here are the diffs against the default GENERIC and PAE kernel 
>> configurations:
>>
>> < cpu       I486_CPU
>> < ident     GENERIC
>> < options   INET6               # IPv6 communications protocols
>> < options   SCSI_DELAY=5000     # Delay (in ms) before probing SCSI
>>
>>  > options   QUOTA
>>  > options   SMP                 # Symmetric MultiProcessor Kernel
>>  > options   BREAK_TO_DEBUGGER
>>  > options   DDB
>>  > options   KDB
>>  > options   KDB_UNATTENDED
>>
>>  > options   IPFIREWALL
>>  > options   DUMMYNET
>>
>> I'm attaching the dmesg.boot following the latest crash.
>>
>> Regards,
>> Atanas
>>
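
Since the resets are rare and hard to reproduce, it may help to tally the
twa messages from saved console or syslog output across machines. Below is
a minimal sketch, assuming the line layout seen in the excerpt quoted above
(device, severity, a pair of hex codes, then the message text). The regular
expression is inferred from those few lines only, and the names
parse_twa_events and summarize are hypothetical helpers, not part of any
3ware or FreeBSD tool.

```python
import re
from collections import Counter

# Layout inferred from the quoted excerpt (an assumption, not a spec), e.g.:
#   twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
TWA_LINE = re.compile(
    r"(?P<dev>twa\d+): (?P<sev>[A-Z]+): "
    r"\((?P<src>0x[0-9A-Fa-f]+): (?P<code>0x[0-9A-Fa-f]+)\): "
    r"(?P<msg>[^:]+)"
)


def parse_twa_events(lines):
    """Yield (device, severity, code, message) for each twa driver line."""
    for line in lines:
        m = TWA_LINE.search(line)
        if m:
            yield m["dev"], m["sev"], m["code"].lower(), m["msg"].strip()


def summarize(lines):
    """Count how often each (device, message) pair appears."""
    return Counter((dev, msg) for dev, _sev, _code, msg in parse_twa_events(lines))


# Sample lines taken from the excerpt above (wrapped line re-joined).
sample = [
    "twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 0xca839d20",
    "twa0: INFO: (0x16: 0x1108): Resetting controller...:",
    "twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0",
    "twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1",
    "twa0: INFO: (0x16: 0x1107): Controller reset done!:",
]

if __name__ == "__main__":
    for (dev, msg), count in summarize(sample).items():
        print(f"{dev}: {msg}: {count}")
```

Feeding it the lines of a saved log file instead of the sample list would
give a per-controller count of timeouts and resets, which makes it easier
to compare the Tyan boxes against the Intel ones.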