ata timeouts under load

Wed Sep 16 04:31:35 UTC 2009

On Sun, 13 Sep 2009, Kris Kennaway wrote:

> Alexander Motin wrote:
>> Kris Kennaway wrote:
>>> I am getting timeouts on 8.0b4/HEAD when I do a lot of ZFS I/O to a pool
>>> on ad4:
>>> 
>>> atapci0: <VIA 6420 SATA150 controller> port
>>> 0xc800-0xc807,0xc400-0xc403,0xc000-0xc007,0xb800-0xb803,0xb400-0xb40f,0xb000-0xb0ff
>>> irq 20 at device 15.0 on pci0
>>> ata2: <ATA channel 0> on atapci0
>>> ata3: <ATA channel 1> on atapci0
>>> ata0: <ATA channel 0> on atapci1
>>> ata1: <ATA channel 1> on atapci1
>>> 
>>> ad4: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata2-master SATA150
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
>>> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> 
>>> It becomes stuck in a loop displaying the above and is unable to
>>> complete further I/O operations.  I wonder if it is just batching up a
>>> lot of I/O and then timing out because it is busy, and then not
>>> recovering from this state?
>>> 
>>> Any ideas what could be wrong?
>> 
>> There are two different kinds of timeouts we can see:
>>  - The first one, "ad4: WARNING - ...", is just a queue waiting timeout. It
>> is not the cause but a consequence of the problem. And I have doubts
>> that it is reasonable to do it.
>>  - The second one, "TIMEOUT - WRITE_DMA48 ...", is a real command execution
>> timeout. I don't know whether this is the result of some improper error
>> recovery, or your drive has indeed lost the required servo information near
>> LBA=344052040 and is taking too long to find it. You can try to read that
>> sector and nearby ones with dd.
>> 
>
> It's always that sequence (with SETFEATURES timing out first, then the DMA
> later), and the block number varies widely, as does whether it's a read or a
> write. The disk itself and the data it contains appear to be OK as far as I
> have been able to determine so far.
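
For sifting through a long log, here is a rough sketch that separates the two
message kinds discussed above and pulls out the LBA from the real command
timeouts. It assumes the messages look exactly like the ones quoted in this
thread; the regexes and the function name classify() are my own and are not
taken from the ata driver.

```python
import re

# Both patterns are assumptions derived only from the log lines quoted in
# this thread; other driver versions may word the messages differently.
QUEUE_WAIT = re.compile(
    r"^(?P<dev>\w+): WARNING - (?P<cmd>.+?) taskqueue timeout"
)
CMD_TIMEOUT = re.compile(
    r"^(?P<dev>\w+): TIMEOUT - (?P<cmd>\S+) retrying .*LBA=(?P<lba>\d+)"
)


def classify(line):
    """Return (kind, details) where kind is 'queue_wait',
    'command_timeout', or 'other'."""
    line = line.strip()
    m = CMD_TIMEOUT.match(line)
    if m:
        return "command_timeout", {
            "dev": m["dev"],
            "cmd": m["cmd"],
            "lba": int(m["lba"]),
        }
    m = QUEUE_WAIT.match(line)
    if m:
        return "queue_wait", {"dev": m["dev"], "cmd": m["cmd"]}
    return "other", {}


if __name__ == "__main__":
    sample = [
        "ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly",
        "ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040",
    ]
    for entry in sample:
        print(classify(entry))
```

Collecting the LBAs from the command_timeout entries would make it easy to
confirm whether the block numbers really are spread across the whole disk.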

This may not be meaningful, but I used to have a lot of very similar
problems (the messages, the loop, etc. were exactly the same) with VIA
chipsets and an AMD CPU. It seemed to be triggered by a certain drive, but I
never could figure it out completely. I moved to an Intel board/CPU and have
never seen it since. This looks like an older SATA1 chipset, so perhaps it
could be the same problem. The problem was not related to ZFS.