problems with sata disks (taskqueue timeout)

Tue Jan 20 01:16:44 PST 2009

Marc UBM wrote:
> Hiho! :-)
>
> Occasionally, especially when uploading a large number of files, the
> (brand-new, tested) sata disks in my fileserver spit out some of these
> errors:
>
> -----------------------
>
> Jan 19 19:51:14 hamstor kernel: ad10: WARNING - WRITE_DMA48 UDMA ICRC error (retrying request) LBA=882778752
> Jan 19 19:51:23 hamstor kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Jan 19 19:51:27 hamstor kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Jan 19 19:51:31 hamstor kernel: ad10: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> Jan 19 19:51:35 hamstor kernel: ad10: WARNING - SET_MULTI taskqueue timeout - completing request directly
> Jan 19 19:51:35 hamstor kernel: ad10: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=882778752
> Jan 19 19:51:35 hamstor kernel: ad10: FAILURE - WRITE_DMA48 status=ff<BUSY,READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=ff<ICRC,UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=882778752
> Jan 19 19:51:35 hamstor root: ZFS: vdev I/O failure, zpool=gedaerm path=/dev/ad10 offset=451982655488 size=131072 error=5
> Jan 19 19:51:41 hamstor kernel: ad10: FAILURE - SET_MULTI status=51<READY,DSC,ERROR> error=4<ABORTED>
> Jan 19 19:51:41 hamstor kernel: ad10: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=882779008
> Jan 19 19:51:41 hamstor kernel: ad10: WARNING - WRITE_DMA48 UDMA ICRC error (retrying request) LBA=882779008
> Jan 19 19:51:50 hamstor kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Jan 19 19:51:54 hamstor kernel: ad10: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> Jan 19 19:51:58 hamstor kernel: ad10: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> Jan 19 19:52:02 hamstor kernel: ad10: WARNING - SET_MULTI taskqueue timeout - completing request directly
> Jan 19 19:52:02 hamstor kernel: ad10: FAILURE - WRITE_DMA48 timed out LBA=882779008
> Jan 19 19:52:02 hamstor root: ZFS: vdev I/O failure, zpool=gedaerm path=/dev/ad10 offset=451982786560 size=131072 error=5
>
> -----------------------
>
> I've fiddled with the cables, which seemed to help, but I've been
> unable to completely eliminate the errors. The disks are two Western
> Digital MyBooks Home Edition (1 TB per disk), connected to a Promise TX
> 4 SATA Controller:
>
> atapci0 at pci0:1:6:0:  class=0x018000 card=0x3d17105a chip=0x3d17105a rev=0x02 hdr=0x00
>     vendor     = 'Promise Technology Inc'
>     device     = 'PDC40718-GP SATA 300 TX4 Controller'
>     class      = mass storage
>
> They're connected via 50 cm eSATA cables.
>
> I've googled on the net and found some vague hints about problems with
> the Promise TX4, but nothing concrete.
>
> What I've found is
>
> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>
> basically telling me "these things happen, deal with it" :-)
>
> The problem is that I cannot reproduce these errors reliably. The only
> thing I notice is that they *seem* to happen more often when a lot of
> large files are copied in succession.
>
> Can anybody tell me if upgrading to 7.2 or -current will help?
>
> I'm currently running 
>
> 7.0-STABLE-200804 FreeBSD 7.0-STABLE-200804 #0: Wed Dec 10 15:29:03 CET
> 2008   ***@host:/usr/obj/usr/src/sys/GENERIC  amd64
>
> The next step I'll try is upgrading to RELENG_7 to see if that helps.
>
>
> Greetings,
> Marc
>   
Cheers Marc.

My personal experience makes me think that this issue is 
controller- or driver-related.
I have been using a SATA 300 TX4 controller in my fileserver since 
6.1-RELEASE (with 2 of 4 ports in use), and I used to see a lot of 
exactly the same errors in the logs. Sometimes they were harmless, but 
sometimes one of the disks would disconnect from the controller as a 
result, and the only way to get it back was to power-cycle the PC. That 
mostly happened under heavy I/O, such as while dumping filesystems.

The good news is that since 7.0-RELEASE I have seen such errors maybe 
2-3 times, and not at all for at least the last 6 months. That is 
probably because I rebuild my system about once a month to keep up with 
the stable branch, and something was fixed in the sources during that 
time.

So I also advise upgrading to RELENG_7; you will probably get rid of these errors.
Good luck!
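
If you want to check whether the errors really become rarer after the 
upgrade, a rough tally of the log lines can help. The following is only 
a minimal sketch in Python: the function names (classify, tally) are 
made up for illustration, and it assumes the kernel messages look like 
the ones quoted above and that each one sits on a single line in the 
log file (usually /var/log/messages on FreeBSD).

```python
import re
from collections import Counter

# Matches lines such as:
#   Jan 19 19:51:14 hamstor kernel: ad10: WARNING - WRITE_DMA48 UDMA ICRC error ...
# The pattern is an assumption based on the log excerpt in this thread.
ATA_LINE = re.compile(
    r"kernel: (?P<dev>ad\d+): (?P<level>WARNING|TIMEOUT|FAILURE) - (?P<rest>.*)"
)


def classify(rest):
    """Put the message text into a coarse, hypothetical category."""
    if "taskqueue timeout" in rest:
        return "taskqueue timeout"
    if "UDMA ICRC error" in rest:
        return "UDMA ICRC error"
    return "other"


def tally(lines):
    """Count ATA messages per (device, level, category)."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            key = (match.group("dev"), match.group("level"),
                   classify(match.group("rest")))
            counts[key] += 1
    return counts


# Example use (the path is an assumption, adjust it to your syslog setup):
#   with open("/var/log/messages", errors="replace") as fh:
#       for key, n in sorted(tally(fh).items()):
#           print(n, *key)
```

Running it before and after the upgrade over a comparable period gives 
you numbers to compare instead of an impression. Lines from other 
sources, such as the ZFS vdev I/O failures, are ignored on purpose.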

-- 
Bartosz Stec