"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

Joe Peterson joe at skyrush.com
Fri Jan 25 17:03:17 PST 2008


Jeremy Chadwick wrote:
> Joe, I wanted to send you a note about something that I'm still in the
> process of dealing with.  The timing couldn't be more ironic.
> 
> I decided it would be worthwhile to migrate from my two-disk ZFS stripe
> with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
> disks combined (since they're all the same size).  I had another
> terminal with gstat -I500ms running in it, so I could see overall I/O.
> 
> All was going well until about the 81GB mark of the copy.  gstat started
> showing 0KB in/out on all the drives, and the rsync was stalled.  ^Z did
> nothing, which is usually a bad sign.  :-)  I ssh'd in and did a dmesg
> (summarised):
> 
> ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
> ad6: FAILURE - WRITE_DMA timed out LBA=13951071
> ad6: FAILURE - WRITE_DMA timed out LBA=13951327
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
> ad6: FAILURE - WRITE_DMA timed out LBA=13951583
> ad6: FAILURE - WRITE_DMA timed out LBA=13951839
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
> g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
> 
> It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
> blocks.  Actually, after letting things go for a while, I realised the
> box just locked up.  Probably kernel panic'd due to the I/O problem.
> I'll have to poke at SMART stats later to see what showed up.

Wow, pretty crazy!  Hmm, and yes, those LBAs do look close together.
Well, let me know how the smartctl output looks.  I'd be curious if your
bad sector count rises.  I had noticed that 1
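
To put a number on "close together", here is a minimal sketch that computes the spacing between the LBAs and the g_vfs_done offsets in the quoted log. The 512-byte sector size is an assumption (it is not stated anywhere in the log), and the variable names are only illustrative.

```python
# LBAs copied from the ad6 TIMEOUT/FAILURE lines in the quoted dmesg output.
lbas = [13951071, 13951327, 13951583, 13951839, 13952095, 13952351]

# Byte offsets copied from the g_vfs_done() lines for ad6s1d.
offsets = [7142916096, 7143047168, 7143178240, 7143309312, 7143440384]

SECTOR_BYTES = 512  # assumption: 512-byte sectors, not stated in the log

# Distance between consecutive failing LBAs, in sectors.
lba_strides = [b - a for a, b in zip(lbas, lbas[1:])]

# The same distance expressed in bytes, for comparison with length=131072.
stride_bytes = lba_strides[0] * SECTOR_BYTES

# Distance between consecutive g_vfs_done() offsets, in bytes.
offset_strides = [b - a for a, b in zip(offsets, offsets[1:])]

# Total on-disk span between the first and last failing LBA, in bytes.
span_bytes = (lbas[-1] - lbas[0]) * SECTOR_BYTES

print("LBA strides (sectors):", lba_strides)
print("LBA stride (bytes):   ", stride_bytes)
print("offset strides (bytes):", offset_strides)
print("span of failing LBAs (bytes):", span_bytes)
```

Under that assumption the failing LBAs are evenly spaced at the same distance as the 131072-byte write length, which would fit one sequential run of writes failing in a row rather than scattered bad spots. It does not by itself say whether the drive, the cable, or the controller is at fault.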

BTW, I tried:

crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
^C1408596+0 records in
1408596+0 records out
92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)

(I let it go for 92GB or so) - no messages about ad1.  So I wonder if
this points at either the cable connector on ad0 or the drive itself.  I
guess I'd rather have a failing drive than motherboard...
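
For reference, the dd figures above are self-consistent. The following is a minimal sketch that re-derives them from the pasted output; it assumes bs=64k means 64 * 1024 bytes, and the variable names are only illustrative.

```python
# Values copied from the dd output above.
records = 1408596            # "1408596+0 records in"
total_bytes = 92313747456    # "92313747456 bytes transferred"
seconds = 1415.324362        # "in 1415.324362 secs"

BLOCK_BYTES = 64 * 1024      # assumption: bs=64k is 65536 bytes

# Every record was a full block ("+0" partial records), so the record
# count times the block size should equal the byte total.
bytes_from_records = records * BLOCK_BYTES
consistent = bytes_from_records == total_bytes

# Average sequential read rate over the whole run.
rate = total_bytes / seconds

print("records * block size matches byte total:", consistent)
print("average rate: %.0f bytes/sec (%.1f MB/s)" % (rate, rate / 1e6))
```

A steady rate over the whole read, with no ad1 messages, is what makes ad1 look healthy here. Note that this was a read-only test, while the errors in the subject line are WRITE_DMA timeouts.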

I originally was wondering if somehow something peculiar about ZFS's
disk access pattern was making it happen...

Thanks for the recommendations.  I'll keep an eye on it, and I'll let you
know what a cable change does for me.  Still, I have not had any ad0
messages since this morning (I haven't been using the system much today,
but maybe the cron processes are more likely to trigger it)...

					-Joe


More information about the freebsd-stable mailing list