ATA driver/gmirror problems, multiple boxes...

Johan Ström johan at stromnet.se
Wed Apr 25 07:42:43 UTC 2007


Hello

I have a few boxes, elfi, crus and gw-1, running gmirror. These are
three completely different boxes, all running FreeBSD 6.x (6.1 or
6.2, see the specs below). They all have multiple disks which are
gmirrored; two of them are SATA-only and one has a mirror between one
SATA and one ATA disk.
Every now and then they all have problems with the mirrors, all three
in different ways. elfi is the one crashing most, but it is also the
one with the most disk I/O, so that might be "expected" (not that it
crashes, but that it is the one crashing most often).
First, some HW specs:

elfi:
FreeBSD elfi.stromnet.se 6.2-RELEASE FreeBSD 6.2-RELEASE #9: Thu Jan  
18 16:53:20 CET 2007     root@:/usr/obj/usr/src/sys/ELFI  i386
atapci1: <nVidia nForce3 Pro SATA150 controller> port  
0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xdc00-0xdc0f, 
0xe000-0xe07f irq 21 at device 10.0 on pci0
ad4: 286187MB <Maxtor 7L300S0 BANC1G10> at ata2-master SATA150
ad6: 286187MB <Maxtor 7L300S0 BANC1G10> at ata3-master SATA150
Mirror gm0s1 consists of ad4+ad6

crus:
FreeBSD crus.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue  
May  9 20:40:23 CEST 2006     johan at elfi.stromnet.org:/usr/obj/usr/ 
src/sys/GENERIC  i386
atapci1: <Promise PDC40518 SATA150 controller> port 0x7480-0x74ff, 
0x7800-0x78ff mem 0xfebdb000-0xfebdbfff,0xfebe0000-0xfebfffff irq 22  
at device 14.0 on pci1
ad8: 305245MB <Seagate ST3320620AS 3.AAE> at ata4-master SATA150
ad12: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA150
Mirror gm1 consists of ad8+ad12

gw-1:
FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7:  
Tue Feb 13 18:24:34 CET 2007     johan at elfi.stromnet.se:/usr/obj/usr/ 
src/sys/ROUTER.POLLING  i386
atapci0: <nVidia nForce2 Pro UDMA133 controller> port  
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 9.0 on pci0
atapci1: <nVidia nForce2 Pro SATA150 controller> port  
0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0x7f00-0x7f0f, 
0x7c00-0x7c7f irq 20 at device 11.
ad2: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata1-master UDMA100
ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
Mirror gm0 consists of ad6s1+ad2

A typical crash on elfi looks like this:
Apr 24 05:20:27 elfi kernel: ad6: FAILURE - device detached
Apr 24 05:20:27 elfi kernel: subdisk6: detached
Apr 24 05:20:27 elfi kernel: ad6: detached
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6  
disconnected.
Apr 24 05:20:27 elfi kernel: g_vfs_done():mirror/gm0s1f[READ 
(offset=16972791808, length=16384)]error = 6

This can happen at any time of day; this one was from around 5 in the
morning. To recover I have to reboot the box (a soft reboot works),
and the mirror then rebuilds once it is up. Before the reboot,
atacontrol cannot find the disk at all. I've tried atacontrol reinit
and detach/attach, but neither helps.
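
For reference, the numbers after "error" in these kernel messages look
like ordinary errno values (that is an assumption on my part, I have
not checked the GEOM source). A small Python sketch to turn them into
symbolic names; errno_name is just a helper name I made up, and 5 is
the value that shows up in the logs from the other two boxes further
down:

```python
import errno


def errno_name(code):
    """Return the symbolic errno name for a numeric code, or 'UNKNOWN'."""
    return errno.errorcode.get(code, "UNKNOWN")


# 6 is from the g_vfs_done() line on elfi, 5 from the GEOM_MIRROR
# "Request failed" lines on crus and gw-1.
for code in (5, 6):
    print(code, errno_name(code))
```

If the assumption holds, elfi is getting ENXIO (the device is gone)
while the other two boxes get EIO (a plain I/O error), which fits the
"device detached" message that only elfi logs.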

A crash on crus can look like this:
Apr 23 13:45:49 crus kernel: ad8: TIMEOUT - READ_DMA48 retrying (1  
retry left) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - READ_DMA48 UDMA ICRC  
error (retrying request) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER  
MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER  
MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE  
taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE  
taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SET_MULTI taskqueue  
timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: FAILURE - READ_DMA48 timed out  
LBA=566657039
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Request failed (error=5).  
ad8[READ(offset=290128403968, length=16384)]
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8  
disconnected.

On this box a gmirror forget followed by a gmirror insert is enough,
and it will happily rebuild the array.
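
As a sanity check on the crus log above, the byte offset in the
GEOM_MIRROR line can be compared with the LBA in the ad8 lines. This
is only a sketch: it assumes 512-byte sectors and that the mirror sits
on the whole disk (gm1 is built on ad8 directly, so there should be no
slice offset to add). offset_to_lba is a made-up helper name.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors on these disks


def offset_to_lba(offset_bytes, sector_size=SECTOR_SIZE):
    """Convert a byte offset on the raw provider into a sector number."""
    lba, remainder = divmod(offset_bytes, sector_size)
    if remainder:
        raise ValueError("offset is not sector-aligned")
    return lba


# Offset taken from the "Request failed (error=5)" line on crus.
print(offset_to_lba(290128403968))  # 566657039, same LBA as in the ad8 lines
```

So the read that GEOM_MIRROR reports as failed is the same request the
ATA driver timed out on, not a second, unrelated error.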

The worst box is gw-1:
Apr 20 03:10:59 gw-1 kernel: ad2: timeout waiting to issue command
Apr 20 03:10:59 gw-1 kernel: ad2: error issuing WRITE_DMA command
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Request failed (error=5).  
ad2[WRITE(offset=37578448384, length=16384)]
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Device gm0: provider ad2  
disconnected.
Apr 20 07:23:57 gw-1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 20 07:23:57 gw-1 kernel: Copyright (c) 1992-2007 The FreeBSD  
Project.

Yes, it fails and then the whole box HANGS completely. No input
possible at all; I had to hard-reboot it with the button. Not good at
all. The disks that are now in elfi used to run in this machine, and
back then I had the same problem: disk problems -> total hang. That
was with SATA only; this time it appears to be a problem with the ATA
disk too?

I have never managed to force these crashes. They happen now and
then, at no regular interval, and I can never reproduce them on
demand. For elfi:
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6  
disconnected.
(I actually can't find any other entries in the logs, but judging from
IRC logs: March 28, March 12, Feb 13, Jan 22, Jan 18)

For crus:
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8  
disconnected.
Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8  
disconnected.
I think it has happened once more, but that's it.

For gw-1 it has luckily happened only once so far, at least with the
current install. It had problems when the Maxtor disks were running
in it (and I think it was 6.0 back then).
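
To get a proper list of dates instead of going by IRC logs, something
like the following Python sketch could pull the disconnect events out
of the logs. It assumes each syslog entry is on a single line (my mail
client wrapped the ones quoted here), and the function name and the
sample lines are only for illustration; the samples are copied from
the logs above.

```python
import re
from collections import Counter

# Assumption: one syslog entry per line, in the format quoted in this mail.
DISCONNECT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"GEOM_MIRROR: Device (?P<mirror>\S+): provider (?P<prov>\S+)\s+disconnected\."
)


def find_disconnects(lines):
    """Return (timestamp, host, mirror, provider) for each disconnect line."""
    events = []
    for line in lines:
        m = DISCONNECT.match(line)
        if m:
            events.append((m["stamp"], m["host"], m["mirror"], m["prov"]))
    return events


sample = [
    "Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Request failed (error=5). "
    "ad8[READ(offset=290128403968, length=16384)]",
    "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.",
]

events = find_disconnects(sample)
# Tally disconnects per (host, provider) to see which disk drops most often.
counts = Counter((host, prov) for _, host, _, prov in events)
for (host, prov), n in sorted(counts.items()):
    print(host, prov, n)
```

On a real box the lines would come from the log files instead of the
sample list, for example by passing an open /var/log/messages to
find_disconnects (rotated logs would need to be decompressed first).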

So: three different boxes, with three different chipsets and three
different crash scenarios, but they all have problems. So where is
the actual problem? The HW? The chipset drivers? The gmirror code? I
have run SMART tests on the crashing disks, no errors. I ran PowerMax
(Maxtor's own test program) a while back on the Maxtor disks, no
problems. I have tried changing SATA cables on some of the disks, no
difference.

Does anyone have any clue about what can be causing this? What is  
most likely? How do we hunt this down?

Thank you.

Johan Ström
Stromnet
johan at stromnet.se
http://www.stromnet.se/




More information about the freebsd-stable mailing list