Showstopper ATA bug in 6.1-PRE? (Still happening)

Paolo Maero paolo at euresis.it
Wed Feb 15 15:41:59 PST 2006


On Feb 10, 2006, at 9:03 AM, Wilko Bulte wrote:
> On Thu, Feb 09, 2006 at 11:08:24PM +0100, Wilko Bulte wrote..
>> On Thu, Feb 09, 2006 at 09:24:23PM +0100, Sren Schmidt wrote..
>>> Wilko Bulte wrote:
>>>> On Thu, Feb 09, 2006 at 03:45:53PM +0100, Sren Schmidt wrote..
>>>>> Wilko Bulte wrote:
>>>>>> On Thu, Feb 09, 2006 at 03:37:07PM +0100, Sren Schmidt wrote..
>>>>>>> Wilko Bulte wrote:
>>>>>>>> On Wed, Feb 08, 2006 at 10:44:05PM +0100, Sren Schmidt wrote..
>>>>>>>>> Wilko Bulte wrote:
>>>>>>>>>> On Wed, Feb 08, 2006 at 10:02:08PM +0100, Sren Schmidt  
>>>>>>>>>> wrote..
>>>>>>>>>>> Wilko Bulte wrote:
>>>>>>>>>>>> Hi Soren,
>>>>>>>>>>>>
>>>>>>>>>>>> I just went to 6.1-PRE on my main machine, coming from  
>>>>>>>>>>>> 6.0-STABLE
>>>>>>>>>>>> of roughly end of december.
>>>>>>>>>>>>
>>>>>>>>>>>> And I hit some stuff that really worries me:
>>>>>>>>>>>>
>>>>>>>>>>>> - the freshly built kernel keels over with (hand  
>>>>>>>>>>>> transcribed):
>>>>>>>>>>>>
>>>>>>>>>>>> ata3: reiniting channel SATA connect ...
>>>>>>>>>>>> SATA connected
>>>>>>>>>>>> sata_connect_devices 0x1 <ATA_MASTER>
>>>>>>>>>>>>
>>>>>>>>>>>> ad6: req=0xC35ba0c8 SETFEATURES SETTRANSFERMODE  
>>>>>>>>>>>> semaphore timeout
>>>>>>>>>>>> !! DANGER Will RObinson !!
>>>>>>>>>>>>
>>>>>>>>>>>> (... is where I cannot read my own handwriting, it  
>>>>>>>>>>>> scrolled quite
>>>>>>>>>>>> fast on
>>>>>>>>>>>> the screen..)
>>>>>>>>>>>>
>>>>>>>>>>>> Boot device is a SATA RAID1 on a Promise 2300.
>>>>>>>>>>> Hmm, that should not happen. Could you try to backstep  
>>>>>>>>>>> just ATA to
>>>>>>>>>>> before the MFC, that is 24/1/06 and let me know if that  
>>>>>>>>>>> helps
>>>>>>>>>>> please ?
>>>>>>>>>> First impression is that the problem is gone.  None of the
>>>>>>>>>> previously reported errors are seen.  I am running a level  
>>>>>>>>>> 0 dump
>>>>>>>>>> from disk to disk
>>>>>>>>>> to see if the box remains stable.  Given that this is my  
>>>>>>>>>> primary
>>>>>>>>>> machine
>>>>>>>>>> I sure hope it will be :-)
>>>>>>>>>>
>>>>>>>>>>>> Another snag is that my ad10 disk on 6.0-STABLE suddenly  
>>>>>>>>>>>> became
>>>>>>>>>>>> ad12 on
>>>>>>>>>>>> 6.1-PRE
>>>>>>>>>>> Hmm that is because there is only 2 ports on your promise  
>>>>>>>>>>> which is
>>>>>>>>>>> now correctly identified, before it was errounsly found  
>>>>>>>>>>> as 3 ports.
>>>>>>>>>> Ah, OK.  I would suggest a note to the Release Note  
>>>>>>>>>> writers would be
>>>>>>>>>> a good
>>>>>>>>>> thing, devices changing location after an upgrade in the - 
>>>>>>>>>> stable
>>>>>>>>>> branch
>>>>>>>>>> is unnerving ;-)
>>>>>>>>> Well, the good thing is that I can reproduce the error  
>>>>>>>>> here, the bad
>>>>>>>>> thing is that it slipped through testing on -current...
>>>>>>>>> Oh, well, I'll look into it ASAP...
>>>>>>>> Thank you Soren!
>>>>>>> OK, had a few this afternoon, could you try this patch and  
>>>>>>> let me know
>>>>>>> if it helps, at least it makes the problem go away on my  
>>>>>>> testbed..
>>>>>> Is this relative to HEAD or RELENG_6?  I cannot / will not go  
>>>>>> to HEAD
>>>>>> with this machine (my main production box.. :-)
>>>>> Doesn't matter, ATA is the same on both...
>>>>
>>>> OK, I was not sure if they were 100% identical.
>>>>
>>>> The patch at first impression seems to have eliminated the problem.
>>>
>>> Good seems I'm on the right track at least.
>>>
>>>> Interestingly enough ad10 remained ad10 with the patch applied?
>>>
>>> Yeah, thats intentional, I though we better not break POLA here..
>>
>> I agree :-)
>>
>>>> I'll put some load on to see what happens.
>>>
>>> Let me know how that turns out, I'll clean things up a bit and  
>>> get it
>>> committed to -current, then get permission to MFC when we are  
>>> sure it
>>> fixes the problem...
>>
>> I ran a 44GB disk-to-disk dump without incidents (source on the  
>> RAID1,
>> target on the JBOD).  No problems whatsoever.
>>
>> Looks like things behave much better now.  Tonight the machine will
>> run a daily full dump to DLT tape, I'll know how that turns out  
>> tomorrow.
>
> Backup ran without problems.
>
> -- 
> Wilko Bulte				wilko at FreeBSD.org
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable- 
> unsubscribe at freebsd.org"
>

Soren,

I am still getting this error. I cvsup'd to RELENG_6 today (Feb. 15,  
2006 at 6:00 PM CET) and I got the following files from HEAD:
ata-all.c (1.264)
ata-all.h (1.113)
ata-chipset.c (1.158)
ata-pci.c (1.115)

I have then rebuilt the kernel and world and I tested the system.

I have a Promise TX2300 with two 250 GB Maxtor in RAID1 with gmirror,  
plus two 9 GB SCSI drives in RAID1 with gmirror where I put the OS  
and applications and a 160 GB UATA Seagate (attached to the  
motherboard UATA) where I backup sensitive data.

The system was able to rebuild the mirror (I always got the error  
with RELENG_6_0), but then I made a backup tar from the two Maxtor's  
mirror to the Seagate and I got the following error after having  
copied approx 16 GB:

Feb 15 20:56:31 zoe kernel: GEOM_MIRROR: Device data1: provider ad4  
disconnected.
Feb 15 20:56:31 zoe kernel: GEOM_MIRROR: Device data1: provider ad4  
detected.
Feb 15 20:56:31 zoe kernel: GEOM_MIRROR: Device data1: rebuilding  
provider ad4.
...
Feb 15 22:33:15 zoe kernel: GEOM_MIRROR: Device data1: rebuilding  
provider ad4 finished.
Feb 15 22:33:15 zoe kernel: GEOM_MIRROR: Device data1: provider ad4  
activated.
...
Feb 15 23:04:12 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET  
TRANSFER MODE semaphore timeout !! DANGER Will Robinson !!
Feb 15 23:04:28 zoe last message repeated 4 times
Feb 15 23:04:32 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET  
TRANSFER MODE semaphore timeout !! DANGER Will Robinson !!
Feb 15 23:05:04 zoe last message repeated 8 times
Feb 15 23:07:08 zoe last message repeated 31 times
Feb 15 23:08:56 zoe last message repeated 27 times
Feb 15 23:09:00 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET  
TRANSFER MODE semaphore timeout !! DANGER Will Robinson !!
Feb 15 23:09:32 zoe last message repeated 8 times
Feb 15 23:11:36 zoe last message repeated 31 times
Feb 15 23:21:40 zoe last message repeated 151 times
Feb 15 23:31:40 zoe last message repeated 150 times
Feb 15 23:34:28 zoe last message repeated 42 times
Feb 15 23:34:32 zoe kernel: ad4: req=0xc2422578 SETFEATURES SET  
TRANSFER MODE semaphore timeout !! DANGER Will Robinson !!
Feb 15 23:35:04 zoe last message repeated 8 times
Feb 15 23:37:08 zoe last message repeated 31 times
Feb 15 23:40:52 zoe last message repeated 56 times
Feb 15 23:40:53 zoe shutdown: power-down by paolo:
Feb 15 23:40:56 zoe syslogd: exiting on signal 15

I was not able to shutdown the system cleanly.

I also noticed a small problem in the code (probably not related to  
this error):

At line 1011 of src/sys/dev/ata/ata-all.c a free() is used but  
ata_delayed_attach is not reset to NULL

Actual (1.264):
             if (config_intrhook_establish(ata_delayed_attach) != 0) {
                 printf("ata: config_intrhook_establish failed\n");
                 free(ata_delayed_attach, M_TEMP);
             }

Fix:
             if (config_intrhook_establish(ata_delayed_attach) != 0) {
                 printf("ata: config_intrhook_establish failed\n");
                 free(ata_delayed_attach, M_TEMP);
+               ata_delayed_attach = NULL;
             }

This is my dmesg:

Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.1-PRERELEASE #0: Wed Feb 15 20:27:17 CET 2006
root at zoe.paolo.maero.net:/usr/obj/usr/src/sys/ZOE_022
ACPI APIC Table: <Intel  N440BX  >
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel Pentium III (696.41-MHz 686-class CPU)
Origin = "GenuineIntel"  Id = 0x683  Stepping = 3
Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE, 
MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory  = 536805376 (511 MB)
avail memory = 516100096 (492 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID:  1
cpu1 (AP): APIC ID:  0
ioapic0 <Version 1.1> irqs 0-23 on motherboard
npx0: [FAST]
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <PTLTD   RSDT> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0xc08-0xc0b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
agp0: <Intel 82443GX host to PCI bridge> mem 0xf8000000-0xfbffffff at  
device 0.0 on pci0
pcib1: <PCI-PCI bridge> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
pcib2: <PCI-PCI bridge> at device 15.0 on pci1
pci2: <PCI bus> on pcib2
atapci0: <Promise PDC20771 SATA300 controller> port 0x3400-0x347f, 
0x3000-0x30ff mem 0xf4220000-0xf4220fff,0xf4200000-0xf421ffff irq 23  
at device 7.0 on pci2
ata2: <ATA channel 0> on atapci0
ata3: <ATA channel 1> on atapci0
ata4: <ATA channel 2> on atapci0
ahc0: <Adaptec aic7896/97 Ultra2 SCSI adapter> port 0x2400-0x24ff mem  
0xf4101000-0xf4101fff irq 19 at device 12.0 on pci0
ahc0: [GIANT-LOCKED]
aic7896/97: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs
ahc1: <Adaptec aic7896/97 Ultra2 SCSI adapter> port 0x2800-0x28ff mem  
0xf4102000-0xf4102fff irq 19 at device 12.1 on pci0
ahc1: [GIANT-LOCKED]
aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs
fxp0: <Intel 82559 Pro/100 Ethernet> port 0x2000-0x203f mem  
0xf4100000-0xf4100fff,0xf4000000-0xf40fffff irq 21 at device 14.0 on  
pci0
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:d0:b7:88:6c:c5
isab0: <PCI-ISA bridge> at device 18.0 on pci0
isa0: <ISA bus> on isab0
atapci1: <Intel PIIX4 UDMA33 controller> port  
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x2040-0x204f at device 18.1 on pci0
ata0: <ATA channel 0> on atapci1
ata1: <ATA channel 1> on atapci1
uhci0: <Intel 82371AB/EB (PIIX4) USB controller> port 0x2060-0x207f  
irq 21 at device 18.2 on pci0
uhci0: [GIANT-LOCKED]
usb0: <Intel 82371AB/EB (PIIX4) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
intpm0: <Intel 82371AB Power management controller> port  
0x1040-0x104f irq 9 at device 18.3 on pci0
intpm0: I/O mapped 1040
intpm0: intr IRQ 9 enabled revision 0
intpm0: [GIANT-LOCKED]
intsmb0: <Intel PIIX4 SMBUS Interface> on intpm0
smbus1: <System Management Bus> on intsmb0
smb0: <SMBus generic I/O> on smbus1
intpm0: PM I/O mapped c00
pci0: <display, VGA> at device 20.0 (no driver attached)
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
fdc0: <floppy drive controller> port 0x3f4-0x3f7 irq 6 drq 2 on acpi0
fdc0: No FDOUT register!
device_attach: fdc0 attach returned 6
ppc0: <ECP parallel printer port> port 0x378-0x37f,0x778-0x77f irq 7  
drq 3 on acpi0
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
ppbus0: <Parallel port bus> on ppc0
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10  
on acpi0
sio0: type 16550A
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
fdc0: <floppy drive controller> port 0x3f4-0x3f7 irq 6 drq 2 on acpi0
fdc0: No FDOUT register!
device_attach: fdc0 attach returned 6
pmtimer0 on isa0
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xcdfff, 
0xce000-0xcf7ff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on  
isa0
ulpt0: EPSON USB Printer, rev 1.10/1.00, addr 2, iclass 7/1
ulpt0: using bi-directional mode
Timecounters tick every 1.000 msec
ad0: 152627MB <WDC WD1600BB-00HTA0 15.05R15> at ata0-master UDMA33
acd0: CDRW <HL-DT-ST GCE-8525B/1.02> at ata1-master UDMA33
ad4: 239372MB <Maxtor 7L250S0 BANC1E00> at ata2-master SATA150
GEOM_MIRROR: Device data1 created (id=618669616).
GEOM_MIRROR: Device data1: provider ad4 detected.
ad6: 239372MB <Maxtor 7L250S0 BANC1E00> at ata3-master SATA150
Waiting 5 seconds for SCSI devices to settle
GEOM_MIRROR: Device data1: provider ad6 detected.
GEOM_MIRROR: Device data1: provider ad6 activated.
GEOM_MIRROR: Device data1: provider ad4 is stale.
GEOM_MIRROR: Device data1: provider mirror/data1 launched.
SMP: AP CPU #1 Launched!
da0 at ahc0 bus 0 target 0 lun 0
da0: <IBM DNES-309170W SAH0> Fixed Direct Access SCSI-3 device
da0: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged  
Queueing Enabled
da0: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)
da1 at ahc0 bus 0 target 1 lun 0
da1: <IBM DNES-309170W SAH0> Fixed Direct Access SCSI-3 device
da1: 80.000MB/s transfers (40.000MHz, offset 30, 16bit), Tagged  
Queueing Enabled
da1: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)
GEOM_MIRROR: Device sgm0 created (id=1390498924).
GEOM_MIRROR: Device sgm0: provider da0 detected.
GEOM_MIRROR: Device sgm0: provider da1 detected.
GEOM_MIRROR: Device sgm0: provider da1 activated.
GEOM_MIRROR: Device sgm0: provider da0 activated.
GEOM_MIRROR: Device sgm0: provider mirror/sgm0 launched.
Trying to mount root from ufs:/dev/mirror/sgm0s1a

Regards

Paolo









More information about the freebsd-stable mailing list