[ATA] and re(4) stability issues

Arnaud Houdelette arnaud.houdelette at tzim.net
Wed Dec 10 04:18:06 PST 2008

Victor Balada Diaz wrote:
> Hello,
> I have several machines[1] at hetzner.de and I've been having problems
> with interrupts on FreeBSD 7.0, and now on FreeBSD 7.1-BETA2, on amd64. I've
> been trying to narrow down the problem so someone more knowledgeable than me
> can fix it. This mail is another attempt, this time with a question
> about the ATA code, to see whether I have found something.
> For those who don't know what happened:
> With FreeBSD 7.0-RELEASE for amd64 and the default kernel,
> the system shared the re0 interrupt with OHCI, and this caused
> re(4) to corrupt packets and create interrupt storms. I tried
> updating to 7.1-BETA2 and still had some problems with it.
> I opened PR kern/128287[2] and Remko quickly answered
> with a workaround: removing USB support from
> my kernel. I did that, and re(4) no longer shared interrupts
> and the interrupt storms were gone. Now, some time later, the interface
> still goes up and down from time to time, but less often. Also, sometimes
> the machine loses the network interface but continues to work.
> I know it continues to work because some days later I can see that
> it tried to deliver the status reports but was unable to resolve the
> aliases' hostnames. I can't ping the machine, and I know the network
> is OK. If I reboot the machine, everything works again.
> When I switched from 7.0 to 7.1-BETA2 I also found that, under load,
> the machine created interrupt storms on the ATA disks after some hours.
> Digging through the Linux source code, I found that it does some special things
> for this chipset that I've been unable to find in our code. This is the
> Linux code for my chipset:
>                 AHCI_HFLAGS     (AHCI_HFLAG_IGN_SERR_INTERNAL |
>                                  AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
>                                  AHCI_HFLAG_SECT255),
> The file and the rest of the code are here[3].
> When I saw AHCI_HFLAG_NO_MSI I tried the easiest thing I could
> think of, switching MSI and MSI-X off for the whole system, so
> I added these tunables to /boot/loader.conf:
> hw.pci.enable_msix="0"
> hw.pci.enable_msi="0"
> And then I rebooted the machine. After several hours of doing almost nothing,
> I found that the machine answered ping but was unable to answer any
> other request (e.g. ssh, nagios nrpe). The machine recovered by itself after
> some minutes, and when I was able to ssh in I saw the following in dmesg:
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1463123158
> and a lot more errors like that. I didn't get these errors with MSI enabled.
> I see WRITE_DMA48, and in the Linux code I saw AHCI_HFLAG_32BIT_ONLY, which is later
> used for DMA-related things. Could someone who is more knowledgeable check
> whether we're doing the right thing?
> I've attached a verbose dmesg from a machine like this one, with
> 7.1-BETA2, MSI enabled and a GENERIC kernel minus USB and FireWire.
> Also, could someone please give me a hand with how to continue debugging
> these interrupt issues? I'm a bit lost, and digging through code and posting each
> time I think I've found something is not going to get anywhere.
> I would also like to say that I've seen reports of this kind of problem
> with amd64 machines on the lists for several years, so I don't think
> this is just a problem with this BIOS/motherboard (MSI K9AG Neo2 Digital).
> Thanks in advance for any help.
> Regards.
> [1]: http://www.hetzner.de/hosting/produkte_rootserver/ds7000/
> [2]: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/128287
> [3]: http://fxr.watson.org/fxr/source/drivers/ata/ahci.c?v=linux-2.6#L369

Sorry, I didn't take the time to read the whole thread, but I had a similar
problem with the same IXP600 chipset.
Only it wasn't with a Realtek NIC (re) but with a Ralink wireless one.
The symptoms were similar: interrupt 22 was shared between the SATA
controller and the wireless card, and I got interrupt storms at random
times when using the wireless network.
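
For anyone who wants to check for this kind of sharing, here is a minimal
sketch that scans text in the usual `vmstat -i` layout and reports IRQ lines
naming more than one device. The function name and the sample text are made
up for illustration (the counters are not real measurements), and it assumes
the last two columns of each line are the total and rate counters.

```python
import re

# Hypothetical sample in the usual `vmstat -i` layout; the numbers are
# invented for illustration and do not come from a real machine.
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                           8          0
irq22: atapci0 ral0               913456        211
cpu0: timer                     17280000       2000
Total                           18193464       2211
"""

def shared_irqs(vmstat_output):
    """Return {irq_number: [device, ...]} for IRQ lines that name more
    than one device (or carry a trailing '+' on a device name)."""
    shared = {}
    for line in vmstat_output.splitlines():
        m = re.match(r"irq(\d+):\s+(.*)", line)
        if not m:
            continue  # header, cpu timer lines, Total
        fields = m.group(2).split()
        # Assumption: the last two fields are the total and rate counters,
        # everything before them is a device name.
        names = fields[:-2]
        devices = [n.rstrip("+") for n in names]
        if len(devices) > 1 or any(n.endswith("+") for n in names):
            shared[int(m.group(1))] = devices
    return shared

if __name__ == "__main__":
    for irq, devs in sorted(shared_irqs(SAMPLE).items()):
        print("irq%d is shared by: %s" % (irq, ", ".join(devs)))
```

On a real machine the text would come from running `vmstat -i` instead of the
sample string; a shared line only shows that the devices sit on the same
interrupt, not which driver is at fault.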

No problem since I removed the ral(4) NIC (got a real access point now).
You might not want to point the finger at the re(4) driver too fast.

Arnaud Houdelette
