ATA problems again ... general problem of ICH7 or ATA?

Miroslav Lachman 000.fbsd at quip.cz
Mon Aug 21 19:30:58 UTC 2006


Matt Dawson wrote:

> On Monday 21 August 2006 13:00, freebsd-stable-request at freebsd.org wrote:
> 
>>>I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64
>>>based system running 6.1-RELEASE-p3 (i386). ad6 just detaches without
>>>warning and it takes a reboot to bring it back. atacontrol reinit has no
>>>effect. Tried the following to resolve the problems:
>>
>>I don't know what is supposed to be the canonical way to
>>reattach a disconnected SATA drive, but while testing our
>>new hardware and hot-pulling a drive while the system
>>was running, atacontrol reinit didn't find the reinserted drive
>>here, either.
>>
>>atacontrol detach ata3; atacontrol attach ata3 did.
> 
> 
> Yes, that is the method for a controlled remove and reattach, a la hotplug 
> SATA. AIUI, though, if the drive goes AWOL on its own you need to reinit the 
> channel before issuing an atacontrol attach foo. In theory... (man 8 
> atacontrol) In practice, the drive disappears, never to be probed again. A 
> warm reboot without power down makes it appear again, so the drive itself 
> isn't confused.

It is the same in my case.
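
For reference, the detach/attach sequence quoted above could be wrapped 
roughly like this. It is only a sketch: the channel name ata3 comes from 
the quoted example, and the function name and the DRY_RUN switch are my 
own inventions for illustration. It uses only the atacontrol(8) 
subcommands already mentioned in this thread plus "list". Note that in 
both of our cases this did not bring back a drive that dropped off on 
its own.

```shell
#!/bin/sh
# Sketch: detach and re-attach one ATA channel, then list devices.
# Assumption: the channel name is passed as $1 (e.g. ata3, as in the
# quoted example). reattach_channel and DRY_RUN are hypothetical helpers,
# not part of atacontrol(8). With DRY_RUN=1 the commands are only printed.

reattach_channel() {
    chan="$1"
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    # Controlled remove of the channel, then probe it again.
    $run atacontrol detach "$chan" || return 1
    $run atacontrol attach "$chan" || return 1

    # Show what the kernel sees on all channels afterwards.
    $run atacontrol list
}
```

Running it with DRY_RUN=1 first (DRY_RUN=1; reattach_channel ata3) shows 
the commands without touching the controller.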

> FWIW, the problem takes *far* longer to rear its head when the SATA controller 
> has a PCI INT and IRQ to itself. Put a NIC onto a shared slot (a very Bad 
> Thing [TM] as the BIOS simply maps the INT to a single IRQ and both devices 
> end up sharing it. Now transfer a large file over the network and watch the 
> ensuing hilarity) and it happens at least every couple of days. Now, with the 
> slot shared with the SATA controller empty, I have six days uptime since the 
> last event, which means I'm probably due one any time now. 

I thought so too, but it did not solve my problem. I had UHCI sharing the 
same IRQ with SATA (both on irq 19). Instead of playing with device.hints(5), 
I disabled all unused peripherals in the BIOS (USB ports, LPT port, FDD...). 
After a few days, the system reported another lost disk.

> At least gmirror rebuilds the array after a simple reboot, but I would expect 
> the dd operation to throw a wobbly if it's a timing issue/fight for interrupt 
> between the two drives/channels. It doesn't, which makes me wonder if I'm 
> barking up the wrong tree, but I can't help noticing that SATA channels have 
> one interrupt between them whereas PATA channels have one each and all of 
> these reports are from SATA users...

Maybe you are right; I have not seen any report from a single-disk machine. 
All the problems come from machines with 2 or more SATA disks.

> I wonder what pciconf -lv shows on Miroslav's system? Is the SATA controller 
> sharing an INT/IRQ with something else? Does moving that device to another 
> slot alleviate the problem at all?

SATA is no longer sharing an IRQ, but the problem persists.

system dmesg after verbose boot
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_dmesg_verbose_2006-08-21.txt
pciconf -lv
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_pciconf_2006-08-21.txt

The problem appeared only under heavy disk load (e.g. copying the ports 
tree). I have a third system with minimal disk load that has been running 
for 10 days without problems (FreeBSD 6.0, in production for those 10 days; 
the machine is a "quick replacement" for a failed server, with the system 
mirrored from the old disks to the new ones by dump & restore).

> Please note that Miroslav and I are using totally different drives, chipsets 
> and processors. He's using, IIRC, an Intel chip with an ICH7 southbridge and 
> Samsung drives. I'm using an AMD Athlon 64 Newcastle (running the i386 port) 
> on a ULi M1689 chipset with WD RE2 drives so, although I'd be more than happy 
> to be the numpty that is wrong and to have ata(4) vindicated by someone else, 
> I suspect it is ata(4) that is the problem. However, finger pointing isn't 
> productive and is certainly not fair given that ata(4) has been progressing 
> so well. Anything else I can try to nail this irksome beast? Any suggestions 
> for where I've been an idiot (easy, tiger!) and missed something obvious?
> 
> BTW, this is a production server (DLT backed up nightly, so the data is safe) 
> so I can't just pull it to bits. I do have an identical (CPU/mobo) box in the 
> workshop as a workstation, however, which I could buy/borrow another drive 
> for and set up gmirror to try things out.

