ICH7 SATA and em interrupt sharing

Pyun YongHyeon pyunyh at gmail.com
Tue Aug 22 01:51:08 UTC 2006


On Mon, Aug 21, 2006 at 09:52:02PM +0200, Patrick M. Hausen wrote:
 > And yet more testing ...
 > 
 > I rebuilt my kernel without USB devices and made sure
 > atapci1 doesn't share an interrupt with anything:
 > 
 > pcib1: 16
 > pcib2: 20
 > em0: 16
 > em1: 17
 > fxp0: 16
 > atapci1: 19
 > atkbdc0: 1
 > atkbd0: 1
 > sio0: 4
 > sio1: 3
 > ppc0: 7
 > 
 > Side note: on this particular box I had to leave the USB devices
 > enabled in the BIOS setup, otherwise em0 would end up on the same
 > interrupt as atapci1 |-)
 > 
 > Then I ran make buildworld and in parallel started to transfer a large
 > file via FTP (done by fetching a sparse file of 10 GB) maxing out
 > our 100 Mbit/s LAN.
 > 
 > *boom* - or so I thought ;-) The ssh session was stuck, the system did
 > not respond to ICMP echo. OK, wait until tomorrow morning to reset it ...
 > ... just gave it one more ping an hour later, and the machine was
 > alive again! It did not panic/reboot, the buildworld was running and
 > the file transfer was transferring a file.
 > 
 > In /var/log/messages I found:
 > 
 > Aug 21 21:37:08 tomcat kernel: em0: Missing Tx completion interrupt!
 > Aug 21 21:39:55 tomcat kernel: em0: Missing Tx completion interrupt!
 > Aug 21 21:40:29 tomcat kernel: em0: Missing Tx completion interrupt!
 > 
 > Seems like for some reason the network card blocked for a couple
 > of minutes, then resumed.
 > 
 > This was all with debug.mpsafenet set to 1. Now I'm running the same
 > stress test with debug.mpsafenet set to 0 and I haven't seen any
 > problem/hang at all.
 > 
 > Wait a minute ... now as I'm typing this message, ssh to the
 > box hangs again. Damn.
 > 
 > I think I'll try the fxp interface for production use and disable the
 > onboard Gigabit NICs.
 > 
 > Now the ssh session is responding again while the file transfer reports
 > "Connection reset by peer".
 > 
 > Dmesg shows:
 > 
 > em0: Missing Tx completion interrupt!
 > em0: Missing Tx completion interrupt!
 > em0: Missing Tx completion interrupt!
 > em0: Missing Tx completion interrupt!
 > em0: Missing Tx completion interrupt!
 > em0: Missing Tx completion interrupt!
 > 

Thanks for the testing.
The above message means the patch really worked. Otherwise you
would have seen (false) watchdog errors on your system.
I guess the two possible causes of the missing Tx completion
interrupts are a chipset bug or the Tx interrupt moderation
mechanism. If Tx interrupt moderation is the cause of the false
watchdog triggering, we will have to fix all device drivers that
have Tx interrupt moderation capability. I'll have to check the
archives for bge(4). I'll commit the em(4) patch soon.
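
To illustrate the difference between a false watchdog error and a
missed Tx completion interrupt, here is a minimal sketch of the
general idea. It is not the actual em(4) patch: the ring layout and
the names tx_ring, tx_reclaim, tx_enqueue and tx_watchdog are
hypothetical and exist only for this example. The assumption is that
the NIC sets a per-descriptor "done" flag whether or not the
interrupt is delivered, so the watchdog can reclaim descriptors
itself before deciding that the hardware is hung.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_RING_SIZE 8

/* Hypothetical software view of a Tx ring; not the real em(4) softc. */
struct tx_ring {
    bool   done[TX_RING_SIZE]; /* "descriptor done" flag set by the NIC */
    size_t next_to_clean;      /* oldest descriptor not yet reclaimed */
    size_t next_to_use;        /* next free slot for the driver */
    size_t in_flight;          /* handed to the NIC, not yet reclaimed */
};

enum wd_result { WD_IDLE, WD_MISSED_INTR, WD_HUNG };

/* Hand one frame to the NIC; returns false if the ring is full. */
static bool tx_enqueue(struct tx_ring *r)
{
    if (r->in_flight == TX_RING_SIZE)
        return false;
    r->done[r->next_to_use] = false;
    r->next_to_use = (r->next_to_use + 1) % TX_RING_SIZE;
    r->in_flight++;
    return true;
}

/* Reclaim every descriptor the NIC has marked done; returns the count. */
static size_t tx_reclaim(struct tx_ring *r)
{
    size_t n = 0;

    while (r->in_flight > 0 && r->done[r->next_to_clean]) {
        r->done[r->next_to_clean] = false;
        r->next_to_clean = (r->next_to_clean + 1) % TX_RING_SIZE;
        r->in_flight--;
        n++;
    }
    return n;
}

/*
 * Called when the Tx watchdog timer expires. Before declaring a hang,
 * reclaim completed descriptors by hand. If nothing is pending after
 * that, the frames did go out and only the interrupt was missing.
 */
static enum wd_result tx_watchdog(struct tx_ring *r)
{
    assert(r != NULL);

    if (r->in_flight == 0)
        return WD_IDLE;          /* nothing was pending */
    (void)tx_reclaim(r);
    if (r->in_flight == 0)
        return WD_MISSED_INTR;   /* log a message, no reset needed */
    return WD_HUNG;              /* real timeout: reset the chip */
}
```

In this sketch a driver would print a "Missing Tx completion
interrupt!" style message on WD_MISSED_INTR and reinitialize the
hardware only on WD_HUNG.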

What you see in the ssh session and the lack of response to ICMP
echo requests indicate other issues. I can't be sure, but they may
not be related to the network drivers at all (e.g. sharing an
interrupt with other devices).
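
As a quick way to see which devices share an interrupt, the small
sketch below tabulates the IRQ list quoted earlier in this message
and counts the other devices on the same line as a given device. The
table contents come from that list; the names irq_entry, irq_table
and irq_sharers are made up for this example.

```c
#include <stddef.h>
#include <string.h>

struct irq_entry {
    const char *dev;
    int         irq;
};

/* IRQ assignments as quoted in the report above. */
static const struct irq_entry irq_table[] = {
    { "pcib1",   16 }, { "pcib2",   20 }, { "em0",    16 },
    { "em1",     17 }, { "fxp0",    16 }, { "atapci1", 19 },
    { "atkbdc0",  1 }, { "atkbd0",   1 }, { "sio0",    4 },
    { "sio1",     3 }, { "ppc0",     7 },
};

#define IRQ_TABLE_LEN (sizeof(irq_table) / sizeof(irq_table[0]))

/*
 * Return the number of other devices on the same IRQ as dev,
 * or -1 if dev is not in the table.
 */
static int irq_sharers(const char *dev)
{
    int irq = -1;
    int count = 0;
    size_t i;

    for (i = 0; i < IRQ_TABLE_LEN; i++) {
        if (strcmp(irq_table[i].dev, dev) == 0) {
            irq = irq_table[i].irq;
            break;
        }
    }
    if (irq < 0)
        return -1;

    for (i = 0; i < IRQ_TABLE_LEN; i++) {
        if (irq_table[i].irq == irq &&
            strcmp(irq_table[i].dev, dev) != 0)
            count++;
    }
    return count;
}
```

With that list, atapci1 has no sharers, while em0 is on IRQ 16
together with pcib1 and fxp0.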

 > I'm still not able to really reproduce the SATA problem others are
 > reporting, besides forcing em0 to share its interrupt with the
 > SATA controller. This can easily be avoided - at least with our
 > hardware.
 > 
 > 
 > Regards,
 > 
 > Patrick M. Hausen
 > Leiter Netzwerke und Sicherheit
 > -- 
 > punkt.de GmbH         Internet - Dienstleistungen - Beratung
 > Vorholzstr. 25        Tel. 0721 9109 -0 Fax: -100
 > 76137 Karlsruhe       http://punkt.de

-- 
Regards,
Pyun YongHyeon
