New interrupt stuff breaks ASUS 2 CPU system

Harti Brandt brandt at fokus.fraunhofer.de
Wed Nov 5 04:07:49 PST 2003


On Wed, 5 Nov 2003, Harti Brandt wrote:

HB>On Tue, 4 Nov 2003, John Baldwin wrote:
HB>
HB>JB>
HB>JB>On 04-Nov-2003 Harti Brandt wrote:
HB>JB>> On Tue, 4 Nov 2003, Harti Brandt wrote:
HB>JB>>
HB>JB>> HB>On Tue, 4 Nov 2003, John Baldwin wrote:
HB>JB>> HB>
HB>JB>> HB>JB>
HB>JB>> HB>JB>On 04-Nov-2003 Harti Brandt wrote:
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> Hi,
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> I have an ASUS system with 2 CPUs that I need to run at HZ=10000. This
HB>JB>> HB>JB>> worked until yesterday, but with the new interrupt code it doesn't boot
HB>JB>> HB>JB>> anymore. It works for the standard HZ, but if I set HZ=1000 I get a double
HB>JB>> HB>JB>> fault. I suspect a race condition in the interrupt handling. My config
HB>JB>> HB>JB>> file has
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> options SMP
HB>JB>> HB>JB>> device apic
HB>JB>> HB>JB>> options HZ=1000
HB>JB>> HB>JB>
HB>JB>> HB>JB>Ok, I can try to reproduce.
HB>JB>> HB>JB>
HB>JB>> HB>JB>> Device configuration finished.
HB>JB>> HB>JB>> Timecounter "TSC" frequency 1380009492 Hz quality -100
HB>JB>> HB>JB>> Timecounters cpuid = 0; apic id = 00
HB>JB>> HB>JB>> instruction pointer   = 0x8:0xc048995d
HB>JB>> HB>JB>> stack pointer         = 0x10:0xc0821bf4
HB>JB>> HB>JB>> frame pointer        cpuid = 0; apic id = 00
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> 0xc048995d is in critical_exit. It is the jmp after the popf from
HB>JB>> HB>JB>> cpu_critical_exit.
HB>JB>> HB>JB>
HB>JB>> HB>JB>This is where interrupts are re-enabled, so you are getting an interrupt.
HB>JB>> HB>JB>It might be helpful to figure what type of fault you are actually getting.
HB>JB>> HB>
HB>JB>> HB>tf_err is 0, tf_trapno is 30 (decimal).
HB>JB>>
HB>JB>> More information:
HB>JB>>
HB>JB>> I have replaced all the reserved vectors with individual ones that set
HB>JB>> tf_err to the index (vector number). It appears that the vector number
HB>JB>> is 39 decimal. What does that mean?
HB>JB>
HB>JB>IRQ 7.
HB>JB>Can you post a verbose dmesg?  Also, can you try both with and without
HB>JB>ACPI?
HB>
HB>Attached are both dmesgs.
HB>
HB>More datapoints:
HB>
HB>I had the parallel port (irq7) and the second sio disabled in the BIOS.
HB>After enabling both I now get a panic in lapic_handle_intr: Couldn't get
HB>vector from ISR! After fetching the relevant docs from Intel I checked the
HB>registers of the APIC pointed to by lapic. The interrupt taken is
HB>Xapic_irq1. isr1 is zero, but irr1 is 0x100 (that was without ACPI). How
HB>can that happen? As I understand it, the ISR holds the interrupts that
HB>have been delivered to the CPU, so if the CPU is interrupted a bit should
HB>be set, correct?
HB>
HB>I then replaced the panic with a printf() followed by a return. Now the
HB>system comes to life, but I get a couple of these warnings. When the
HB>system is idle everything seems fine, but when I start my simulation
HB>application (which normally generates between 20k and 250k interrupts/sec
HB>depending on the MPSAFE setting of the ATM drivers) I get approx 1-2 of
HB>these messages per second (this is with HZ=1000).
HB>
HB>A question while reading the code: what does the global lapic variable
HB>refer to? As I understand it, every CPU has its own local APIC. Does lapic
HB>point to one of those two? If so, which one?

An additional point. In the above test where I got 1-2 messages per second
I have now disabled a debugging printout in the ATM driver that gave 3-4
messages per second (from the interrupt handler). Now the 'Couldn't
get...' messages have disappeared. So this really looks like a race
somewhere. Is it possible that the bit in the ISR somehow gets cleared
after the interrupt is handed to the processor but before Xapic_irq1
actually runs and sees that bit? Perhaps by another Xapic_irq1 instance
or whatever?
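
For reference, here is a minimal sketch of how I read the register layout
in the Intel docs: the ISR and IRR are each split into eight 32-bit
registers, and bit n of register k stands for vector 32*k + n. The helper
names below are hypothetical and not taken from the FreeBSD sources, and I
am assuming that isr1/irr1 mean register index 1 (vectors 32-63).

```c
#include <stdint.h>

/* Each 32-bit ISR/IRR register covers 32 consecutive vectors. */
#define VECTORS_PER_REG	32

/*
 * Hypothetical helper: return the vector number of the lowest bit set
 * in ISR/IRR register number `reg`, or -1 if no bit is set.
 */
static int
reg_to_vector(unsigned reg, uint32_t value)
{
	unsigned bit;

	for (bit = 0; bit < VECTORS_PER_REG; bit++)
		if (value & ((uint32_t)1 << bit))
			return ((int)(reg * VECTORS_PER_REG + bit));
	return (-1);
}

/*
 * Hypothetical helper: for a given vector, compute which ISR/IRR
 * register holds its bit and the mask of that bit.
 */
static void
vector_to_reg(unsigned vector, unsigned *reg, uint32_t *mask)
{
	*reg = vector / VECTORS_PER_REG;
	*mask = (uint32_t)1 << (vector % VECTORS_PER_REG);
}
```

If that reading is right, irr1 == 0x100 decodes to vector 40, while the
vector 39 from the earlier trap would be bit 7 (mask 0x80) of register 1.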

harti
-- 
harti brandt,
http://www.fokus.fraunhofer.de/research/cc/cats/employees/hartmut.brandt/private
brandt at fokus.fraunhofer.de, harti at freebsd.org

