cx_lowest and CPU usage

Andriy Gapon avg at
Wed Feb 20 18:34:01 UTC 2008

on 19/02/2008 23:42 Andriy Gapon said the following:
> The last result most probably means that RTC IRQ was not the interrupt
> to wake CPU from sleeping state.
> The first possibility that comes to mind is that on this particular
> hardware RTC interrupt (IRQ8) is not able to wake the system from C2 state.

So it seems that this was true.
Here's a shortcut to the relevant info:
PIIX4E (FW82371EB) specification
DEVACTB — DEVICE ACTIVITY B (FUNCTION 3) pci register description
BRLD_EN_IRQ8, bit 5

$ pciconf -r pci0:0:7:3 0x58

> Is this possible?
> Are there known instances of such breakage?
> Are there other possible explanations?
> on 19/02/2008 22:09 Andriy Gapon said the following:
>> More news about the issue. Finally, I stopped being blind and tried to
>> use KTR.
>> Graphs produces by schedgraph confirmed my suspicion that system spends
>> most if its time in idle thread (sleeping).
>> But, but, RTC interrupt hits with high precision into the relatively
>> very short intervals when swi4 runs.
>> It seems that relatively long transition times of C2 on this hardware
>> cause this "synchronization" but I am not sure exactly how. It is even
>> harder to say how long does it take the hardware (chipset+cpu) to wake
>> from that sleep on interrupt.
>> One interesting thing that I immediately noticed is that having DUMMYNET
>> in my kernel caused a huge difference.
>> A typical KTR sequence looks like this:
>> 1. we are in idle thread (apparently sleeping in C2)
>> 2. clock (IRQ0) interrupt handler is executed (apparently after CPU is
>> waken and idle thread enabled interrupts after post-C2 processing)
>> 3. hardclock() checks if there is any timeout to be processed by
>> softclock, and thanks to dummynet, we have its callout scheduled each tick
>> 4. hardclock() schedules swi4 to run
>> I am fuzzy about the following step, I see traces, but I don't know the
>> code. So I write my guess and knowledgeable people can correct me.
>> 5. near the end of interrupt handler routine we have critical_exit()
>> call, it can call mi_switch() if current thread needs to be preempted by
>> a different thread; in this case interrupt code is executed in context
>> of the system idle thread and the idle thread is "owing preemption" to
>> swi4 (see step4)
>> 6. so apparently mi_switch happens in the idle thread and execution
>> jumps to swi4 thread, which now exits from mi_switch() in soft interrupt
>> loop routine
>> 7. so far, starting from step 2, execution is with disabled interrupts;
>> now swi4 unlocks spin mutex lock and that enables interrupts (see the
>> previous message quoted below)
>> 8. "immediately", RTC (IRQ8) interrupt handler is executed, apparently
>> it has been pending for some time now
>> 9. statclock is called and because we are on a swi thread this rtc tick
>> is charged to "interrupts"
>> So, here we are. But I am still not sure why the above happens so
>> "deterministically" - RTC interrupt can happen anywhere between 2 clock
>> interrupts. So we have 1000us range for RTC and 90us of C2 exit delay,
>> that shouldn't be enough to always put IRQ0 and IRQ8 "together".
>> Anyway, I excluded DUMMYNET from kernel config, but kept SCHED_ULE and
>> HZ=1000. Now with C2 enabled I get 15-20% of (reported!) interrupt load
>> on completely idle system. That is still a lot, but is much less than
>> 99% I had before :-)
>> Now, without DUMMYNET, I will get new KTR dumps for cx_lowest=C1 and
>> cx_lowest=C2. I hope that this much cleaner experiment would give more
>> insights.
>> But even now we can be sure that our cpu statistics code is not always
>> adequate. This is because IRQ8 is not processed like some
>> super-interrupt that looks on a system from above. It is a regular
>> participant of the system, it competes with IRQ0 and is affected by
>> scheduler's decisions. Especially those made while in interrupt handler.
>> on 17/02/2008 17:10 Andriy Gapon said the following:
>>> on 14/02/2008 23:31 Andriy Gapon said the following:
>>>> I ran a series of tests, repeating each twice to be sure that I didn't
>>>> make any mistake.
>>>> All tests were performed in single-user mode, so the system was as idle
>>>> as possible, top reported idle as 99.N% - 100%.
>>>> Then I set hw.acpi.cpu.cx_lowest=C2 and ran top again.
>>>> Here's the results:
>>>> GENERIC, SCHED_4BSD, default HZ=1000 ==> C2-interrupt 11-14% (!!)
>>>> GENERIC, SCHED_4BSD, kern.hz="100" ==> C2-interrupt 99-100%
>>>> customized kernel, SCHED_ULE, default HZ=1000 ==> C2-interrupt 99-100%
>>>> customized kernel, SCHED_ULE, kern.hz="100" ==> C2-interrupt 99-100%
>>> I did some testing and here are some additional data.
>>> Running with sched_ule and hz=1000 I measured that mean time spent in C2
>>> state (from starting to go into it, till completely returning from it)
>>> is ~860us (end_time - start_time, no overhead adjustments).
>>> Additionally, 98.1% of sleeps lasted 800-900us, 1.8% lasted 700-800us,
>>> the rest is spread almost uniformly over the remaining intervals of
>>> range 0-1000%.
>>> Here's a typical backtrace for statclock execution:
>>> #9  0xc04cce26 in statclock (usermode=0) at
>>> /system/src/sys/kern/kern_clock.c:447
>>> #10 0xc06b5f0c in rtcintr (frame=0xcfb30c78) at
>>> /system/src/sys/i386/isa/clock.c:236
>>> #11 0xc06a5275 in intr_execute_handlers (isrc=0xc0736900,
>>> frame=0xcfb30c78) at /system/src/sys/i386/i386/intr_machdep.c:362
>>> #12 0xc06b558c in atpic_handle_intr (vector=8, frame=0xcfb30c78) at
>>> /system/src/sys/i386/isa/atpic.c:596
>>> #13 0xc06a0d71 in Xatpic_intr8 () at atpic_vector.s:70
>>> #14 0xc06a70c9 in spinlock_exit () at cpufunc.h:365
>>> #15 0xc04e6993 in ithread_loop (arg=0xc22f5980) at
>>> /system/src/sys/kern/kern_intr.c:1137
>>> #16 0xc04e32c1 in fork_exit (callout=0xc04e6640 <ithread_loop>,
>>> arg=0xc22f5980, frame=0xcfb30d38) at /system/src/sys/kern/kern_fork.c:754
>>> #17 0xc06a0bc0 in fork_trampoline () at
>>> /system/src/sys/i386/i386/exception.s:205
>>> I.e. swi4 kernel thread is "caught" by IRQ8 just when it exits from
>>> thread_unlock() in the following snippet:
>>> 1132                    if (!ithd->it_need && !(ithd->it_flags & IT_DEAD)) {
>>> 1133                            TD_SET_IWAIT(td);
>>> 1134                            ie->ie_count = 0;
>>> 1135                            mi_switch(SW_VOL, NULL);
>>> 1136                    }
>>> 1137                    thread_unlock(td);
>>> This happens almost deterministically.

Andriy Gapon

More information about the freebsd-acpi mailing list