cx_lowest and CPU usage

Tue Feb 19 20:09:07 UTC 2008

More news about the issue. Finally, I stopped being blind and tried to
use KTR.
Graphs produces by schedgraph confirmed my suspicion that system spends
most if its time in idle thread (sleeping).
But, but, RTC interrupt hits with high precision into the relatively
very short intervals when swi4 runs.
It seems that relatively long transition times of C2 on this hardware
cause this "synchronization" but I am not sure exactly how. It is even
harder to say how long does it take the hardware (chipset+cpu) to wake
from that sleep on interrupt.
One interesting thing that I immediately noticed is that having DUMMYNET
in my kernel caused a huge difference.
A typical KTR sequence looks like this:
1. we are in idle thread (apparently sleeping in C2)
2. clock (IRQ0) interrupt handler is executed (apparently after CPU is
waken and idle thread enabled interrupts after post-C2 processing)
3. hardclock() checks if there is any timeout to be processed by
softclock, and thanks to dummynet, we have its callout scheduled each tick
4. hardclock() schedules swi4 to run

I am fuzzy about the following step, I see traces, but I don't know the
code. So I write my guess and knowledgeable people can correct me.
5. near the end of interrupt handler routine we have critical_exit()
call, it can call mi_switch() if current thread needs to be preempted by
a different thread; in this case interrupt code is executed in context
of the system idle thread and the idle thread is "owing preemption" to
swi4 (see step4)
6. so apparently mi_switch happens in the idle thread and execution
jumps to swi4 thread, which now exits from mi_switch() in soft interrupt
loop routine
7. so far, starting from step 2, execution is with disabled interrupts;
now swi4 unlocks spin mutex lock and that enables interrupts (see the
previous message quoted below)
8. "immediately", RTC (IRQ8) interrupt handler is executed, apparently
it has been pending for some time now
9. statclock is called and because we are on a swi thread this rtc tick
is charged to "interrupts"

So, here we are. But I am still not sure why the above happens so
"deterministically" - RTC interrupt can happen anywhere between 2 clock
interrupts. So we have 1000us range for RTC and 90us of C2 exit delay,
that shouldn't be enough to always put IRQ0 and IRQ8 "together".

Anyway, I excluded DUMMYNET from kernel config, but kept SCHED_ULE and
HZ=1000. Now with C2 enabled I get 15-20% of (reported!) interrupt load
on completely idle system. That is still a lot, but is much less than
99% I had before :-)

Now, without DUMMYNET, I will get new KTR dumps for cx_lowest=C1 and
cx_lowest=C2. I hope that this much cleaner experiment would give more
insights.

But even now we can be sure that our cpu statistics code is not always
adequate. This is because IRQ8 is not processed like some
super-interrupt that looks on a system from above. It is a regular
participant of the system, it competes with IRQ0 and is affected by
scheduler's decisions. Especially those made while in interrupt handler.

on 17/02/2008 17:10 Andriy Gapon said the following:
> on 14/02/2008 23:31 Andriy Gapon said the following:
>> I ran a series of tests, repeating each twice to be sure that I didn't
>> make any mistake.
>> All tests were performed in single-user mode, so the system was as idle
>> as possible, top reported idle as 99.N% - 100%.
>> Then I set hw.acpi.cpu.cx_lowest=C2 and ran top again.
>> Here's the results:
>> GENERIC, SCHED_4BSD, default HZ=1000 ==> C2-interrupt 11-14% (!!)
>> GENERIC, SCHED_4BSD, kern.hz="100" ==> C2-interrupt 99-100%
>> customized kernel, SCHED_ULE, default HZ=1000 ==> C2-interrupt 99-100%
>> customized kernel, SCHED_ULE, kern.hz="100" ==> C2-interrupt 99-100%
> 
> I did some testing and here are some additional data.
> Running with sched_ule and hz=1000 I measured that mean time spent in C2
> state (from starting to go into it, till completely returning from it)
> is ~860us (end_time - start_time, no overhead adjustments).
> Additionally, 98.1% of sleeps lasted 800-900us, 1.8% lasted 700-800us,
> the rest is spread almost uniformly over the remaining intervals of
> range 0-1000%.
> 
> Here's a typical backtrace for statclock execution:
> #9  0xc04cce26 in statclock (usermode=0) at
> /system/src/sys/kern/kern_clock.c:447
> #10 0xc06b5f0c in rtcintr (frame=0xcfb30c78) at
> /system/src/sys/i386/isa/clock.c:236
> #11 0xc06a5275 in intr_execute_handlers (isrc=0xc0736900,
> frame=0xcfb30c78) at /system/src/sys/i386/i386/intr_machdep.c:362
> #12 0xc06b558c in atpic_handle_intr (vector=8, frame=0xcfb30c78) at
> /system/src/sys/i386/isa/atpic.c:596
> #13 0xc06a0d71 in Xatpic_intr8 () at atpic_vector.s:70
> #14 0xc06a70c9 in spinlock_exit () at cpufunc.h:365
> #15 0xc04e6993 in ithread_loop (arg=0xc22f5980) at
> /system/src/sys/kern/kern_intr.c:1137
> #16 0xc04e32c1 in fork_exit (callout=0xc04e6640 <ithread_loop>,
> arg=0xc22f5980, frame=0xcfb30d38) at /system/src/sys/kern/kern_fork.c:754
> #17 0xc06a0bc0 in fork_trampoline () at
> /system/src/sys/i386/i386/exception.s:205
> 
> I.e. swi4 kernel thread is "caught" by IRQ8 just when it exits from
> thread_unlock() in the following snippet:
> 1132                    if (!ithd->it_need && !(ithd->it_flags & IT_DEAD)) {
> 1133                            TD_SET_IWAIT(td);
> 1134                            ie->ie_count = 0;
> 1135                            mi_switch(SW_VOL, NULL);
> 1136                    }
> 1137                    thread_unlock(td);
> 
> This happens almost deterministically.
> 

-- 
Andriy Gapon