cx_lowest and CPU usage

Tue Feb 19 21:42:59 UTC 2008

Here are two traces, one for completely idle system (single-user) with
C2 disabled and the other with C2 enabled:
http://www.icyb.net.ua/~avg/c1.ktr.gz
http://www.icyb.net.ua/~avg/c2.ktr.gz
Traces are produced with KTR_ALL and taken for approx. 3 seconds
(sysctl; sleep 3; sysctl).

And a small untidy quick-hack perl script that calculates time
difference (in microseconds) between occurrence of a clock interrupt
(rtc or clk) and the previous trace line:
http://www.icyb.net.ua/~avg/ktr_pl.txt

Some interesting results:
C1, rtc: values all over the range ("uniformly"), from 1us to 989us;
C1, clk: also various values from 2us to 995us, with most values
("naturally") being at the upper end > 900us.

C2, clk: handful of values in the lower and middle range, vast majority
is higher than 900us
C2, rtc: all values are less than 4us (!!!)

The last result most probably means that RTC IRQ was not the interrupt
to wake CPU from sleeping state.
The first possibility that comes to mind is that on this particular
hardware RTC interrupt (IRQ8) is not able to wake the system from C2 state.
Is this possible?
Are there known instances of such breakage?
Are there other possible explanations?

on 19/02/2008 22:09 Andriy Gapon said the following:
> More news about the issue. Finally, I stopped being blind and tried to
> use KTR.
> Graphs produces by schedgraph confirmed my suspicion that system spends
> most if its time in idle thread (sleeping).
> But, but, RTC interrupt hits with high precision into the relatively
> very short intervals when swi4 runs.
> It seems that relatively long transition times of C2 on this hardware
> cause this "synchronization" but I am not sure exactly how. It is even
> harder to say how long does it take the hardware (chipset+cpu) to wake
> from that sleep on interrupt.
> One interesting thing that I immediately noticed is that having DUMMYNET
> in my kernel caused a huge difference.
> A typical KTR sequence looks like this:
> 1. we are in idle thread (apparently sleeping in C2)
> 2. clock (IRQ0) interrupt handler is executed (apparently after CPU is
> waken and idle thread enabled interrupts after post-C2 processing)
> 3. hardclock() checks if there is any timeout to be processed by
> softclock, and thanks to dummynet, we have its callout scheduled each tick
> 4. hardclock() schedules swi4 to run
> 
> I am fuzzy about the following step, I see traces, but I don't know the
> code. So I write my guess and knowledgeable people can correct me.
> 5. near the end of interrupt handler routine we have critical_exit()
> call, it can call mi_switch() if current thread needs to be preempted by
> a different thread; in this case interrupt code is executed in context
> of the system idle thread and the idle thread is "owing preemption" to
> swi4 (see step4)
> 6. so apparently mi_switch happens in the idle thread and execution
> jumps to swi4 thread, which now exits from mi_switch() in soft interrupt
> loop routine
> 7. so far, starting from step 2, execution is with disabled interrupts;
> now swi4 unlocks spin mutex lock and that enables interrupts (see the
> previous message quoted below)
> 8. "immediately", RTC (IRQ8) interrupt handler is executed, apparently
> it has been pending for some time now
> 9. statclock is called and because we are on a swi thread this rtc tick
> is charged to "interrupts"
> 
> So, here we are. But I am still not sure why the above happens so
> "deterministically" - RTC interrupt can happen anywhere between 2 clock
> interrupts. So we have 1000us range for RTC and 90us of C2 exit delay,
> that shouldn't be enough to always put IRQ0 and IRQ8 "together".
> 
> Anyway, I excluded DUMMYNET from kernel config, but kept SCHED_ULE and
> HZ=1000. Now with C2 enabled I get 15-20% of (reported!) interrupt load
> on completely idle system. That is still a lot, but is much less than
> 99% I had before :-)
> 
> Now, without DUMMYNET, I will get new KTR dumps for cx_lowest=C1 and
> cx_lowest=C2. I hope that this much cleaner experiment would give more
> insights.
> 
> But even now we can be sure that our cpu statistics code is not always
> adequate. This is because IRQ8 is not processed like some
> super-interrupt that looks on a system from above. It is a regular
> participant of the system, it competes with IRQ0 and is affected by
> scheduler's decisions. Especially those made while in interrupt handler.
> 
> 
> 
> on 17/02/2008 17:10 Andriy Gapon said the following:
>> on 14/02/2008 23:31 Andriy Gapon said the following:
>>> I ran a series of tests, repeating each twice to be sure that I didn't
>>> make any mistake.
>>> All tests were performed in single-user mode, so the system was as idle
>>> as possible, top reported idle as 99.N% - 100%.
>>> Then I set hw.acpi.cpu.cx_lowest=C2 and ran top again.
>>> Here's the results:
>>> GENERIC, SCHED_4BSD, default HZ=1000 ==> C2-interrupt 11-14% (!!)
>>> GENERIC, SCHED_4BSD, kern.hz="100" ==> C2-interrupt 99-100%
>>> customized kernel, SCHED_ULE, default HZ=1000 ==> C2-interrupt 99-100%
>>> customized kernel, SCHED_ULE, kern.hz="100" ==> C2-interrupt 99-100%
>> I did some testing and here are some additional data.
>> Running with sched_ule and hz=1000 I measured that mean time spent in C2
>> state (from starting to go into it, till completely returning from it)
>> is ~860us (end_time - start_time, no overhead adjustments).
>> Additionally, 98.1% of sleeps lasted 800-900us, 1.8% lasted 700-800us,
>> the rest is spread almost uniformly over the remaining intervals of
>> range 0-1000%.
>>
>> Here's a typical backtrace for statclock execution:
>> #9  0xc04cce26 in statclock (usermode=0) at
>> /system/src/sys/kern/kern_clock.c:447
>> #10 0xc06b5f0c in rtcintr (frame=0xcfb30c78) at
>> /system/src/sys/i386/isa/clock.c:236
>> #11 0xc06a5275 in intr_execute_handlers (isrc=0xc0736900,
>> frame=0xcfb30c78) at /system/src/sys/i386/i386/intr_machdep.c:362
>> #12 0xc06b558c in atpic_handle_intr (vector=8, frame=0xcfb30c78) at
>> /system/src/sys/i386/isa/atpic.c:596
>> #13 0xc06a0d71 in Xatpic_intr8 () at atpic_vector.s:70
>> #14 0xc06a70c9 in spinlock_exit () at cpufunc.h:365
>> #15 0xc04e6993 in ithread_loop (arg=0xc22f5980) at
>> /system/src/sys/kern/kern_intr.c:1137
>> #16 0xc04e32c1 in fork_exit (callout=0xc04e6640 <ithread_loop>,
>> arg=0xc22f5980, frame=0xcfb30d38) at /system/src/sys/kern/kern_fork.c:754
>> #17 0xc06a0bc0 in fork_trampoline () at
>> /system/src/sys/i386/i386/exception.s:205
>>
>> I.e. swi4 kernel thread is "caught" by IRQ8 just when it exits from
>> thread_unlock() in the following snippet:
>> 1132                    if (!ithd->it_need && !(ithd->it_flags & IT_DEAD)) {
>> 1133                            TD_SET_IWAIT(td);
>> 1134                            ie->ie_count = 0;
>> 1135                            mi_switch(SW_VOL, NULL);
>> 1136                    }
>> 1137                    thread_unlock(td);
>>
>> This happens almost deterministically.
>>
> 
> 

-- 
Andriy Gapon