Timing issue with Dummynet on high kernel timer interrupt

Ian Smith smithi at nimnet.asn.au
Sat Nov 7 06:33:43 UTC 2015


On Sat, 7 Nov 2015 01:51:29 +0000, Rasool Al-Saadi wrote:
 > On Saturday, 7 November 2015 2:05 AM,  Hans Petter Selasky wrote:
 > > On 11/06/15 11:08, Luigi Rizzo wrote:
 > > > On Fri, Nov 6, 2015 at 10:52 AM, Hans Petter Selasky <hps at selasky.org>
 > > wrote:
 > > >> On 11/06/15 09:50, Luigi Rizzo wrote:
 > > >>>
 > > >>> On Fri, Nov 6, 2015 at 9:44 AM, Hans Petter Selasky
 > > >>> <hps at selasky.org>
 > > >>> wrote:
 > > > ...
 > > >>>> Hi,
 > > >>>>
 > > >>>> The C_DIRECT_EXEC flag reduces task switching overhead, in that
 > > >>>> you don't have to wake up a thread to wake up the dummynet
 > > >>>> worker thread.
 > > >>>> It affects timing.
 > > >>>
 > > >>>
 > > >>> Hans,
 > > >>> thanks for the explanation.
 > > >>>
 > > >>> Can you clarify the behaviour of C_DIRECT_EXEC ?
 > > >>> Does this mean that the task is run within some common thread
 > > >>> instead of a dedicated one ?
 > > >>
 > > >>
 > > >> Hi Luigi,
 > > >>
 > > >> C_DIRECT_EXEC means that the timer callback is executed directly from
 > > >> the fast interrupt filter of the timer or IPI.
 > > >>
 > > >>>
 > > >>> If so, for this type of task (dummynet may run at high rate and use
 > > >>> a significant amount of cpu time) it may be a good idea to remove
 > > >>> C_DIRECT_EXEC altogether.
 > > >>
 > > >>
 > > >> The ipfw dummynet code is not run from the timer callback. It is run
 > > >> from a taskqueue. I suspect there is likely a bug somewhere. At the
 > > >> moment it is not clear to me if there is a bug in the callout
 > > >> subsystem, whereby a timer started with a 1-tick delay doesn't
 > > >> kick in until 50ms or so later at HZ=4000, or if the dummynet
 > > >> task is doing a lot of work for 50ms.
 > > >> I think we need some more information to nail this one.
 > > >
 > > > It certainly does not run for 50ms, but it might occasionally keep the
 > > > thread busy for some 10-50us (I doubt it is longer than that) and
 > > > possibly cause the reschedule request to fall into the interval where
 > > > it should actually run.
 > > >
 > > > So if your theory is correct, it may well be that the callout system
 > > > sees the request "in the past" (possibly as a result of some incorrect
 > > > wraparound, or undefined behaviour on integer wraps) and then the
 > > > event is only recovered when the callout wheel (or whatever is the
 > > > underlying implementation) happens to go again through the entry.
 > > >
 > > > What is so magic about the values we see (400 or 600 or 40ms), I have no idea.
 > > >
 > > 
 > > Rasool:
 > > 
 > > It might be worth trying to set:
 > > 
 > > kern.eventtimer.periodic=1
 > > 
 > > In /boot/loader.conf . Can you test that too?
 > > 
 > > You need to reboot before the setting takes effect.

I've been trying to study and document the behaviour that gives rise to 
these '0.6' load averages often seen at even 99.9-100% idle on laptops 
especially, and here (Lenovo X200 2.4GHz Core2Duo on recent stable/9) it 
depends entirely on which eventtimer is selected.

Here the issue is with HPET - whether in one-shot or periodic mode - 
while LAPIC, in either mode, shows 0.00 LAs after 15 minutes.  However, 
LAPIC consumes ~40% more power, running ~20% hotter on battery as it 
only drops to (~100%) C2 state, whereas HPET runs over 95% in C3 in the 
same idle scenario.

A friend reports the same (0.6-0.7) LAs on a Digital Ocean 'droplet' on 
one core of an 8-core Xeon under KVM using LAPIC, and Jeremy Chadwick 
kicked this off in stable@ (re 10.x) showing ~0.15 LAs on bare metal, 
also with LAPIC, so it does appear there may be some deeper issues.

I have, however, seen no obvious[*] problem switching between eventtimers 
and en/disabling .periodic on the fly by sysctl, and observing immediate 
effects via top -SCHP, vmstat -w, and especially systat -vm.  Recently 
I've been playing with kern.eventtimer.idletick as well, and in the next 
few days hope to collate results of my tests into something coherent.
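For anyone wanting to reproduce the on-the-fly switching described above, it amounts to something like the following (these are the standard kern.eventtimer sysctls; which timers are actually offered - LAPIC, HPET, i8254 etc. - depends on the hardware, so treat the names here as examples):

```shell
# See which eventtimers this machine offers and which is active
sysctl kern.eventtimer.choice
sysctl kern.eventtimer.timer

# Switch the active eventtimer on the fly, and toggle periodic mode
sysctl kern.eventtimer.timer=LAPIC
sysctl kern.eventtimer.periodic=1

# Watch the immediate effect on interrupt rates and idle behaviour
systat -vm 1
```

Note these take effect immediately, unlike the loader.conf route, which needs a reboot.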

[*] except that it's necessary to remove then reapply AC power to have 
dev.cpu.N.cx_usage properly reflect the C-state stats, especially when 
switching to LAPIC from HPET on the fly, which is no doubt officially
unsupported, though it does instantly 'fix' the load average issue.

I also notice that with HPET, the timer interrupts themselves, whether 
as few as 60-70/s idle in one-shot mode or ~2000/s in periodic, appear 
in systat -vm or vmstat -w10 'Int' column, whereas with LAPIC or (yes I 
even tried) i8254, these timer interrupts don't count in that column.

I'll try Hans' suggestion of adjusting in loader.conf, rebooting between 
various options if really needed, though that'll be far more tedious .. 
[ I've just now seen Bruce's latest, so maybe I haven't been doing the 
wrong thing playing with these after all? ]

I've been struggling to express this clearly over on stable@ against 
ongoing consensus that the (symptomatic) load average issue is merely 
'cosmetic,' but my first thought when seeing $subject issue was to ask 
about values of sysctls kern.timecounter and kern.eventtimer and for a 
systat -vm shot or two while running your tests .. I now wish I had.

So could you show those now, so your clocks/timers config is clear?
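Concretely, something like the following would show the relevant config (standard FreeBSD sysctl trees; exact OIDs can vary a little between versions):

```shell
# Timecounter in use, and the available choices
sysctl kern.timecounter.hardware kern.timecounter.choice

# Eventtimer selection and mode (timer, periodic, idletick, ...)
sysctl kern.eventtimer

# And a snapshot or two of interrupt/idle behaviour while the tests run
systat -vm 1
```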

Excuse a long rather pent-up rave, but I could be away for a few days.

(Kind of amusing that you're trying for best performance at up to 10kHz 
ticker, while I'm hoping for at least vaguely accurate LAs at idle :)

cheers, Ian

 > Hans,
 > 
 > Yes, this solves the problem!
 > I will do more checking when I am near my testbed.
 > Thanks for your effort and time!
 > 
 > Cheers,
 > Rasool
 > > 
 > > Luigi:
 > > 
 > > I'm wondering if there is a problem with:
 > > 
 > > cpu_new_callout(a,b,c);
 > > 
 > > --HPS
