powerd broken

Sun Mar 29 08:06:59 PDT 2009

Dominic Fandrey wrote:
> Dominic Fandrey wrote:
>> Alexander Motin wrote:
>>> Dominic Fandrey wrote:
>>>> Alexander Motin wrote:
>>>>> It means that one of your CPUs spent most of it's time in interrupt
>>>>> processing and so far from idle. What does `top -P` shows you? Where
>>>>> have you seen that ~6% CPU load?
>>>> That is the load shown by the e17 CPU module. It's display has always
>>>> been in sync with top in the past, no longer though, it appears.
>>>>
>>>> # top -PIS
>>>> last pid: 68235;  load averages:  0.09,  0.16, 
>>>> 0.17                                                                                                                               
>>>> up 0+05:17:29  13:05:10
>>>> 137 processes: 4 running, 117 sleeping, 16 waiting
>>>> CPU 0:  1.1% user,  0.0% nice,  1.1% system, 61.4% interrupt, 36.3% idle
>>>> CPU 1:  9.0% user,  0.0% nice,  3.4% system,  0.0% interrupt, 87.6% idle
>>>> Mem: 419M Active, 415M Inact, 416M Wired, 3752K Cache, 183M Buf, 716M
>>>> Free
>>>> Swap: 4096M Total, 4096M Free
>>>>
>>>>   PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU
>>>> COMMAND
>>>>    11 root          1 171 ki31     0K    16K RUN    1 286:42 93.46%
>>>> idle: cpu1
>>>>    23 root          1 -80    -     0K    16K RUN    0 126:54 59.96%
>>>> irq16: hdac0 uhci+
>>>>    12 root          1 171 ki31     0K    16K RUN    0 179:27 40.09%
>>>> idle: cpu0
>>>>  1318 root          1  46    0   439M   312M select 1  12:55  1.76% Xorg
>>>>  4361 musicpd       4  44    0 91164K 14412K ucond  1   1:57  0.78% mpd
>>>>
>>>> Some things strike me as odd. The difference between the load reported
>>>> by powerd and top is still very significant and of course the high
>>>> interrupt load.
>>> powerd now reports/uses summary load of all CPUs (it can be bigger then
>>> 100%), while top shows average.
>>>
>>>> I've got a mouse with a 1khz report rate (the only connected USB
>>>> device), but unplugging it doesn't change the load. Neither does
>>>> stopping moused (I'm running the system without HAL). There also
>>>> is a fingerprint reader, but it is only detected by ugen.
>>> I would start from identifying all devices sharing that IRQ and trying
>>> to disable them (or unload their drivers) one by one. `systat -vm 1`
>>> will show you how much interrupts actually happens there per second.
>>>
>>> On most of modern systems you can make hdac0 to not share that IRQ by
>>> enabling MSI there with hint.hdac.0.msi=1 in loader.conf.
>>>
>> I already did unplug all devices to no avail. Afterwards I tried
>> to 'kldunload -f' all usb devices, but most refused. However during
>> this I recognized that /boot/modules holds an old u3g.ko, back from
>> the time when it was not in stable. Apparently modules in
>> /boot/modules have preference over those in /boot/kernels. I deleted
>> the old u3g.ko and rebooted (because I couldn't get the system to
>> unload it). Now everything is fine, the interrupt load is gone.
>>
>> I'll monitor this and come back here if it turns out this has just
>> been coincidence.
>>
>> Thank you for all that help.
>>
>> Regards
> 
> This has turned out to be an early call. I just happened to start
> skype_devel and the whole thing started over, so I think it's
> pretty certain, now, that this is a hdac issue.
> However after a reboot I tried to reproduce this and now skype
> and mpd are both running without causing an irq race.
> 
> If the irq race reappears I will use the device hint you suggested
> to rule out that this is an interrupt sharing issue.

Even after setting the device hint the problem has reoccurred.
It simply starts after a while, I cannot make out any cause. As it
turned out uhci0 is the violent interrupt with an interrupt rate
between 130k and 190k. This is so incredible, I wonder why the interrupt
throttling doesn't kick in.

    7 users    Load  0.16  0.27  0.28                  29 Mar 17:02

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  805472   76800  1252404   123696  615176  count
All 1062992   97920 14120720   152840          pages
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        cow    182k total
          6  84      360k 2770  14k 178k  224   26     12 zfod        atkbd0 1
                                                          ozfod       acpi0 irq9
 1.1%Sys  25.9%Intr  0.4%User  0.0%Nice 72.6%Idle        %ozfod       psm0 irq12
|    |    |    |    |    |    |    |    |    |    |       daefr       ata0 irq14
=+++++++++++++                                         15 prcfr  177k uhci0 drm0
                                         9 dtbuf      117 totfr    23 wpi0 uhci1
Namei     Name-cache   Dir-cache    100000 desvn          react       ehci0 uhci
   Calls    hits   %    hits   %      7632 numvn          pdwak       uhci2 ehci
      38      38 100                  3663 frevn          pdpgs   457 uhci3 21
                                                          intrn  2008 cpu0: time
Disks   ad4   cd0   sg0 pass0                      473048 wire    188 hdac0 256
KB/t   0.00  0.00  0.00  0.00                      653788 act         bge0 257 
tps       0     0     0     0                      274796 inact  2008 cpu1: time
MB/s   0.00  0.00  0.00  0.00                       20192 cache      
%busy     0     0     0     0                      594984 free

I can rule out drm0 as the cause, because uhci0 is the only common
presence in all occurrences of this problem.