How to obtain which interrupts cause system to hang?

Mon Oct 11 04:27:44 UTC 2010

On Sun, 10 Oct 2010 19:27:05 +0300, kes-kes at yandex.ru wrote:
 > Hi, Ian.

Hi Eugen,

 >  >>  >> 23.1%Sys  50.8%Intr  1.3%User  0.0%Nice 24.8%Idle        %ozfod  1999 cpu0: time
 >  >>  >> |    |    |    |    |    |    |    |    |    |    |       daefr
 >  >>  >> ============+++++++++++++++++++++++++>                  6 prcfr
 >  >> 
 >  >> IS> Yes, system and esp. interrupt time is heavy .. 23k context switches!?

[..]

 >  >> IS> Disable p4tcc if it's a modern CPU; that usually hurts more than helps.
 >  >> IS> Disable polling if you're using that .. you haven't provided much info,
 >  >> IS> like is this with any network load, despite nfe0 showing no interrupts?
 > 
 >  >> Polling is ON. Traffic is about 60Mbit/s routed from nfe0 to vlan4 on rl0;
 >  >> when the interrupts happen, traffic slows down to 25-30Mbit/s.
 > 
 > IS> Out of my depth.  If it's a net problem - maybe not - you may do better
 > IS> in freebsd-net@ if you provide enough information (dmesg plus ifconfig,
 > IS> vmstat -i etc, normally and while this problem is happening).

[..]

 >  >>  >> How can I find out what nasty thing is happening, and which process
 >  >>  >> takes 36-50% of CPU resources?
 >  >> 
 >  >> IS> Try 'top -S'. It's almost certainly system process[es], not shown above.
 > 
 > IS> Does that not show anything?  Also, something like 'ps auxww | less' 
 > IS> should show you what's using all that CPU.  I'm out of wild clues.
 > 
 > vpn_shadow# top -S
 > last pid: 57879;  load averages:  0.12,  0.06,  0.05       up 1+18:37:39  19:19:14

Ok, this was taken when things weren't so busy as the earlier 36-50% ..

 > 101 processes: 2 running, 83 sleeping, 16 waiting
 > CPU:  0.0% user,  0.0% nice, 14.3% system, 17.3% interrupt, 68.4% idle
 > Mem: 319M Active, 799M Inact, 354M Wired, 336K Cache, 213M Buf, 503M Free
 > Swap: 4063M Total, 4063M Free
 > 
 >   PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
 >    11 root          1 171 ki31     0K    16K RUN     24.9H 86.47% idle: cpu0
 >    14 root          1 -44    -     0K    16K WAIT   689:52 10.25% swi1: net
 >     2 root          1 -68    -     0K    16K sleep  207:35  4.69% ng_queue0
 >    40 root          1 -68    -     0K    16K -      101:37  1.46% dummynet

.. but still, if you add up the TIMEs above here it comes to about 41.5
hours, all but about an hour of your total uptime, a good part of which
is consumed by the next three below, so swi1 and ng_queue look like what's
using most CPU long-term.
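
For anyone wanting to check that sum, here is a minimal Python sketch of the arithmetic. It assumes a top(1) TIME field without an 'H' suffix means minutes:seconds and that the uptime reads days+HH:MM:SS; the helper names are made up for illustration. It prints the total and the gap to the uptime rather than asserting them here.

```python
# Sum the TIME column of the first four 'top -S' entries quoted above and
# compare the result with the reported uptime (1+18:37:39).

def top_time_to_seconds(field: str) -> float:
    """Convert a top(1) TIME field such as '24.9H' or '689:52' to seconds.

    Assumption: a trailing 'H' means hours, otherwise minutes:seconds.
    """
    if field.endswith("H"):
        return float(field[:-1]) * 3600
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + int(seconds)


def uptime_to_seconds(field: str) -> int:
    """Convert an uptime such as '1+18:37:39' (days+HH:MM:SS) to seconds."""
    days, _, clock = field.rpartition("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


# Values copied from the quoted top output.
TIMES = {
    "idle: cpu0": "24.9H",
    "swi1: net": "689:52",
    "ng_queue0": "207:35",
    "dummynet": "101:37",
}
UPTIME = "1+18:37:39"

total = sum(top_time_to_seconds(t) for t in TIMES.values())
uptime = uptime_to_seconds(UPTIME)

print(f"top four entries: {total / 3600:.2f} h")
print(f"uptime:           {uptime / 3600:.2f} h")
print(f"unaccounted:      {(uptime - total) / 3600:.2f} h")

# Share of the non-idle time taken by each kernel thread.
busy = total - top_time_to_seconds(TIMES["idle: cpu0"])
for name, field in TIMES.items():
    if name != "idle: cpu0":
        print(f"{name:12s} {100 * top_time_to_seconds(field) / busy:5.1f}% of busy time")
```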

 >    47 root          1  20    -     0K    16K syncer   5:29  0.29% syncer
 >    12 root          1 -32    -     0K    16K WAIT    14:48  0.00% swi4: clock sio
 >    15 root          1 -16    -     0K    16K -        5:39  0.00% yarrow
 >   986 root          1  44    0  5692K  1408K select   1:29  0.00% syslogd
 >  1054 bind          4   4    0   138M   113M kqread   1:22  0.00% named
 >  1162 clamav        1   4    0  4616K  1468K accept   0:59  0.00% smtp-gated

Smells net-related to me, maybe polling, but like I said, I'm out of my 
depth.  You should have enough info to take to freebsd-net@ anyway.
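
If it helps when gathering numbers for the list, the following is a rough Python sketch of comparing two 'vmstat -i' snapshots, one taken normally and one while the problem is happening, to see which interrupt source jumped. It assumes each line is laid out as name, total, rate; the sample device names and counts are invented purely for illustration and are not taken from this machine.

```python
# Compare two 'vmstat -i' snapshots and rank interrupt sources by how fast
# their counters grew in between. All sample figures below are hypothetical.

def parse_vmstat_i(text: str) -> dict[str, int]:
    """Return {interrupt name: total count} from 'vmstat -i' style text.

    Assumption: each data line ends with two numeric columns (total, rate).
    """
    counts = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue
        name, total, _rate = parts
        if not total.isdigit() or name.strip() == "Total":
            continue  # skip the header and the summary line
        counts[name.strip()] = int(total)
    return counts


def interrupt_rates(before: str, after: str, interval: float):
    """Return [(name, interrupts per second)] sorted by descending rate."""
    old, new = parse_vmstat_i(before), parse_vmstat_i(after)
    rates = [(name, (new[name] - old.get(name, 0)) / interval) for name in new]
    return sorted(rates, key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken 10 seconds apart.
BEFORE = """interrupt                          total       rate
irq21: nfe0                         1000         10
irq22: rl0                          5000         50
cpu0: timer                       200000       1999
Total                             206000       2059
"""
AFTER = """interrupt                          total       rate
irq21: nfe0                         1000         10
irq22: rl0                         65000         60
cpu0: timer                       220000       1999
Total                             286000       2069
"""

rates = interrupt_rates(BEFORE, AFTER, interval=10.0)
for name, per_second in rates:
    print(f"{name:16s} {per_second:10.1f} /s")
```

The rate column that vmstat prints is an average since boot, which is why the sketch diffs the totals instead of trusting it.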

cheers, Ian

PS: I still think you should take the time to close PR kern/129103 :)