high load system do not take all CPU time

Коньков Евгений kes-kes at yandex.ru
Mon Dec 26 18:52:18 UTC 2011


Hello, Коньков.

You wrote on 25 December 2011, 18:10:17:

КЕ> Hello, wishmaster.

КЕ> You wrote on 19 December 2011, 6:54:08:



w>>  --- Original message ---
w>>  From: "Коньков Евгений" <kes-kes at yandex.ru>
w>>  To: "Daniel Staal" <DStaal at usa.net>
w>>  Date: 18 December 2011, 19:47:40
w>>  Subject: Re[2]: high load system do not take all CPU time


>>> Hello, Daniel.
>>> 
>>> You wrote on 18 December 2011, 17:52:00:
>>> 
>>> DS> --As of December 17, 2011 10:29:42 AM +0200, Коньков Евгений 
>>> DS> is alleged to have said:
>>> 
>>> >> How can I debug why the system does not use the free CPU resources?
>>> >>
>>> >> In these pictures you can see that the CPU cannot exceed 400 ticks:
>>> >> http://piccy.info/view3/2368839/c9022754d5fcd64aff04482dd360b5b2/
>>> >> http://piccy.info/view3/2368837/a12aeed98681ed10f1a22f5b5edc5abc/
>>> >> http://piccy.info/view3/2368836/da6a67703af80eb0ab8088ab8421385c/
>>> >>
>>> >>
>>> >> In these pictures you can see that the problems begin with traffic on re0,
>>> >> when the CPU load rises to its "maximum":
>>> >> http://piccy.info/view3/2368834/512139edc56eea736881affcda490eca/
>>> >> http://piccy.info/view3/2368827/d27aead22eff69fd1ec2b6aa15e2cea3/
>>> >>
>>> >> But there is still 25% CPU idle at that moment.
>>> 
>>> DS> <snip>
>>> 
>>> >> # top -SIHP
>>> >> last pid: 93050;  load averages:  1.45,  1.41,  1.29   up 9+16:32:06  10:28:43
>>> >> 237 processes: 5 running, 210 sleeping, 2 stopped, 20 waiting
>>> >> CPU 0:  0.8% user,  0.0% nice,  8.7% system, 17.7% interrupt, 72.8% idle
>>> >> CPU 1:  0.0% user,  0.0% nice,  9.1% system, 20.1% interrupt, 70.9% idle
>>> >> CPU 2:  0.4% user,  0.0% nice,  9.4% system, 19.7% interrupt, 70.5% idle
>>> >> CPU 3:  1.2% user,  0.0% nice,  6.3% system, 22.4% interrupt, 70.1% idle
>>> >> Mem: 843M Active, 2476M Inact, 347M Wired, 150M Cache, 112M Buf, 80M Free
>>> >> Swap: 4096M Total, 15M Used, 4080M Free
>>> 
>>> DS> --As for the rest, it is mine.
>>> 
>>> DS> You are I/O bound; most of your time is spent in interrupts.  The CPU is
>>> DS> dealing with things as fast as it can get them, but it has to wait for the
>>> DS> disk and/or network card to get them to it.  The CPU is not your problem;
>>> DS> if you need more performance, you need to tune the I/O.  (And possibly get
>>> DS> better I/O cards, if available.)
>>> 
>>> DS> Daniel T. Staal
>>> 
>>> Can I find out the interrupt limit, or calculate it before that limit is
>>> reached?
>>> 
>>> The interrupt source is the internal network card:
>>> # vmstat -i
>>> interrupt                          total       rate
>>> irq14: ata0                       349756         78
>>> irq16: ehci0                        7427          1
>>> irq23: ehci1                       12150          2
>>> cpu0:timer                      18268704       4122
>>> irq256: re0                     85001260      19178
>>> cpu1:timer                      18262192       4120
>>> cpu2:timer                      18217064       4110
>>> cpu3:timer                      18210509       4108
>>> Total                          158329062      35724
>>> 
>>> Do you have any good I/O tuning links to read?
>>> 
>>> -- 
>>> Best regards,
>>> Коньков                          mailto:kes-kes at yandex.ru
w>>   
w>>  Your problem is the poor performance of your LAN card. The guy from
w>> Calomel Org told you about it. He advised you to switch to an Intel network card.
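
(A quick way to confirm exactly which Realtek chip sits behind re0 is
pciconf from the base system; the grep pattern below simply assumes the
device name re0:

# pciconf -lv | grep -A 3 '^re0@'

The vendor/device lines in the output identify the chip revision.)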

КЕ> Look at time 17:20:
КЕ> http://piccy.info/view3/2404329/dd9f28f8ac74d3d2f698ff14c305fe31/

КЕ> At this point freeradius starts to work slowly, because no CPU time (or too
КЕ> little) is allocated to it, and mpd5 starts to drop users because it gets no
КЕ> response from radius. Sadly, I do not know what the idle values in 'top' were.

КЕ> Does SNMP return the right values for CPU usage?

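One way to cross-check: compare the kernel's raw tick counters with what
snmpd reports. A minimal sketch, assuming the net-snmp snmpd from the
process list below has the UCD MIB enabled, with 'public' as a placeholder
community string:

# sysctl kern.cp_time
# snmpwalk -v 2c -c public localhost UCD-SNMP-MIB::systemStats

kern.cp_time holds five cumulative counters (user, nice, system, interrupt,
idle); the ssCpuRaw* counters from the walk should grow in step with the
matching fields, otherwise the graphs drawn from SNMP understate the load.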

last pid: 14445;  load averages:  6.88,  5.69,  5.33                up 0+12:11:35  20:37:57
244 processes: 12 running, 211 sleeping, 3 stopped, 15 waiting, 3 lock
CPU 0:  4.7% user,  0.0% nice, 13.3% system, 46.7% interrupt, 35.3% idle
CPU 1:  2.0% user,  0.0% nice,  9.8% system, 69.4% interrupt, 18.8% idle
CPU 2:  2.7% user,  0.0% nice,  8.2% system, 74.5% interrupt, 14.5% idle
CPU 3:  1.2% user,  0.0% nice,  9.4% system, 78.0% interrupt, 11.4% idle
Mem: 800M Active, 2708M Inact, 237M Wired, 60M Cache, 112M Buf, 93M Free
Swap: 4096M Total, 25M Used, 4071M Free

  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
   12 root       -72    -     0K   160K CPU1    1 159:49 100.00% {swi1: netisr 3}
   12 root       -72    -     0K   160K *per-i  2 101:25 84.57% {swi1: netisr 1}
   12 root       -72    -     0K   160K *per-i  3  60:10 40.72% {swi1: netisr 2}
   12 root       -72    -     0K   160K *per-i  2  41:54 39.26% {swi1: netisr 0}
   11 root       155 ki31     0K    32K RUN     0 533:06 24.46% {idle: cpu0}
 3639 root        36    0 10460K  3824K CPU3    3   7:43 22.17% zebra
   12 root       -92    -     0K   160K CPU0    0  93:56 14.94% {irq256: re0}
   11 root       155 ki31     0K    32K RUN     1 563:29 14.16% {idle: cpu1}
   11 root       155 ki31     0K    32K RUN     2 551:46 12.79% {idle: cpu2}
   11 root       155 ki31     0K    32K RUN     3 558:54 11.52% {idle: cpu3}
   13 root       -16    -     0K    32K sleep   3  16:56  4.93% {ng_queue2}
   13 root       -16    -     0K    32K RUN     2  16:56  4.69% {ng_queue0}
   13 root       -16    -     0K    32K RUN     0  16:56  4.54% {ng_queue1}
   13 root       -16    -     0K    32K RUN     1  16:59  4.44% {ng_queue3}
 6818 root        22    0 15392K  4836K select  2  25:16  4.10% snmpd
49448 freeradius  29    0 27748K 16984K select  3   2:37  2.59% {initial thread}
16118 firebird    20  -10   233M   145M usem    2   0:06  0.83% {fb_smp_server}
14282 cacti       21    0 12000K  3084K select  3   0:00  0.68% snmpwalk
16118 firebird    20  -10   233M   145M usem    0   0:03  0.54% {fb_smp_server}
 5572 root        21    0   136M 78284K wait    1   5:23  0.49% {mpd5}
14507 root        20    0  9536K  1148K nanslp  0   0:51  0.15% monitord
14441 root        25    0 11596K  4048K CPU0    0   0:00  0.00% perl5.14.1
14443 cacti       21    0 11476K  2920K piperd  0   0:00  0.00% perl5.14.1
14444 root        22    0  9728K  1744K select  0   0:00  0.00% sudo
14445 root        21    0  9672K  1240K kqread  0   0:00  0.00% ping

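The netisr software-interrupt threads are the busiest entries in that list,
with one pinned at 100%. On FreeBSD 8.x/9.x netisr behaviour is tunable; a
sketch of where the knobs are (sysctl names from the stock kernel, values
only illustrative):

# sysctl net.isr

and, in /boot/loader.conf:

net.isr.maxthreads=4      # up to one netisr thread per core
net.isr.bindthreads=1     # pin each netisr thread to its own CPU

Whether direct or deferred dispatch behaves better here would have to be
measured; this only shows what to inspect.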

# vmstat -i
interrupt                          total       rate
irq14: ata0                      1577446         35
irq16: ehci0                       66968          1
irq23: ehci1                       94012          2
cpu0:timer                     180767557       4122
irq256: re0                    683483519      15587
cpu1:timer                     180031511       4105
cpu3:timer                     175311179       3998
cpu2:timer                     179460055       4092
Total                         1400792247      31947


    1 users    Load  6.02  5.59  5.31                  Dec 26 20:38

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act 1022276   12900  3562636    39576  208992  count           4
All 1143548   20380  5806292   100876          pages          48
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt   1135 cow   37428 total
            186      129k  10k  17k  21k  14k 5857   2348 zfod     15 ata0 14
                                                      184 ozfod     1 ehci0 16
 8.1%Sys  68.4%Intr  5.9%User  0.0%Nice 17.6%Idle       7%ozfod     2 ehci1 23
|    |    |    |    |    |    |    |    |    |    |       daefr  4120 cpu0:timer
====++++++++++++++++++++++++++++++++++>>>            2423 prcfr 21013 re0 256
                                       208 dtbuf     4425 totfr  4100 cpu1:timer
Namei     Name-cache   Dir-cache    142271 desvn          react  4083 cpu3:timer
   Calls    hits   %    hits   %      3750 numvn          pdwak  4094 cpu2:timer
   36571   36546 100                  1998 frevn          pdpgs
                                                          intrn
Disks   ad0   da0 pass0                            241412 wire
KB/t  26.81  0.00  0.00                            826884 act
tps      15     0     0                           2714240 inact
MB/s   0.39  0.00  0.00                             97284 cache
%busy     1     0     0                            111708 free
                                                   114976 buf

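With re0 alone generating 15-21k interrupts per second and systat showing
~68% of CPU time in interrupt context, one stock way to cap re(4) interrupt
load is device polling. A sketch, not a recommendation (it needs a kernel
rebuild, and the re(4) man page should be checked for polling support on
this chip):

options DEVICE_POLLING    # kernel configuration file
options HZ=1000

# ifconfig re0 polling    # enable per interface after reboot

Polling trades interrupt load for periodic work at the HZ rate, so latency
and throughput need to be re-measured afterwards.
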
# netstat -w 1 -I re0
            input          (re0)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
     52329     0     0   40219676      58513     0   40189497     0
     50207     0     0   37985881      57340     0   38438634     0

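Some arithmetic on that sample: 40,219,676 bytes over 52,329 input packets
is an average of roughly 769 bytes per packet, and ~40 MB/s each way is
about 320 Mbit/s in plus 320 Mbit/s out through the single re0.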

http://piccy.info/view3/2409691/69d31186d8943a53c31ec193c8dfe79d/
http://piccy.info/view3/2409746/efb444ffe892592fbd6f025fd14535c4/
As you can see, before the overload happens the server passes through more traffic.

Programs at this moment run very slowly!
During the day there can be more interrupts on re0 than there are now, and the server works fine.

I think this is some problem with the scheduler.

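If the scheduler is suspect, it is at least easy to confirm which one is
running and how preemption is configured (stock sysctls; the output line is
illustrative):

# sysctl kern.sched.name
kern.sched.name: ULE
# sysctl kern.sched.preempt_thresh

That said, the numbers above would also fit Daniel's earlier diagnosis that
the box is interrupt-bound rather than scheduler-bound.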

-- 
Best regards,
 Коньков                          mailto:kes-kes at yandex.ru


