identifying and fixing server I/O slowdowns

Fri Aug 6 00:17:04 PDT 2004

Oh great and wise FreeBSD gurus,

I've been running FreeBSD boxes for about five years (six of them at 
the moment) with great results, but recently one of my machines has 
started to seriously act up.  Every time a heavy disk operation (say, 
tar'ing a 1 gig directory) runs, the system slows to a crawl, and 
requests to the Apache/PHP/MySQL sites hosted on it just hang.

The system is a dual P3 1.13 GHz box with a gig of RAM and mirrored 80 
gig WD800BB drives on a Promise TX2 controller.  The RAID isn't 
degraded.  There's a dedicated 1.5 gig swap partition plus a swap file 
on the /usr partition.  We had some Apache processes go nuts one 
time, which is why I added the swap file.

We run about 15 jails on the machine, with MySQL in the server proper 
and Apache/PHP running inside the jails.  I initially thought a rogue 
process was taking down the machine, but it seems that any heavy disk 
activity lasting more than a few minutes brings on the slowdown.  It 
doesn't happen instantly, but after a minute or two things slow to a 
crawl.
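
To put numbers on "slows to a crawl", something like the following 
rough sketch could be left running while the tar job is started.  It 
times one small write plus fsync per second, so the delay before the 
slowdown kicks in shows up as a jump in the printed latencies.  The 
function names (probe_once, run_probe), the /tmp target directory and 
the use of Python 3 are all assumptions made for illustration, not 
part of the setup described above.

```python
import os
import tempfile
import time


def probe_once(directory, size=4096):
    """Time one small write + fsync in `directory`; returns seconds."""
    payload = b"x" * size
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data to disk so the disk queue is felt
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start


def run_probe(directory, samples=60, interval=1.0):
    """Collect `samples` latencies, sleeping `interval` seconds between them."""
    results = []
    for _ in range(samples):
        results.append(probe_once(directory))
        time.sleep(interval)
    return results


if __name__ == "__main__":
    # /tmp is a placeholder; point this at the filesystem under test.
    for latency in run_probe("/tmp"):
        print("%.3f s" % latency)
```

Pointing the probe at a filesystem on the mirrored drives, rather than 
a memory-backed one, is what makes the numbers meaningful.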

I've recompiled the kernel a few times, upgraded to the latest 
4-STABLE rev, and even turned on device polling, but nothing seems to 
be helping.  It doesn't seem to happen on another machine we have 
with identical hardware.

My sysctl.conf:

kern.ipc.somaxconn=4096
net.inet.tcp.sendspace=32768
net.inet.tcp.recvspace=32768
net.inet.icmp.drop_redirect=1
net.inet.icmp.log_redirect=1
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0
net.link.ether.inet.max_age=1200
net.inet.icmp.bmcastecho=0
net.inet.icmp.maskrepl=0
kern.maxfiles=65536
kern.ipc.shm_use_phys=1
kern.polling.enable=1

And a netstat -m:

301/928/131072 mbufs in use (current/peak/max):
         301 mbufs allocated to data
287/874/32768 mbuf clusters in use (current/peak/max)
1980 Kbytes allocated to network (2% of mb_map in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

And here's a typical systat -v snapshot while the machine's 'ok':

     3 users    Load  0.32  0.38  0.31                  Aug  6 00:03

Mem:KB    REAL            VIRTUAL                     VN PAGER  SWAP PAGER
         Tot   Share      Tot    Share    Free         in  out     in  out
Act  221588   38656   747652   117796   39404 count    4           3
All 1024156   41620  1546136   144132         pages   18           5
                                                                  Interrupts
Proc:r  p  d  s  w    Csw  Trp  Sys  Int  Sof  Flt     21 cow    1156 total
      2     2 70       343  63322119 1156   57  397 186992 wire        fxp0 irq2
                                                    623848 act      13 ohci0 irq9
  4.4%Sys   1.0%Intr  2.5%User  0.0%Nice 92.1%Idl   176096 inact    11 mux irq10
|    |    |    |    |    |    |    |    |    |      37220 cache       fdc0 irq6
==+>                                                 2184 free   1004 clk irq0
                                                           daefr   128 rtc irq8
Namei         Name-cache    Dir-cache                  15 prcfr
     Calls     hits    %     hits    %                   5 react
       126      125   99                                   pdwake
                                       340 zfod            pdpgs
Disks   ad4   ad6   fd0   md0         119 ofod          1 intrn
KB/t   0.00 16.72  0.00  0.00          34 %slo-z   114304 buf
tps       0    11     0     0         401 tfree       173 dirtybuf
MB/s   0.00  0.17  0.00  0.00                       70310 desiredvnodes
% busy    0     9     0     0                       64089 numvnodes
                                                     54829 freevnodes

And here's a systat -v snapshot while the machine's choking:

     4 users    Load  0.39  0.35  0.31                  Aug  6 00:08

Mem:KB    REAL            VIRTUAL                     VN PAGER  SWAP PAGER
         Tot   Share      Tot    Share    Free         in  out     in  out
Act  191344   34248   728736   117268   51916 count    1                6
All 1024676   37500  2075520   144188         pages    2               67
                                                                  Interrupts
Proc:r  p  d  s  w    Csw  Trp  Sys  Int  Sof  Flt     29 cow    1698 total
      5     2 70       573  74423171 1699  225  367 180904 wire        fxp0 irq2
                                                    640404 act     335 ohci0 irq9
  5.7%Sys   1.9%Intr  7.5%User  0.0%Nice 84.9%Idl   153116 inact   236 mux irq10
|    |    |    |    |    |    |    |    |    |      50252 cache       fdc0 irq6
===+>>>>                                             1664 free    999 clk irq0
                                                           daefr   128 rtc irq8
Namei         Name-cache    Dir-cache                  93 prcfr
     Calls     hits    %     hits    %                   1 react
      8693     8196   94       12    0                     pdwake
                                       308 zfod       2693 pdpgs
Disks   ad4   ad6   fd0   md0         135 ofod            intrn
KB/t  98.81 16.61  0.00  0.00          43 %slo-z   114304 buf
tps      13   225     0     0        1277 tfree       278 dirtybuf
MB/s   1.23  3.64  0.00  0.00                       70310 desiredvnodes
% busy    2    99     0     0                       64089 numvnodes
                                                     52125 freevnodes
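
For comparing snapshots like the two above without eyeballing the 
columns, here is a small sketch that pulls the per-disk "% busy" 
figures out of the Disks block of saved systat -v output.  The helper 
names (busy_by_disk, saturated) and the 90% threshold are made up for 
illustration, and the parsing assumes the column layout shown in this 
post.  Fed the choking snapshot, it reports ad6 at 99% while ad4 sits 
at 2%.

```python
import re

# Device names look like "ad4", "fd0", "md0": letters followed by digits.
DEVICE = re.compile(r"^[a-z]+\d+$")


def busy_by_disk(systat_text):
    """Map each disk in one systat -v snapshot to its '% busy' value."""
    names, busy = [], []
    for line in systat_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Disks"):
            names = []
            # Disk names come first; stop at the unrelated right-hand columns.
            for token in stripped.split()[1:]:
                if not DEVICE.match(token):
                    break
                names.append(token)
        elif stripped.startswith("% busy") and names:
            values = stripped[len("% busy"):].split()
            busy = [int(v) for v in values[:len(names)]]
    return dict(zip(names, busy))


def saturated(systat_text, threshold=90):
    """List disks at or above `threshold` percent busy (threshold is arbitrary)."""
    return [d for d, pct in busy_by_disk(systat_text).items() if pct >= threshold]


# Disks block copied from the choking snapshot in this post.
SAMPLE = """\
Disks   ad4   ad6   fd0   md0         135 ofod            intrn
KB/t  98.81 16.61  0.00  0.00          43 %slo-z   114304 buf
tps      13   225     0     0        1277 tfree       278 dirtybuf
MB/s   1.23  3.64  0.00  0.00                       70310 desiredvnodes
% busy    2    99     0     0                       64089 numvnodes
"""

if __name__ == "__main__":
    print(busy_by_disk(SAMPLE))  # {'ad4': 2, 'ad6': 99, 'fd0': 0, 'md0': 0}
    print(saturated(SAMPLE))     # ['ad6']
```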

Thoughts?  Is there any way to keep a single process from 
monopolizing the disk controller?

-- 

Jeff Kramer
jeffk at well.com
http://www.keika.org/