What is loading my server so much?

Thu Dec 9 13:53:26 UTC 2010

On Thu, 09 Dec 2010 12:31:04 +0100
Laszlo Nagy <gandalf at shopzeus.com> wrote:

> System is FreeBSD shopzeus.com 8.1-STABLE FreeBSD 8.1-STABLE #0: Sun Oct 
> 31 02:55:28 EDT 2010     amd64
> It has two quad-core Xeon CPUs, 24GB memory, and a RAID 1+0 array with 
> 10 disks + Areca 1680 controller with 2GB write back cache.
> 
> Server is running: mailscanner + apache multihost + PHP + postgresql. 
> Main load on the server is usually postgresql.
> 
> Today something happened. Number of http processes went up to 200. As a 
> result, number of connections to database also went up to 200, and the 
> web server is now refusing clients with "Cannot connect to database" 
> messages (coming from PHP).
> 
> This is a typical output from top:
> 
> last pid: 12789;  load averages:  7.77, 10.77, 13.46    up 26+03:00:30  06:22:04
> 6637 processes: 7 running, 623 sleeping, 7 zombie
> CPU: 32.9% user,  0.0% nice,  7.6% system,  0.6% interrupt, 58.9% idle
> Mem: 3885M Active, 15G Inact, 3236M Wired, 627M Cache, 2465M Buf, 656M Free
> Swap: 12G Total, 12M Used, 12G Free
> 
>   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
> 66834 pgsql            1 118    0   443M   417M CPU2    2  16:17 99.46% postgres
> 11473 pgsql            1  72    0   441M   242M sbwait  5   0:02  4.59% postgres
> 11026 pgsql            1  47    0   439M   249M sbwait  7   0:01  3.17% postgres
>  6642 www              1  48    0   236M 42928K select  0   0:01  2.29% httpd
> 10147 www              1  48    0   236M 44048K select  6   0:01  2.10% httpd
>  3961 shopzeus        29  44    0   208M 96364K uwait   4  18.4H  1.37% python
> 
> 
> Here is what I don't understand. "last pid" is increasing relatively 
> slowly, e.g. there are no hidden processes. Only the first one or two 
> processes are showing CPU load > 10%.  The "CPU User%" value is about 
> 50%. We have lots of free memory. I/O load is almost nothing (see iostat 
> below).
> 
> However, server load is between 7 and 13! In fact sometimes it is above 
> 16. And everybody complains that the server is too slow.
> 
> How can I find out what is causing the problem?

Step 1, get them to define "server" and "too slow":

If you log in and do shell ops, is the system slow to respond?  Based on
what you've reported, I'd be willing to bet that shell ops are pretty
responsive.  I can't be 100% sure without more information, but I'm
willing to bet that what your users are complaining about is your web
application being slow.  Since you don't say what that application is,
I can only provide general advice.

I'm guessing that PostgreSQL is the bottleneck.  I'm going to first make
a few general suggestions, then provide suggestions on how to isolate the
problem more specifically.

First off, you have 24G of RAM available and PostgreSQL only seems to
have access to 400M of it.  Bump shared_buffers up to 2 or 3 G at least,
and bump up work_mem to at least a few hundred meg, and
maintenance_work_mem up to 1/2G or so.
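
When picking a work_mem value, keep in mind that it is a limit per sort
or hash operation in each backend, not a server-wide pool, so it helps
to multiply it out against the connection count first.  A minimal
sketch of that arithmetic, assuming the 200 connections mentioned in
the original post; the function name and the 256MB figure are made up
for illustration:

```python
def worst_case_work_mem_gb(work_mem_mb, connections, ops_per_query=1):
    """Upper bound (in GB) if every backend hits work_mem at once.

    ops_per_query is an assumption: how many sorts/hashes a single
    query runs at the same time.
    """
    return work_mem_mb * connections * ops_per_query / 1024.0

# 200 backends, each running one 256MB sort at the same moment
print(worst_case_work_mem_gb(256, 200))
```

That worst case is unlikely to be reached in practice, but it is a
quick sanity check against the 24G in the box.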

If the top and gstat outputs are typical, it looks like PostgreSQL is
doing mostly writes, but is not significantly blocked on writes.  It looks
like individual PostgreSQL processes are simply taking a long time to do
their work.

What's in your PostgreSQL log files?  If there's nothing, then bump up
the logging information in your postgresql.conf.  I particularly like
log_min_duration_statement at 500 ... any query that takes longer than
1/2 second to execute is suspect in the types of applications I work
with most frequently.
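
Once log_min_duration_statement = 500 is set (the value is in
milliseconds), slow queries show up in the log as lines containing
"duration: ... ms  statement: ...".  A rough sketch of pulling the
slowest ones out of such a log, assuming single-line statements; the
sample lines and the function name are made up for illustration, and
the pattern would need adjusting for a custom log_line_prefix or for
prepared statements, which are logged differently:

```python
import re

# Matches the "duration: <ms> ms  statement: <sql>" part of a log line.
DURATION_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms\s+statement: (.*)")

def slowest_statements(lines, limit=10):
    """Return (duration_ms, statement) pairs, slowest first."""
    found = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match:
            found.append((float(match.group(1)), match.group(2).strip()))
    found.sort(key=lambda pair: pair[0], reverse=True)
    return found[:limit]

# Made-up sample lines, for illustration only.
sample = [
    "LOG:  duration: 812.400 ms  statement: SELECT * FROM orders",
    "LOG:  checkpoint starting: time",
    "LOG:  duration: 2301.020 ms  statement: SELECT * FROM items WHERE sku = 'x'",
]
for ms, sql in slowest_statements(sample):
    print("%10.1f ms  %s" % (ms, sql))
```

In real use the lines would come from the server log file rather than
a list, e.g. slowest_statements(open(path)) with path pointing at
wherever your installation writes its log.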

If your application is developed in-house, I'd be willing to bet a paycheck
that there are LOTS of indexes missing and that PostgreSQL is doing lots
of seq scans where it could run lots faster if it had indexes.
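
One way to look for those seq scans is the pg_stat_user_tables
statistics view, which counts sequential and index scans per table.
The following is only a sketch: the query uses columns that exist in
that view, but the threshold, the function name, and the sample rows
are assumptions chosen for illustration.

```python
# Query against the pg_stat_user_tables statistics view.
SEQ_SCAN_SQL = """
SELECT relname, seq_scan, seq_tup_read, COALESCE(idx_scan, 0) AS idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 20
"""

def seq_scan_suspects(rows, min_rows_per_scan=10000):
    """Pick tables whose sequential scans read many rows on average.

    rows: iterable of (relname, seq_scan, seq_tup_read, idx_scan).
    Returns (relname, seq_scan, avg_rows_per_scan, idx_scan) tuples,
    ordered by total rows read sequentially, largest first.
    """
    suspects = []
    for relname, seq_scan, seq_tup_read, idx_scan in rows:
        if seq_scan == 0:
            continue  # never scanned sequentially
        avg = seq_tup_read / float(seq_scan)
        if avg >= min_rows_per_scan:
            suspects.append((relname, seq_scan, int(avg), idx_scan))
    return sorted(suspects, key=lambda s: s[1] * s[2], reverse=True)

# Made-up rows standing in for the query result.
rows = [
    ("orders", 500, 50000000, 12),
    ("config", 9000, 45000, 0),
    ("idle", 0, 0, 3),
]
for suspect in seq_scan_suspects(rows):
    print(suspect)
```

A large table that is scanned sequentially over and over is a
candidate for a closer look with EXPLAIN on the queries that hit it;
small lookup tables are usually fine to scan.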

Also check your autovacuum settings and ensure that tables are not
bloating out of control due to insufficient vacuuming.  You may have to
VACUUM FULL and REINDEX the entire database to get things back under
control, which can take a long time if it's badly bloated.

Your application may also be suffering from lock contention if there are
lots of table locks used.  Looking at the pg_locks view while things are
slow can quickly identify if this is the case, and looking at
pg_stat_activity in conjunction with that view will usually narrow down
the problem pretty quickly.
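
A sketch of the kind of query that combines the two: it lists lock
requests that have not been granted, together with the query each
waiting backend is running.  The helper name is made up; the column
names are the part to watch, since pg_stat_activity uses procpid and
current_query in the 8.x series but pid and query from 9.2 onward.

```python
def blocked_locks_sql(legacy_columns=True):
    """Build a query listing lock requests that have not been granted.

    legacy_columns=True uses procpid/current_query (the names in the
    8.x series); False uses pid/query (the names from 9.2 onward).
    """
    pid_col = "procpid" if legacy_columns else "pid"
    query_col = "current_query" if legacy_columns else "query"
    return (
        "SELECT l.pid, l.locktype, l.mode, "
        "l.relation::regclass AS relation, "
        "a.usename, a.%s AS query "
        "FROM pg_locks l "
        "JOIN pg_stat_activity a ON a.%s = l.pid "
        "WHERE NOT l.granted "
        "ORDER BY l.pid" % (query_col, pid_col)
    )

print(blocked_locks_sql())
```

Paste the printed query into psql while the site is slow.  No rows
back means nothing is waiting on a lock at that moment, which points
away from lock contention.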

Finally, if you find that PostgreSQL is the bottleneck and you can't
narrow it down enough to fix, join the PostgreSQL general questions
mailing list and ask for help with the same level of detail you did
here.  You'll find that they're an equally helpful community.

Good luck.  Hope this helps.

-Bill