System performance 4.x vs. 5.x and 6.x

Thu Feb 14 22:27:32 UTC 2008

On Thu, 14 Feb 2008, Kris Kennaway wrote:

> We are going to need more information about your system.  What do you
> mean by "peak activity"?  What is running on the system when it performs
> badly (check top -S, ps, gstat, vmstat -w, vmstat -i).  What is your
> kernel configuration, dmesg and relevant aspects of the system
> configuration?
>
> Kris
>

I would call 120 processes with a load average of 0.03 and 99.9% idle,
with 10-20 sendmail processes and 30 apache jobs, nothing to write home
about.  But when that jumps to 250 processes and a load average of 30 with
50% idle (5-10 second waits on single-character ssh echo), I'd call it a
bit busy.  That usually means my heavy pop3 users are checking in at the
same time someone (or 2 or 3) has sent email to the large-volume
listservs.  Process stats don't show as much as gstat and iostat.  Gstat
always shows the drive holding /var/mail at 97-100% busy, and iostat always
shows high tps rates but never anything above 8MB/s (4.10 gave me 30MB/s+).
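
High tps with low MB/s suggests small transfers, since throughput is
just tps times the average transfer size.  A minimal sketch of that
arithmetic follows; the 8 MB/s ceiling is from the figures above, but
the tps value is a made-up placeholder (no actual iostat tps figure is
given here), and avg_transfer_kb is only an illustrative helper name:

```python
def avg_transfer_kb(mb_per_sec, tps):
    """Average size of one transfer in KB, given throughput and transfers/sec."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return mb_per_sec * 1024.0 / tps


# 8 MB/s is the ceiling reported above.  The tps value is HYPOTHETICAL;
# substitute whatever iostat actually reports for the /var/mail drive.
hypothetical_tps = 1000
size_kb = avg_transfer_kb(8, hypothetical_tps)
print(f"{size_kb:.1f} KB per transfer")
```

If the real number comes out at a few KB per transfer, the drive is
probably seek-bound rather than bandwidth-bound.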

Kernel is generic with ipfirewall, quota and smp (no ipfw rules yet).

On Thu, 14 Feb 2008, Bill Moran wrote:

> What _is_ the hardware?

Dell PowerEdge 1750 1U, 146GB U320 drives.  The Broadcom NICs seem to be a
change from the earlier 1550s with Intel Pro/100s (I prefer the Intels).

On Thu, 14 Feb 2008, Kris Kennaway wrote:

> All it takes is a single bug (e.g. in a driver) to affect performance on
> a certain specific configuration.  However, bugs tend to get fixed over
> time.  Maybe that is the case for you.  It is well worth verifying
> whether the problem persists on the most up-to-date sources, so that
> everyone's time is not wasted in tracking down a problem that is already
> fixed.  You can just do a source upgrade from 6.2, which will be quite
> straightforward.

Agreed.  I have a 2nd machine that is identical to this one I could put
6.3 on to test this.

> It is pretty unusual for applications to be aborting, but usually they
> do it because they fail an application-specific run-time check.  What
> diagnostics are logged by the applications?  You may need to increase
> their respective verbosity/debug levels.
>
> Kris
>

I was suspicious that maybe we needed more memory but swap has barely even
been touched (232k used...with 1400meg inactive).

On Thu, 14 Feb 2008, Mike Tancsa wrote:

> No, but you havent given the list much to go on as to what the
> problems are or what hardware you are using, or really quantified the
> issue. By "slow" is the disk blocking on IO ? or are processes
> blocking on network IO etc etc.  6.2 was not a "bad" release, but 6.3
> is better than 6.2.  By starting with a more contemporary release,
> less effort by developers and other users need to be exerted in
> figuring out if the problem(s) you are running into have already been
> fixed.

It appears to me that disk access is extremely slow.  I can transfer
large files between the machines faster than I can make a duplicate copy
on the local disk.

> Because the drivers have changed since 4.10.  "improvements" could
> have introduced regressions... Change in the driver to support newer
> versions of a chipset might break older chipsets.

Any known issues with the Dell PERC RAID driver that anyone is aware
of?  I can start there.

> bge is a good example of a driver that has had a lot of changes and
> hasnt worked all that well at times.... hence the suggestion to try
> 6.3 as there have been many bug fixes.  Whether or not it fixes your
> problem its hard to say, but start there to see if things are faster
> and stable for you etc.
> e.g.
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/bge/if_bge.c
>
> You should also post a full dmesg of the box as well as kernel config
> etc...

The kernel is generic with ipfirewall, quota and smp.

Feb 14 02:53:37 mail sm-mta[33143]: m1E9qKLZ033143: SYSERR(root): collect: I/O error on connection from astro.pryor.com, from=<CUSTOMERSERVICE at EM.PRYOR.COM>
pid 31611 (milter-greylist), uid 25: exited on signal 3
Feb 14 03:17:08 mail sshd[34844]: warning: /etc/hosts.allow, line 45: can't verify hostname: getaddrinfo(host-200-6-102-230.iia.cl, AF_INET) failed
Feb 14 03:17:08 mail sshd[34844]: refused connect from 200.6.102.230 (200.6.102.230)
Feb 14 03:36:30 mail sshd[35944]: refused connect from 202.129.44.218 (202.129.44.218)
Feb 14 03:45:21 mail sshd[36667]: refused connect from 202.129.44.218 (202.129.44.218)
Feb 14 03:52:01 mail sm-mta[33092]: m1E9peX3033092: SYSERR(root): collect: read timeout on connection from astro.pryor.com, from=<CUSTOMERSERVICE at EM.PRYOR.COM>
Feb 14 07:24:01 mail sshd[52723]: warning: /etc/hosts.allow, line 45: can't verify hostname: getaddrinfo(42.215.6.200.intelnet.net.gt, AF_INET) failed
Feb 14 07:24:01 mail sshd[52723]: refused connect from 200.6.215.42 (200.6.215.42)
Feb 14 07:28:56 mail sm-mta[52866]: m1EEPPLC052866: SYSERR(root): collect: I/O error on connection from astro.pryor.com, from=<CUSTOMERSERVICE at EM.PRYOR.COM>
Feb 14 07:29:15 mail sshd[53465]: warning: /etc/hosts.allow, line 45: can't verify hostname: getaddrinfo(42.215.6.200.intelnet.net.gt, AF_INET) failed
Feb 14 07:29:15 mail sshd[53465]: refused connect from 200.6.215.42 (200.6.215.42)
Feb 14 08:01:57 mail sshd[58183]: refused connect from mail.rsib.net (12.46.46.98)
Feb 14 08:07:22 mail sshd[59017]: refused connect from mail.rsib.net (12.46.46.98)
Feb 14 09:50:00 mail su: bbump to root on /dev/ttyp0
pid 43464 (httpd), uid 80: exited on signal 6
pid 86995 (imapd), uid 2151: exited on signal 6
pid 85706 (httpd), uid 80: exited on signal 6
pid 87600 (imapd), uid 1376: exited on signal 6
pid 45621 (httpd), uid 80: exited on signal 6
pid 45617 (httpd), uid 80: exited on signal 6
Feb 14 11:28:36 mail inetd[48076]: imap4 from 208.107.161.82 exceeded counts/min (limit 60/min)
Feb 14 11:28:38 mail last message repeated 2 times
Feb 14 11:52:34 mail sm-mta[99563]: m1EHqX9u099563: SYSERR(root): collect: read timeout on connection from fulltimeconsult.com, from=<AARPMembership at wlq.fulltimsgeconsult.com>
Feb 14 13:06:27 mail su: bbump to root on /dev/ttyp0
pid 45995 (imapd), uid 3115: exited on signal 6
pid 46407 (imapd), uid 1873: exited on signal 6
pid 46418 (imapd), uid 2769: exited on signal 6
pid 46402 (imapd), uid 1873: exited on signal 6
pid 46651 (imapd), uid 2769: exited on signal 6
pid 46653 (imapd), uid 2769: exited on signal 6
pid 44499 (httpd), uid 80: exited on signal 6
pid 47035 (imapd), uid 1873: exited on signal 6
pid 46083 (httpd), uid 80: exited on signal 6
pid 46395 (httpd), uid 80: exited on signal 6
pid 46604 (httpd), uid 80: exited on signal 6
pid 46603 (httpd), uid 80: exited on signal 6
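
A rough way to tally the "exited on signal" lines per program, as a
sketch only.  The sample below is a handful of lines copied from the log
above; in practice the text would be read from the messages file
instead.  The names tally_exits and EXIT_RE are just illustrative:

```python
import re
from collections import Counter

# A few lines copied from the log excerpt above.
SAMPLE_LOG = """\
pid 31611 (milter-greylist), uid 25: exited on signal 3
Feb 14 09:50:00 mail su: bbump to root on /dev/ttyp0
pid 43464 (httpd), uid 80: exited on signal 6
pid 86995 (imapd), uid 2151: exited on signal 6
pid 85706 (httpd), uid 80: exited on signal 6
pid 87600 (imapd), uid 1376: exited on signal 6
"""

# Matches the kernel's "pid N (name), uid N: exited on signal N" message
# anywhere in the text, even when it is glued to the end of another line.
EXIT_RE = re.compile(r"pid (\d+) \(([^)]+)\), uid (\d+): exited on signal (\d+)")


def tally_exits(text):
    """Count abnormal exits keyed by (program name, signal number)."""
    counts = Counter()
    for match in EXIT_RE.finditer(text):
        counts[(match.group(2), int(match.group(4)))] += 1
    return counts


counts = tally_exits(SAMPLE_LOG)
for (prog, sig), n in counts.most_common():
    print(f"{prog} signal {sig}: {n}")
```

Signal 6 is SIGABRT and signal 3 is SIGQUIT, so grouping by signal
separates the httpd/imapd aborts from the milter-greylist exit.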

> what does
> netstat -ni
> give

-bash-2.05b$ netstat -ni
Name    Mtu Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
bge0   1500 <Link#1>      00:0f:1f:66:0e:e6 12511748   902 12025487     0     0
bge0   1500 208.107.160/2 208.107.161.82    17011211     - 16533277     -     -
bge1   1500 <Link#2>      00:0f:1f:66:0e:e8  3523091   586  4089056     0     0
bge1   1500 10.1.1/24     10.1.1.1           3516790     -  4087415     -     -
lo0   16384 <Link#3>                         4659734     0  4659733     0     0
lo0   16384 fe80:3::1/64  fe80:3::1                0     -        0     -     -
lo0   16384 ::1/128       ::1                   2772     -     2772     -     -
lo0   16384 127           127.0.0.1           147255     -   147255     -     -
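
To put the Ierrs counts in proportion, here is a small sketch that
computes input errors as a fraction of input packets from the two
link-level bge rows above.  It assumes the nine-column layout shown
(Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll); rows without an
Address column, such as lo0's link row, are skipped.  The function name
is illustrative:

```python
# The two <Link#N> rows for bge0 and bge1, copied from the output above.
NETSTAT_LINK_ROWS = """\
bge0   1500 <Link#1>      00:0f:1f:66:0e:e6 12511748   902 12025487     0     0
bge1   1500 <Link#2>      00:0f:1f:66:0e:e8  3523091   586  4089056     0     0
"""


def input_error_rates(text):
    """Return {interface: Ierrs / Ipkts} for each nine-column link-level row."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        # Only link-level rows carry error counters; address rows show "-".
        if len(fields) != 9 or not fields[2].startswith("<Link#"):
            continue
        name, ipkts, ierrs = fields[0], int(fields[4]), int(fields[5])
        rates[name] = ierrs / ipkts if ipkts else 0.0
    return rates


rates = input_error_rates(NETSTAT_LINK_ROWS)
for name, rate in sorted(rates.items()):
    print(f"{name}: {rate:.4%} input errors")
```

Both interfaces come out well under 0.1% of input packets.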

> and what options do you have on ifconfig ?  Are the errors seen on
> your switch port as well or just in netstat -ni ?

ifconfig_bge0="inet 208.107.161.82  netmask 255.255.254.0 media 100baseTX mediaopt full-duplex"
ifconfig_bge1="inet 10.1.1.1        netmask 255.255.255.0 media 100baseTX mediaopt full-duplex"

No, the switch shows clear, they only show up as input errors on this box.
The box sitting under this one has an uptime of 621 days with 1 Oerr.

> Why are the processes sigabrting ? Is there anything in the
> application logs to indicate why they are exiting ?
>
>          ---Mike
>

[Thu Feb 14 09:59:23 2008] [notice] child pid 43464 exit signal Abort trap (6)
httpd in malloc(): error: recursive call
[Thu Feb 14 10:07:34 2008] [notice] child pid 85706 exit signal Abort trap (6)
httpd in free(): error: recursive call
[Thu Feb 14 10:48:39 2008] [notice] child pid 45621 exit signal Abort trap (6)
httpd in free(): error: recursive call

These point to memory.  That is why I was willing to throw another 2GB of
memory in it, but then why am I only seeing 268K of swap used?

Brett