FreeBSD 4.9 goes boom!

Wed Mar 24 13:49:28 PST 2004

Problem: The load average on a FreeBSD 4.9 server quickly climbs to very high
levels, around 300. The system becomes unusable and, if we are lucky, reboots
on its own. More often we have to call a tech to power-cycle it.

Here is the setup:

I have a FreeBSD 4.9 server on a P4 with 256MB of RAM and an IDE drive.
We were using HiTech RAID-1, but it was flaky, so now I'm just using a single
drive on the regular IDE controller.

CPU: Intel(R) Pentium(R) 4 CPU 1500MHz (1494.47-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf07  Stepping = 7

  Features=0x3febf9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM>
real memory  = 268369920 (262080K bytes)
avail memory = 257400832 (251368K bytes)
Warning: Pentium 4 CPU: PSE disabled
Pentium Pro MTRR support enabled
atapci0: <Intel ICH2 ATA100 controller> port 0xf000-0xf00f at device 31.1 on pci0
ad0: 38166MB <WDC WD400BB-00GFA0> [77545/16/63] at ata0-master UDMA33

On this server I have several jails:

jail 1: running apache and serving about 6 hits/s on average.
jails 2 - 7: running apache, generally with just one child each, for SSL
(several SSL sites, several jails -- I'm moving to a single SSL jail and
using natd later).
jail 8: an ssh jail for people to manage the sites.

During normal loads we are okay on memory. (I am adding more.)

At all times we have about 1GB of swap space free.

Normally, my 5 and 15 minute load averages are around 0.5 (I can watch column
r in vmstat and see we usually have 0 or 1 processes waiting). This is top
during normal operation:

last pid:  7924;  load averages:  0.11,  0.25,  0.49  up 0+00:39:40  15:30:01
345 processes: 2 running, 342 sleeping, 1 zombie

Mem: 137M Active, 27M Inact, 52M Wired, 2284K Cache, 35M Buf, 30M Free
Swap: 2048M Total, 31M Used, 2017M Free, 1% Inuse

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 7914 root      30   0  2264K  1320K RUN      0:00 31.00%  1.51% top
 7883 root       2   0  6600K  6016K sbwait   0:00 13.84%  1.32% perl
 6660 nobody     2   0 17940K 12676K sbwait   0:01  1.07%  1.07% httpd
 7930 root      29   0  1852K   924K RUN      0:00 17.00%  0.83% top
  763 nobody    18   0 15004K  7144K lockf    0:02  0.15%  0.15% httpd
 7828 nobody     2   0 17732K 12424K accept   0:00  0.37%  0.15% httpd
 4586 nobody     2   0 17944K 12604K sbwait   0:01  0.10%  0.10% httpd
 7868 nobody     2   0 16376K 10944K accept   0:00  1.03%  0.10% httpd
 7910 root      -6   0  1968K  1356K piperd   0:00  2.00%  0.10% perl
 1461 nobody    18   0 14628K  6780K lockf    0:02  0.05%  0.05% httpd
 2812 nobody    18   0 14368K  6620K lockf    0:02  0.05%  0.05% httpd
 4575 nobody     2   0 17768K 12480K accept   0:01  0.05%  0.05% httpd
 4593 nobody     2   0 18080K 12780K sbwait   0:05  0.00%  0.00% httpd
 4422 root       2   0 16100K 10264K select   0:03  0.00%  0.00% httpd
 4595 nobody     2   0 17984K 12728K sbwait   0:03  0.00%  0.00% httpd
  764 nobody    18   0 14992K  7300K lockf    0:02  0.00%  0.00% httpd
 4560 nobody     2   0 17944K 12684K sbwait   0:02  0.00%  0.00% httpd
 4561 nobody     2   0 17944K 12672K sbwait   0:02  0.00%  0.00% httpd

But when the system is about to crash, the load just skyrockets:

last pid: 88248;  load averages: 238.98, 197.07, 127.85  up 2+17:12:36  14:45:38
709 processes: 257 running, 421 sleeping, 31 zombie

Mem: 143M Active, 21M Inact, 75M Wired, 7908K Cache, 35M Buf, 1844K Free
Swap: 2048M Total, 488M Used, 1560M Free, 23% Inuse

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
88185 root       2   0  6504K  5736K connec   0:00  1.47%  0.93% perl
25298 nobody   -18   0 13700K  1596K vmpfw    0:13  0.59%  0.39% httpd
57349 nobody   -18   0 14788K  1588K spread   0:10  0.57%  0.39% httpd
18115 nobody   -18   0 14224K  1604K vmpfw    0:21  0.39%  0.24% httpd
39876 root       2   0  2716K     0K RUN     10:12  0.00%  0.00% <top>
84557 nobody     2   0 22600K     0K RUN      9:54  0.00%  0.00% <httpd>
84567 nobody     2   0 22360K     0K sbwait   9:47  0.00%  0.00% <httpd>
84568 nobody     2   0 22564K     0K RUN      9:47  0.00%  0.00% <httpd>
84564 nobody     2   0 22680K     0K sbwait   9:41  0.00%  0.00% <httpd>
84556 nobody   -22   0 21092K   580K swread   9:39  0.00%  0.00% httpd
84554 nobody     2   0 22592K     0K RUN      9:32  0.00%  0.00% <httpd>
84555 nobody     2   0 22608K     0K RUN      9:31  0.00%  0.00% <httpd>
84558 nobody     2   0 22580K     0K RUN      9:22  0.00%  0.00% <httpd>
84563 nobody     2   0 22692K     0K RUN      9:07  0.00%  0.00% <httpd>
84560 nobody     2   0 22580K     0K RUN      8:56  0.00%  0.00% <httpd>
84398 root       2   0 21052K  1604K select   4:14  0.00%  0.00% httpd
   94 root       2   0   360K     0K nfsd     3:03  0.00%  0.00% <nfsd>
 3730 nobody    18   0 14888K     0K lockf    1:23  0.00%  0.00% <httpd>

Wired memory is at 75M and only 1844K is free, so the system has very little
memory left at this point.
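
To put numbers on the difference between the two top snapshots above, here is
a rough Python sketch. It only parses the Mem: and Swap: lines as pasted in
this post; the helper names (to_mb, parse_top_line) are made up for
illustration and are not part of any tool.

```python
import re

# Megabytes per unit for the size suffixes that top prints.
UNITS_IN_MB = {"K": 1.0 / 1024, "M": 1.0, "G": 1024.0}


def to_mb(token):
    """Convert a top-style size such as '2284K' or '137M' to megabytes."""
    return float(token[:-1]) * UNITS_IN_MB[token[-1]]


def parse_top_line(line):
    """Turn a 'Mem:' or 'Swap:' line from top into {label: megabytes}.

    Percentage fields such as '1% Inuse' are skipped on purpose.
    """
    return {label: to_mb(size)
            for size, label in re.findall(r"(\d+[KMG]) (\w+)", line)}


# The four lines below are copied from the two snapshots in this post.
normal_mem = parse_top_line(
    "Mem: 137M Active, 27M Inact, 52M Wired, 2284K Cache, 35M Buf, 30M Free")
crash_mem = parse_top_line(
    "Mem: 143M Active, 21M Inact, 75M Wired, 7908K Cache, 35M Buf, 1844K Free")
normal_swap = parse_top_line("Swap: 2048M Total, 31M Used, 2017M Free, 1% Inuse")
crash_swap = parse_top_line("Swap: 2048M Total, 488M Used, 1560M Free, 23% Inuse")

# Print how each field changed between the normal and the crash snapshot.
for label in ("Free", "Wired"):
    print(f"{label} memory: {normal_mem[label]:.1f}M -> {crash_mem[label]:.1f}M")
print(f"Swap used: {normal_swap['Used']:.0f}M -> {crash_swap['Used']:.0f}M")
```

Between the two snapshots, swap in use grows from 31M to 488M while free
memory drops from 30M to 1844K.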

I am using bsdsar to record system activity. The system crashed around 14:45
today; these are the samples around that time:

Time   ad0  ad1  ad2  ad3  da0  da1  da2  da3  da4  da5  da6
13:40    0
14:00   33
14:20  146
15:00   40

Time     % User  % Sys  % Nice  % Intrpt  % Idle
13:40       1      2       0         2      96
14:00      11      2       0         0      87
14:20       0     12       0         0      88
15:00      10      6       0         0      84

Time   Free Mem  Active Mem  Inactive Mem  Total Swap  Used Swap  Free Swap
13:40     11M       129M           33M      2097024k    162608k    1934416k
14:00   5936K       149M           14M      2097024k    159464k    1937560k
14:20    904K       144M           24M      2097024k    303504k    1793520k
15:00    656K       163M           19M      2097024k      9544k    2087480k
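
As a quick sanity check on the swap columns, here is a small Python sketch
that works out how much of the swap was in use in each sample. The rows are
copied by hand from the table above, and swap_used_percent is just a name I
picked for illustration.

```python
# Rows copied from the bsdsar memory/swap table above:
# (time, total swap, used swap, free swap), all in kilobytes.
SAMPLES = [
    ("13:40", 2097024, 162608, 1934416),
    ("14:00", 2097024, 159464, 1937560),
    ("14:20", 2097024, 303504, 1793520),
    ("15:00", 2097024, 9544, 2087480),
]


def swap_used_percent(total_k, used_k):
    """Return used swap as a percentage of total swap."""
    return 100.0 * used_k / total_k


for when, total_k, used_k, free_k in SAMPLES:
    # Used + free should add up to the total in every sample.
    assert used_k + free_k == total_k
    pct = swap_used_percent(total_k, used_k)
    print(f"{when}  used {used_k / 1024:7.1f} MB  ({pct:4.1f}% of swap)")
```

Swap in use nearly doubles between 13:40 and 14:20 (162608k to 303504k). The
15:00 sample is down to 9544k, which I assume reflects the reboot in between.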

I looked in /var/log/messages and saw nothing that obviously explains the
crash. I do have a lot of these, though:

Mar 24 13:49:49 europa /kernel: got bad cookie vp 0xd257ca00 bp 0xc651b57c
Mar 24 13:49:49 europa /kernel: got bad cookie vp 0xd257ca00 bp 0xc650a524
Mar 24 13:49:49 europa /kernel: got bad cookie vp 0xd257ca00 bp 0xc651b57c

They seem to come in spurts, once or twice an hour.
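
To check how the spurts line up with the crashes, the messages could be
tallied per hour. This is a minimal Python sketch, assuming the syslog line
format shown above (month, day, HH:MM:SS at the start of each line); the
function name count_bad_cookie_per_hour is hypothetical.

```python
from collections import Counter


def count_bad_cookie_per_hour(lines):
    """Count 'got bad cookie' kernel messages per hour.

    Assumes each line starts with 'Mon DD HH:MM:SS', as in the
    /var/log/messages excerpt above.
    """
    counts = Counter()
    for line in lines:
        if "got bad cookie" not in line:
            continue
        month, day, clock = line.split()[:3]
        hour = clock.split(":")[0]
        counts[f"{month} {day} {hour}:00"] += 1
    return counts


# The three sample lines from this post.
sample = [
    "Mar 24 13:49:49 europa /kernel: got bad cookie vp 0xd257ca00 bp 0xc651b57c",
    "Mar 24 13:49:49 europa /kernel: got bad cookie vp 0xd257ca00 bp 0xc650a524",
    "Mar 24 13:49:49 europa /kernel: got bad cookie vp 0xd257ca00 bp 0xc651b57c",
]
for bucket, n in sorted(count_bad_cookie_per_hour(sample).items()):
    print(bucket, n)

# On the server itself this would read the real log instead:
#   with open("/var/log/messages") as f:
#       counts = count_bad_cookie_per_hour(f)
```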

Any ideas?