6.1 busy server periodically hangs, waits, then recovers a couple minutes later - analysis?
Hazlewood
hazlewood at gmail.com
Wed Jun 18 21:55:37 UTC 2008
Hello List,
I've done some searching, but this particular random, temporary lockup
condition doesn't seem to be widely reported. Anyway, here are my symptoms;
I was hoping someone could guide me toward some additional testing or stats
I can gather to help pinpoint the root cause.
The symptoms during this frozen condition are as follows:
- Existing shells continue to be responsive
- Existing sessions, such as HTTP downloads, continue to work
- Programs already running within shells, such as vmstat/systat/iostat,
continue to spit out data
- *New* incoming socket requests, or new commands executed in a shell, sit
there indefinitely and only establish a connection or execute said command
several minutes later, when the system returns to life.
Here's some data to start with:
kern.ostype: FreeBSD
kern.osrelease: 6.1-STABLE
kern.osrevision: 199506
kern.version: FreeBSD 6.1-STABLE #1: Sat Jul 15 03:08:58 MST 2006
net.isr.swi_count: -993194527
net.isr.drop: 0
net.isr.queued: 57342094
net.isr.deferred: 1573529131
net.isr.directed: -1846058772
net.isr.count: -272529641
net.isr.direct: 1
net.route.netisr_maxqlen: 256
I changed net.isr.direct to 1 and saw a dramatic drop in the number of
interrupts and a small drop in context switches, but no real observed change
in the number of lockups.
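As an aside, several of the net.isr counters above print as negative numbers. My assumption (sketch only) is that these are plain 32-bit signed counters that have wrapped past 2**31 - 1; masking recovers the real unsigned totals:

```python
# Sketch, assuming the net.isr statistics are 32-bit signed counters that
# have wrapped, which is why they print as negative values.
def as_unsigned32(n):
    """Reinterpret a wrapped 32-bit signed counter as its unsigned value."""
    return n & 0xFFFFFFFF

print(as_unsigned32(-993194527))   # net.isr.swi_count -> 3301772769
print(as_unsigned32(-1846058772))  # net.isr.directed  -> 2448908524
```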
net.inet.ip.intr_queue_maxlen: 512
net.inet.ip.intr_queue_drops: 10855650
I raised intr_queue_maxlen to 512 and the queue_drops counter no longer increments...
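For anyone following along, these two changes can be made persistent across reboots in /etc/sysctl.conf (these are just the settings described above; whether they're the right values is exactly what I'm unsure of):

```
# /etc/sysctl.conf
net.isr.direct=1                   # dispatch netisr work in the interrupt path
net.inet.ip.intr_queue_maxlen=512  # raised so intr_queue_drops stops incrementing
```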
hw.machine: amd64
hw.model: Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
hw.ncpu: 4
hw.byteorder: 1234
hw.physmem: 8578887680
hw.usermem: 8305831936
hw.pagesize: 4096
hw.floatingpoint: 1
hw.machine_arch: amd64
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection Version - 5.1.5
dev.em.0.%driver: em
dev.em.0.%location: slot=0 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x1096 subvendor=0x15d9
subdevice=0x0000 class=0x020000
dev.em.0.%parent: pci4
dev.em.0.debug_info: -1
dev.em.0.stats: -1
dev.em.0.rx_int_delay: 66
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 666
dev.em.0.tx_abs_int_delay: 666
dev.em.0.rx_processing_limit: -1
I tried adjusting the default interrupt-moderation settings for the em
driver, thinking the large number of interrupts coming from the NIC might
have something to do with it, but so far no change.
dev.mpt.0.%desc: LSILogic SAS Adapter
dev.mpt.0.%driver: mpt
dev.mpt.0.%location: slot=1 function=0
dev.mpt.0.%pnpinfo: vendor=0x1000 device=0x0054 subvendor=0x1000
subdevice=0x3050 class=0x010000
dev.mpt.0.%parent: pci5
dev.mpt.0.debug: 3
This is a little scary: does GIANT-LOCKED mean that this storage subsystem
driver holds the Giant kernel lock while it performs I/O? (Sorry, I'm a
little uneasy after reading about the random bits of FreeBSD 6 that don't
yet use finer-grained locking...)
/var/run/dmesg.boot:mpt0: <LSILogic SAS Adapter> port 0x3000-0x30ff mem
0xc8310000-0xc8313fff,0xc8300000-0xc830ffff irq 24 at device 1.0 on pci5
/var/run/dmesg.boot:mpt0: [GIANT-LOCKED]
/var/run/dmesg.boot:mpt0: MPI Version=1.5.12.0
Here are the various stats and logs...
vmstat -i
interrupt total rate
irq4: sio0 45153724 35
irq6: fdc0 3 0
irq14: ata0 47 0
irq18: em0 6188085724 4798
irq24: mpt0 340758261 264
cpu0: timer 2572161860 1994
cpu1: timer 2541470854 1970
cpu3: timer 2575006045 1996
cpu2: timer 2575266013 1997
Total 16837902531 13057
Here I caught about a minute's worth of the frozen state, as evidenced by the
solid block of 15 to 17 processes in the wait queue and almost nothing else going on.
vmstat 5
procs    memory    page                 disks    faults    cpu
r b w avm fre flt re pi po fr sr da0 da1 in sy cs us sy id
2 4 0 1374060 281876 41 0 0 0 2694 2323 0 0 192 3062 2137 3 10 88
2 11 0 1374060 348312 218 0 0 0 5263 8216 75 107 8129 25567 26444 6 15 79
2 6 0 1374060 396152 0 2 0 0 6063 8359 107 115 8190 25814 26399 6 19 75
0 7 1 1374060 309376 0 1 0 0 5623 0 63 130 8071 23806 25064 6 18 76
1 5 0 1374060 389544 0 0 0 0 5632 8438 83 92 8371 24884 26412 6 15 79
1 7 1 1374060 284640 0 1 0 0 5296 0 91 85 8380 23548 25405 6 18 76
0 5 3 1374060 359252 0 0 0 0 5072 8226 71 79 8294 24216 26170 5 18 77
1 8 0 1374060 411248 0 0 0 0 6034 8231 88 126 8366 27454 27470 6 18 76
1 6 1 1374060 303560 0 1 0 0 5475 0 94 121 8239 25001 25857 6 17 77
1 5 1 1374188 402788 14 1 0 0 5392 10179 52 139 8193 24831 25380 5 17 78
0 5 1 1374192 301200 20 0 0 0 5178 0 86 92 8336 24750 26702 5 15 79
1 7 0 1374192 369736 0 1 0 0 4941 8176 63 74 7998 23055 24414 6 17 78
1 8 1 1374192 275292 13 1 0 0 5192 0 119 102 8230 24862 26339 5 15 79
1 5 0 1374192 364596 459 0 1 1 5755 8438 112 124 8332 26315 26651 5 16 78
0 8 0 1374192 267856 0 1 0 0 4897 0 105 88 7997 22995 24989 6 16 78
1 12 0 1374192 339624 0 0 0 0 4920 8294 75 135 8227 24276 25400 5 16 80
2 5 0 1374200 397088 13 1 0 0 5718 8412 89 91 8116 26028 26313 6 18 77
2 6 0 1374228 292596 10 1 0 0 5374 0 94 96 8075 24513 25413 6 18 76
0 6 1 1374236 341492 7 1 0 0 6057 8342 134 116 8268 27172 27364 6 19 76
3 8 0 1374240 390940 7 1 0 0 5916 8243 108 121 8304 27672 27495 6 18 76
2 7 0 1374252 275052 7 2 0 0 5810 0 129 91 8198 24374 25501 5 18 77
4 8 0 1374252 321808 3 1 0 0 5921 8236 100 113 8273 26616 27128 6 19 75
1 8 0 1374268 272796 11 0 0 0 4735 0 83 114 7848 21924 24042 5 15 80
2 7 0 1374284 324676 7 0 0 0 6110 8201 114 93 8284 24293 25378 6 18 76
2 8 0 1374296 360076 10 1 0 0 6956 8328 136 125 8348 26724 27040 6 19 75
1 5 1 1374312 407356 230 0 0 0 6934 8229 126 131 8319 28257 27213 6 18 76
1 11 1 1374344 300228 13 0 0 0 6493 0 127 97 8354 25438 25542 6 19 75
0 4 1 1374344 348096 13 1 0 0 6719 8345 96 112 6806 28724 24457 7 19 74
1 10 1 1374952 404552 31 0 0 0 5796 8165 107 123 6340 24678 21992 6 17 78
0 13 1 1374952 374140 0 0 0 0 7774 0 97 117 6452 23443 22131 4 14 81
3 6 0 1374956 441772 3 0 0 0 3232 8181 147 110 6429 24878 22451 5 17 78
1 7 0 1374968 341796 10 0 0 0 4838 0 89 115 6542 24689 22643 5 17 78
0 13 0 1375012 414996 11 1 0 0 4828 8282 97 103 6162 21969 20812 4 14 81
0 15 0 1375192 414196 14 0 0 0 39 0 0 0 1695 380 3494 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 1255 23 2565 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 1248 25 2549 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 1012 23 2079 0 0 100
0 15 0 1375192 414196 218 0 0 0 252 0 0 0 828 476 1779 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 783 23 1618 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 809 24 1670 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 725 23 1503 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 545 25 1140 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 489 23 1027 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 509 72 1075 0 0 100
2 14 0 1375376 412284 39 0 0 0 46 0 6 2 5397 359 1170 0 79 21
0 15 0 1375380 410128 99 0 0 0 211 0 10 46 1921 839 1182 0 69 31
1 15 0 1375380 421280 266 0 0 0 907 0 15 0 1582 578 589 0 73 26
0 15 0 1375380 421504 13 0 0 0 45 0 26 1 1581 94 497 0 59 41
0 17 1 1377496 430164 211 0 0 0 1505 0 16 19 2869 2736 2900 2 68 30
2 17 0 1377060 386444 264 1 0 0 970 0 34 30 3532 4017 3917 3 87 10
2 7 0 1377180 414764 1166 5 2 0 8861 8362 124 174 6236 31661 20389 10 26 64
0 7 1 1377180 346060 67 0 0 0 8122 0 77 220 6442 24740 22522 5 21 74
2 4 0 1377180 391492 7 0 0 0 5783 8273 87 72 6218 23687 22724 5 17 78
0 6 0 1377180 324412 10 0 0 0 5293 0 63 111 6949 20025 20603 4 17 79
2 6 0 1377180 426368 15 0 0 0 6419 8211 101 104 6487 18747 19175 5 23 72
0 4 1 1377180 334612 26 1 0 0 4488 0 133 115 6490 23064 22344 5 15 79
2 8 0 1377180 286908 104 0 0 0 4959 0 132 75 6222 18812 18577 5 27 68
1 10 0 1377180 389248 226 1 0 0 6362 9045 125 111 6263 22443 21853 5 17 78
2 11 1 1377180 291636 51 0 0 0 5923 0 98 116 6474 21827 20642 6 26 68
0 4 2 1377180 306052 37 1 0 0 7971 0 68 91 6006 21735 20893 5 18 78
1 6 0 1377180 428352 30 0 0 0 3050 8609 120 81 5949 19114 18973 4 25 71
0 6 1 1377180 374184 24 0 0 0 4525 0 97 92 6483 22294 21853 5 18 77
2 5 1 1377180 319532 4 0 0 0 4965 0 105 88 6348 21421 21703 5 18 77
Here's a normal top -S output; I haven't been able to grab one while the
problem is actually occurring...
last pid: 67537; load averages: 0.72, 0.93, 0.94  up 15+16:05:21  14:12:38
83 processes: 7 running, 63 sleeping, 13 waiting
CPU states: 4.7% user, 0.0% nice, 12.6% system, 0.2% interrupt, 82.5% idle
Mem: 1232M Active, 5912M Inact, 261M Wired, 285M Cache, 214M Buf, 11M Free
Swap: 9216M Total, 32K Used, 9216M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
12 root 1 171 52 0K 16K CPU2 2 344.0H 88.72% idle: cpu2
14 root 1 171 52 0K 16K RUN 0 338.1H 88.33% idle: cpu0
13 root 1 171 52 0K 16K RUN 1 338.3H 85.64% idle: cpu1
11 root 1 171 52 0K 16K CPU3 3 282.4H 62.89% idle: cpu3
24 root 1 -68 0 0K 16K CPU3 3 29.5H 20.17% em0 taskq
66405 squid 1 4 0 832M 819M kqread 3 13:47 13.28% squid
82587 squid 1 4 0 294M 280M kqread 3 885:07 12.55% squid
At first I was thinking the single NIC, and the single CPU that services it,
was draining to 0% idle, but I don't think that's really related, and it
wouldn't explain why new processes can't start in new sessions, etc.
As for dmesg errors, all we see are these messages, which pop up every minute
or so:
mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).
mpt0: mpt_cam_event: 0xe
But once the problem starts, we also see a bunch of timeouts from mpt0:
mpt0: request 0xffffffff8bd10010:19992 timed out for ccb 0xffffff022d40b400 (req->ccb 0xffffff022d40b400)
mpt0: request 0xffffffff8bd0f5c0:19993 timed out for ccb 0xffffff01949b4800 (req->ccb 0xffffff01949b4800)
mpt0: attempting to abort req 0xffffffff8bd10010:19992 function 0
mpt0: request 0xffffffff8bd0fb98:19998 timed out for ccb 0xffffff020d3f3c00 (req->ccb 0xffffff020d3f3c00)
mpt0: completing timedout/aborted req 0xffffffff8bd10010:19992
mpt0: request 0xffffffff8bd0a9c8:20002 timed out for ccb 0xffffff022e147800 (req->ccb 0xffffff022e147800)
mpt0: request 0xffffffff8bd098f0:20003 timed out for ccb 0xffffff0000f01800 (req->ccb 0xffffff0000f01800)
(da2:mpt0:0:17:0): WRITE(10). CDB: 2a 0 c 5e f cf 0 0 80 0
(da2:mpt0:0:17:0): CAM Status: SCSI Status Error
(da2:mpt0:0:17:0): SCSI Status: Check Condition
(da2:mpt0:0:17:0): UNIT ATTENTION asc:29,7
(da2:mpt0:0:17:0): Reserved ASC/ASCQ pair
(da2:mpt0:0:17:0): Retrying Command (per Sense Data)
mpt0: abort of req 0xffffffff8bd10010:0 completed
mpt0: attempting to abort req 0xffffffff8bd0f5c0:19993 function 0
mpt0: request 0xffffffff8bd13978:20004 timed out for ccb 0xffffff0000d60000 (req->ccb 0xffffff0000d60000)
mpt0: completing timedout/aborted req 0xffffffff8bd0f5c0:19993
mpt0: request 0xffffffff8bd0e648:20005 timed out for ccb 0xffffff0000cb2000 (req->ccb 0xffffff0000cb2000)
<snip>
mpt0: abort of req 0xffffffff8bd12a58:0 completed
mpt0: attempting to abort req 0xffffffff8bd130e0:20027 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd130e0:20027
mpt0: abort of req 0xffffffff8bd130e0:0 completed
mpt0: attempting to abort req 0xffffffff8bd11f58:20028 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd11f58:20028
mpt0: abort of req 0xffffffff8bd11f58:0 completed
mpt0: attempting to abort req 0xffffffff8bd0d4c0:20029 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd0d4c0:20029
<snip>
Etc...
Here's what some typical disk activity looks like:
iostat -d 5
da0 da1 da2 da3
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s
47.87 45 2.11 51.09 54 2.69 50.72 56 2.77 50.09 61 3.00
47.47 93 4.30 54.25 92 4.87 49.92 77 3.77 43.49 81 3.45
70.62 49 3.41 62.31 80 4.89 54.47 86 4.58 44.36 70 3.03
56.75 50 2.79 52.30 103 5.26 64.82 56 3.53 50.95 55 2.73
59.76 74 4.34 63.63 64 3.98 43.97 141 6.04 45.34 104 4.60
41.91 105 4.30 50.78 96 4.75 61.19 66 3.96 50.23 82 4.00
54.88 60 3.20 64.17 70 4.37 71.42 74 5.16 40.63 104 4.11
52.17 110 5.60 54.50 86 4.55 50.85 95 4.74 36.75 91 3.28
I have not used WITNESS before, but would this be a good time to start? Is
the server simply too busy? What else could I look for, or try tweaking, to
get around a problem that doesn't happen at lower, off-peak load levels?
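For the WITNESS question above, my understanding (unverified; corrections welcome) is that it means rebuilding the kernel with something like the following options, at a real performance cost on a busy production box:

```
# Kernel config additions for lock diagnostics (illustrative only)
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN   # skip spin-lock checking to reduce overhead
```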
Thanks,
Hazlewood