6.1 busy server periodically hangs, waits, then recovers a couple minutes later - analysis?
Hazlewood
hazlewood at gmail.com
Wed Jun 18 21:55:37 UTC 2008
Hello List,
I've done some searching, but this particular random, temporary lockup
condition doesn't seem to be widely reported. Anyway, here are my symptoms;
I was hoping someone could guide me toward some additional testing or stats
I can gather to help pinpoint the root cause.
The symptoms during this frozen condition are as follows:
- Existing shells continue to be responsive
- Existing sessions, such as HTTP downloads, continue to work
- Programs already running within shells, such as vmstat/systat/iostat,
continue to spit out data
- *New* incoming socket requests, or new commands executed in a shell, sit
there indefinitely and only establish a connection or execute said command
several minutes later, when the system returns to life.
Here's some data to start with:
kern.ostype: FreeBSD
kern.osrelease: 6.1-STABLE
kern.osrevision: 199506
kern.version: FreeBSD 6.1-STABLE #1: Sat Jul 15 03:08:58 MST 2006
net.isr.swi_count: -993194527
net.isr.drop: 0
net.isr.queued: 57342094
net.isr.deferred: 1573529131
net.isr.directed: -1846058772
net.isr.count: -272529641
net.isr.direct: 1
net.route.netisr_maxqlen: 256
I changed net.isr.direct to 1 and saw a dramatic drop in the number of
interrupts and a small drop in context switches, but no real observed change
in the number of lockups.
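As an aside, several of the net.isr counters above print as negative numbers. My assumption (sketch only) is that these are plain 32-bit signed counters that have wrapped past 2**31 - 1; masking recovers the real unsigned totals:

```python
# Sketch, assuming the net.isr statistics are 32-bit signed counters that
# have wrapped, which is why they print as negative values.
def as_unsigned32(n):
    """Reinterpret a wrapped 32-bit signed counter as its unsigned value."""
    return n & 0xFFFFFFFF

print(as_unsigned32(-993194527))   # net.isr.swi_count -> 3301772769
print(as_unsigned32(-1846058772))  # net.isr.directed  -> 2448908524
```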
net.inet.ip.intr_queue_maxlen: 512
net.inet.ip.intr_queue_drops: 10855650
I raised intr_queue_maxlen to 512 and the queue_drops counter no longer increments...
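For anyone following along, these two changes can be made persistent across reboots in /etc/sysctl.conf (these are just the settings described above; whether they're the right values is exactly what I'm unsure of):

```
# /etc/sysctl.conf
net.isr.direct=1                   # dispatch netisr work in the interrupt path
net.inet.ip.intr_queue_maxlen=512  # raised so intr_queue_drops stops incrementing
```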
hw.machine: amd64
hw.model: Intel(R) Xeon(R) CPU 5140 @ 2.33GHz
hw.ncpu: 4
hw.byteorder: 1234
hw.physmem: 8578887680
hw.usermem: 8305831936
hw.pagesize: 4096
hw.floatingpoint: 1
hw.machine_arch: amd64
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection Version - 5.1.5
dev.em.0.%driver: em
dev.em.0.%location: slot=0 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x1096 subvendor=0x15d9
subdevice=0x0000 class=0x020000
dev.em.0.%parent: pci4
dev.em.0.debug_info: -1
dev.em.0.stats: -1
dev.em.0.rx_int_delay: 66
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 666
dev.em.0.tx_abs_int_delay: 666
dev.em.0.rx_processing_limit: -1
I tried adjusting the default interrupt-moderation settings for the em
driver, thinking the large number of interrupts coming from the NIC might
have something to do with it, but so far no change.
dev.mpt.0.%desc: LSILogic SAS Adapter
dev.mpt.0.%driver: mpt
dev.mpt.0.%location: slot=1 function=0
dev.mpt.0.%pnpinfo: vendor=0x1000 device=0x0054 subvendor=0x1000
subdevice=0x3050 class=0x010000
dev.mpt.0.%parent: pci5
dev.mpt.0.debug: 3
This is a little scary: does GIANT-LOCKED mean that this storage subsystem
driver holds the Giant kernel lock while it performs I/O? (Sorry, I'm a
little uneasy after reading about the random bits of FreeBSD 6 that don't
yet use finer-grained locking...)
/var/run/dmesg.boot:mpt0: <LSILogic SAS Adapter> port 0x3000-0x30ff mem
0xc8310000-0xc8313fff,0xc8300000-0xc830ffff irq 24 at device 1.0 on pci5
/var/run/dmesg.boot:mpt0: [GIANT-LOCKED]
/var/run/dmesg.boot:mpt0: MPI Version=1.5.12.0
Here are the various stats and logs...
vmstat -i
interrupt total rate
irq4: sio0 45153724 35
irq6: fdc0 3 0
irq14: ata0 47 0
irq18: em0 6188085724 4798
irq24: mpt0 340758261 264
cpu0: timer 2572161860 1994
cpu1: timer 2541470854 1970
cpu3: timer 2575006045 1996
cpu2: timer 2575266013 1997
Total 16837902531 13057
Here I caught about a minute's worth of the frozen state, as evidenced by the
solid block of 15 to 17 processes in the wait queue and almost nothing else going on.
vmstat 5
procs    memory    page                 disks    faults    cpu
r b w avm fre flt re pi po fr sr da0 da1 in sy cs us sy id
2 4 0 1374060 281876 41 0 0 0 2694 2323 0 0 192 3062 2137 3 10 88
2 11 0 1374060 348312 218 0 0 0 5263 8216 75 107 8129 25567 26444 6 15 79
2 6 0 1374060 396152 0 2 0 0 6063 8359 107 115 8190 25814 26399 6 19 75
0 7 1 1374060 309376 0 1 0 0 5623 0 63 130 8071 23806 25064 6 18 76
1 5 0 1374060 389544 0 0 0 0 5632 8438 83 92 8371 24884 26412 6 15 79
1 7 1 1374060 284640 0 1 0 0 5296 0 91 85 8380 23548 25405 6 18 76
0 5 3 1374060 359252 0 0 0 0 5072 8226 71 79 8294 24216 26170 5 18 77
1 8 0 1374060 411248 0 0 0 0 6034 8231 88 126 8366 27454 27470 6 18 76
1 6 1 1374060 303560 0 1 0 0 5475 0 94 121 8239 25001 25857 6 17 77
1 5 1 1374188 402788 14 1 0 0 5392 10179 52 139 8193 24831 25380 5 17 78
0 5 1 1374192 301200 20 0 0 0 5178 0 86 92 8336 24750 26702 5 15 79
1 7 0 1374192 369736 0 1 0 0 4941 8176 63 74 7998 23055 24414 6 17 78
1 8 1 1374192 275292 13 1 0 0 5192 0 119 102 8230 24862 26339 5 15 79
1 5 0 1374192 364596 459 0 1 1 5755 8438 112 124 8332 26315 26651 5 16 78
0 8 0 1374192 267856 0 1 0 0 4897 0 105 88 7997 22995 24989 6 16 78
1 12 0 1374192 339624 0 0 0 0 4920 8294 75 135 8227 24276 25400 5 16 80
2 5 0 1374200 397088 13 1 0 0 5718 8412 89 91 8116 26028 26313 6 18 77
2 6 0 1374228 292596 10 1 0 0 5374 0 94 96 8075 24513 25413 6 18 76
0 6 1 1374236 341492 7 1 0 0 6057 8342 134 116 8268 27172 27364 6 19 76
3 8 0 1374240 390940 7 1 0 0 5916 8243 108 121 8304 27672 27495 6 18 76
2 7 0 1374252 275052 7 2 0 0 5810 0 129 91 8198 24374 25501 5 18 77
4 8 0 1374252 321808 3 1 0 0 5921 8236 100 113 8273 26616 27128 6 19 75
1 8 0 1374268 272796 11 0 0 0 4735 0 83 114 7848 21924 24042 5 15 80
2 7 0 1374284 324676 7 0 0 0 6110 8201 114 93 8284 24293 25378 6 18 76
2 8 0 1374296 360076 10 1 0 0 6956 8328 136 125 8348 26724 27040 6 19 75
1 5 1 1374312 407356 230 0 0 0 6934 8229 126 131 8319 28257 27213 6 18 76
1 11 1 1374344 300228 13 0 0 0 6493 0 127 97 8354 25438 25542 6 19 75
0 4 1 1374344 348096 13 1 0 0 6719 8345 96 112 6806 28724 24457 7 19 74
1 10 1 1374952 404552 31 0 0 0 5796 8165 107 123 6340 24678 21992 6 17 78
0 13 1 1374952 374140 0 0 0 0 7774 0 97 117 6452 23443 22131 4 14 81
3 6 0 1374956 441772 3 0 0 0 3232 8181 147 110 6429 24878 22451 5 17 78
1 7 0 1374968 341796 10 0 0 0 4838 0 89 115 6542 24689 22643 5 17 78
0 13 0 1375012 414996 11 1 0 0 4828 8282 97 103 6162 21969 20812 4 14 81
0 15 0 1375192 414196 14 0 0 0 39 0 0 0 1695 380 3494 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 1255 23 2565 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 1248 25 2549 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 1012 23 2079 0 0 100
0 15 0 1375192 414196 218 0 0 0 252 0 0 0 828 476 1779 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 783 23 1618 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 809 24 1670 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 725 23 1503 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 545 25 1140 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 489 23 1027 0 0 100
0 15 0 1375192 414196 0 0 0 0 0 0 0 0 509 72 1075 0 0 100
2 14 0 1375376 412284 39 0 0 0 46 0 6 2 5397 359 1170 0 79 21
0 15 0 1375380 410128 99 0 0 0 211 0 10 46 1921 839 1182 0 69 31
1 15 0 1375380 421280 266 0 0 0 907 0 15 0 1582 578 589 0 73 26
0 15 0 1375380 421504 13 0 0 0 45 0 26 1 1581 94 497 0 59 41
0 17 1 1377496 430164 211 0 0 0 1505 0 16 19 2869 2736 2900 2 68 30
2 17 0 1377060 386444 264 1 0 0 970 0 34 30 3532 4017 3917 3 87 10
2 7 0 1377180 414764 1166 5 2 0 8861 8362 124 174 6236 31661 20389 10 26 64
0 7 1 1377180 346060 67 0 0 0 8122 0 77 220 6442 24740 22522 5 21 74
2 4 0 1377180 391492 7 0 0 0 5783 8273 87 72 6218 23687 22724 5 17 78
0 6 0 1377180 324412 10 0 0 0 5293 0 63 111 6949 20025 20603 4 17 79
2 6 0 1377180 426368 15 0 0 0 6419 8211 101 104 6487 18747 19175 5 23 72
0 4 1 1377180 334612 26 1 0 0 4488 0 133 115 6490 23064 22344 5 15 79
2 8 0 1377180 286908 104 0 0 0 4959 0 132 75 6222 18812 18577 5 27 68
1 10 0 1377180 389248 226 1 0 0 6362 9045 125 111 6263 22443 21853 5 17 78
2 11 1 1377180 291636 51 0 0 0 5923 0 98 116 6474 21827 20642 6 26 68
0 4 2 1377180 306052 37 1 0 0 7971 0 68 91 6006 21735 20893 5 18 78
1 6 0 1377180 428352 30 0 0 0 3050 8609 120 81 5949 19114 18973 4 25 71
0 6 1 1377180 374184 24 0 0 0 4525 0 97 92 6483 22294 21853 5 18 77
2 5 1 1377180 319532 4 0 0 0 4965 0 105 88 6348 21421 21703 5 18 77
Here's a normal top -S output; I haven't been able to grab one while the
problem is actually occurring...
last pid: 67537; load averages: 0.72, 0.93, 0.94  up 15+16:05:21  14:12:38
83 processes: 7 running, 63 sleeping, 13 waiting
CPU states: 4.7% user, 0.0% nice, 12.6% system, 0.2% interrupt, 82.5% idle
Mem: 1232M Active, 5912M Inact, 261M Wired, 285M Cache, 214M Buf, 11M Free
Swap: 9216M Total, 32K Used, 9216M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
12 root 1 171 52 0K 16K CPU2 2 344.0H 88.72% idle: cpu2
14 root 1 171 52 0K 16K RUN 0 338.1H 88.33% idle: cpu0
13 root 1 171 52 0K 16K RUN 1 338.3H 85.64% idle: cpu1
11 root 1 171 52 0K 16K CPU3 3 282.4H 62.89% idle: cpu3
24 root 1 -68 0 0K 16K CPU3 3 29.5H 20.17% em0 taskq
66405 squid 1 4 0 832M 819M kqread 3 13:47 13.28% squid
82587 squid 1 4 0 294M 280M kqread 3 885:07 12.55% squid
At first I was thinking the single NIC, and the single CPU that services it,
was draining to 0% idle, but I don't think that's really related, and it
wouldn't explain why new processes can't start in new sessions, etc.
As for dmesg errors, all we see are these messages, which pop up every minute
or so:
mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).
mpt0: mpt_cam_event: 0xe
But once the problem starts, we also see a bunch of timeouts from mpt0:
mpt0: request 0xffffffff8bd10010:19992 timed out for ccb 0xffffff022d40b400 (req->ccb 0xffffff022d40b400)
mpt0: request 0xffffffff8bd0f5c0:19993 timed out for ccb 0xffffff01949b4800 (req->ccb 0xffffff01949b4800)
mpt0: attempting to abort req 0xffffffff8bd10010:19992 function 0
mpt0: request 0xffffffff8bd0fb98:19998 timed out for ccb 0xffffff020d3f3c00 (req->ccb 0xffffff020d3f3c00)
mpt0: completing timedout/aborted req 0xffffffff8bd10010:19992
mpt0: request 0xffffffff8bd0a9c8:20002 timed out for ccb 0xffffff022e147800 (req->ccb 0xffffff022e147800)
mpt0: request 0xffffffff8bd098f0:20003 timed out for ccb 0xffffff0000f01800 (req->ccb 0xffffff0000f01800)
(da2:mpt0:0:17:0): WRITE(10). CDB: 2a 0 c 5e f cf 0 0 80 0
(da2:mpt0:0:17:0): CAM Status: SCSI Status Error
(da2:mpt0:0:17:0): SCSI Status: Check Condition
(da2:mpt0:0:17:0): UNIT ATTENTION asc:29,7
(da2:mpt0:0:17:0): Reserved ASC/ASCQ pair
(da2:mpt0:0:17:0): Retrying Command (per Sense Data)
mpt0: abort of req 0xffffffff8bd10010:0 completed
mpt0: attempting to abort req 0xffffffff8bd0f5c0:19993 function 0
mpt0: request 0xffffffff8bd13978:20004 timed out for ccb 0xffffff0000d60000 (req->ccb 0xffffff0000d60000)
mpt0: completing timedout/aborted req 0xffffffff8bd0f5c0:19993
mpt0: request 0xffffffff8bd0e648:20005 timed out for ccb 0xffffff0000cb2000 (req->ccb 0xffffff0000cb2000)
<snip>
mpt0: abort of req 0xffffffff8bd12a58:0 completed
mpt0: attempting to abort req 0xffffffff8bd130e0:20027 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd130e0:20027
mpt0: abort of req 0xffffffff8bd130e0:0 completed
mpt0: attempting to abort req 0xffffffff8bd11f58:20028 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd11f58:20028
mpt0: abort of req 0xffffffff8bd11f58:0 completed
mpt0: attempting to abort req 0xffffffff8bd0d4c0:20029 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd0d4c0:20029
<snip>
Etc...
Here's what some typical disk activity looks like:
iostat -d 5
da0 da1 da2 da3
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s
47.87 45 2.11 51.09 54 2.69 50.72 56 2.77 50.09 61 3.00
47.47 93 4.30 54.25 92 4.87 49.92 77 3.77 43.49 81 3.45
70.62 49 3.41 62.31 80 4.89 54.47 86 4.58 44.36 70 3.03
56.75 50 2.79 52.30 103 5.26 64.82 56 3.53 50.95 55 2.73
59.76 74 4.34 63.63 64 3.98 43.97 141 6.04 45.34 104 4.60
41.91 105 4.30 50.78 96 4.75 61.19 66 3.96 50.23 82 4.00
54.88 60 3.20 64.17 70 4.37 71.42 74 5.16 40.63 104 4.11
52.17 110 5.60 54.50 86 4.55 50.85 95 4.74 36.75 91 3.28
I have not used WITNESS before, but would this be a good time to start? Is
the server simply too busy? What else could I look for, or try tweaking, to
get around a problem that doesn't happen at lower, off-peak load levels?
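For the WITNESS question above, my understanding (unverified; corrections welcome) is that it means rebuilding the kernel with something like the following options, at a real performance cost on a busy production box:

```
# Kernel config additions for lock diagnostics (illustrative only)
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN   # skip spin-lock checking to reduce overhead
```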
Thanks,
Hazlewood