system locks up when pf is enabled? (was: system locks up with vr driver on alix board)

Ask Bjørn Hansen ask at develooper.com
Wed Aug 31 03:53:19 UTC 2011


On Aug 16, 2011, at 20:15, Julian Elischer wrote:

> from your description it doesn't sound like a vr problem.
> I suggest you hook up teh serial console (I am guessing you already have)
> and set the config options to allow break-to-debugger or alt-break-to-debugger on it
> when it happens next, drop into the debugger..
> 
> in fact, drop in, and do a ps to see what processes are runnable,
> 'tr [pid] (or thread id)' to get a stack trace of anything that looks
> interesting, and then cont and do it again a few times to get a feel
> for where the processor is hanging out (a straight 'tr' will give you
> the interrupt of the com port which is not intresting..)

Alright, I thought I had locked this down to ZERO_COPY_SOCKETS being enabled in the kernel (as unlikely as it seems that it'd break things without being explicitly used -- I wanted to believe!).

A couple days ago I figured out how to consistently make it happen (or not).  For about a week we'd been running with no trouble; until I realized that a hostname added to our pf rules made them not load on startup.  When running without our pf rules loaded, everything is fine.  WIth the pf rules loaded the system will hang within 2-4 hours.

Our ruleset was about 240 rules.  We cut it down to ~140 rules today to see if that made a difference; it didn't.  We also turned off as much IPv6 traffic as we could in case that was what was disturbing things (adding IPv6 traffic is relatively new).

Flipping the backup system to be the master for the active IP addresses doesn't (as far as I can tell) make the old master recover.

I tried looking for interesting information with the kernel debugger; but honestly nothing springs out as interesting -- though that may be due to my lack of skills more than anything else.   I am including a snapshot below.  I ran ps and tr on 100024, 100025 (vr0 and vr1 interrupt threads) and 100022 (thread taskq) a few more times but it seemed to me to give basically the same thread trace information.

For what it's worth, if I let 'cont' run for long enough for the system to send out carp packets and get it's IPs back, 'ps' seemed more likely to have 100024 or 100025 (vr0/vr1) in 'Run' state.  If I interrupted the system again relatively quickly after continuing, it was more likely to have 'thread tasks' be the busy process.

Any tips for what I can do to extract something more useful would be greatly appreciated.

As a last note -- earlier today I noticed that "Searches" in 'pfctl -si' seemed to go up significantly just before the console locks up (which is 15-30 minutes before the box stops routing packets).   My theory was that we were getting some sort of tiny DoS attack (doesn't take much to take out a computer the size of a CD case); but that doesn't really make sense as the system 1) never recovers and 2) the backup is fine as soon as the master has been turned off / changed priorities.


 - ask


db> ps
  pid  ppid  pgrp   uid   state   wmesg     wchan    cmd
 3300  2371  2371     0  R                           sshd
 3299  3298  3298     0  S+      sbwait   0xc2ddb58c ssh
 3298  3100  3298     0  S+      piperd   0xc2624620 scp
 3100  3098  3100     0  Ss+     pause    0xc2e95850 csh
 3098  2371  3098     0  Rs                          sshd
 2904  2371  2904     0  Ss      select   0xc379a924 sshd
 2476  2474  2476     0  Ss+     ttyin    0xc2e44070 csh
 2474  2371  2474     0  Rs                          sshd
 2440     1  2440     0  Rs+                         getty
 2433     1  2433     0  Rs                          cron
 2426     1  2426    62  Rs                          ftp-proxy
 2397     1  2397     0  Ss      select   0xc2d71964 inetd
 2371     1  2371     0  Rs                          sshd
 2349     1  2349     0  Rs                          bird
 2302     1  2302     0  Rs                          radvd
 2298  2292  2285 65534  S       piperd   0xc2624188 multilog
 2297  2291  2285     0  R                           openvpn
 2292  2287  2285     0  S       select   0xc2d713a4 supervise
 2291  2287  2285     0  S       select   0xc2d71a64 supervise
 2288     1  2285     0  S       piperd   0xc2624c40 readproctitle
 2287     1  2285     0  R                           svscan
 2244     1  2244     0  Rs                          ntpd
 2129     1  2129     0  Rs                          syslogd
   50     0     0     0  SL      mdwait   0xc2717000 [md1]
   40     0     0     0  SL      mdwait   0xc24ef000 [md0]
   22     0     0     0  RL                          [flowcleaner]
   21     0     0     0  SL      sdflush  0xc0b24d80 [softdepflush]
   20     0     0     0  SL      vlruwt   0xc25f52a8 [vnlru]
   19     0     0     0  SL      syncer   0xc0b14774 [syncer]
   18     0     0     0  RL                          [bufdaemon]
   17     0     0     0  RL                          [pagezero]
   16     0     0     0  SL      psleep   0xc0b2567c [vmdaemon]
   15     0     0     0  SL      psleep   0xc0b25644 [pagedaemon]
    9     0     0     0  SL      ccb_scan 0xc0b006d4 [xpt_thrd]
    8     0     0     0  RL                          [pfpurge]
    7     0     0     0  SL      waiting_ 0xc0b181d8 [sctp_iterator]
   14     0     0     0  SL      (threaded)          usb
100036                   D       -        0xc250ed34 [usbus1]
100035                   D       -        0xc250ed04 [usbus1]
100034                   D       -        0xc250ecd4 [usbus1]
100033                   D       -        0xc250eca4 [usbus1]
100032                   D       -        0xc2504b5c [usbus0]
100031                   D       -        0xc2504b2c [usbus0]
100030                   D       -        0xc2504afc [usbus0]
100029                   D       -        0xc2504acc [usbus0]
   13     0     0     0  SL      -        0xc0b0df64 [yarrow]
    6     0     0     0  SL      crypto_r 0xc0b2432c [crypto returns]
    5     0     0     0  SL      crypto_w 0xc0b24308 [crypto]
    4     0     0     0  SL      -        0xc0b0bc44 [g_down]
    3     0     0     0  SL      -        0xc0b0bc40 [g_up]
    2     0     0     0  SL      -        0xc0b0bc38 [g_event]
   12     0     0     0  RL      (threaded)          intr
100037                   I                           [swi0: uart uart]
100028                   I                           [irq12: ohci0 ehci0]
100027                   I                           [irq14: ata0]
100026                   I                           [irq15: vr2 ata1]
100025                   Run     CPU 0               [irq11: vr1]
100024                   RunQ                        [irq10: vr0]
100021                   I                           [swi5: +]
100016                   I                           [swi2: cambio]
100015                   I                           [swi6: task queue]
100014                   I                           [swi6: Giant taskq]
100006                   I                           [swi1: netisr 0]
100005                   I                           [swi4: clock]
100004                   I                           [swi3: vm]
   11     0     0     0  RL                          [idle]
    1     0     1     0  SLs     wait     0xc2433d48 [init]
   10     0     0     0  SL      audit_wo 0xc0b246a0 [audit]
    0     0     0     0  RLs     (threaded)          kernel
100040                   D       -        0xc2527880 [dummynet]
100023                   D       -        0xc24d6580 [glxsb0 taskq]
100022                   RunQ                        [thread taskq]
100020                   D       -        0xc24afd40 [acpi_task_2]
100019                   D       -        0xc24afd40 [acpi_task_1]
100018                   D       -        0xc24afd40 [acpi_task_0]
100017                   D       -        0xc24afd80 [kqueue taskq]
100010                   D       -        0xc24512c0 [firmware taskq]
100000                   D       sched    0xc0b0bd20 [swapper]


db>  tr 100025
Tracing pid 12 tid 100025 td 0xc24d42d0
kdb_enter(c0a48fbe,c0a5d00a,1,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126
intr_event_handle(c2430980,cc918ac4,cc918aa4,cc918b40,4,...) at 0xc06dff2c = intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,cc918ac4,cc918b40,c2ef8a00,cc918b68,...) at 0xc09d4929 = intr_execute_handlers+0x49
atpic_handle_intr(4,cc918ac4) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc09e0f58, esp = 0xcc918b04, ebp = 0xcc918b68 ---
bzero(c2ef8a00,0,12,0,0,...) at 0xc09e0f58 = bzero
ip_input(c2ef8a00,246,c24a77c0,cc918bd0,c0622c21,...) at 0xc083ff8d = ip_input+0x6cd
netisr_dispatch_src(1,0,c2ef8a00,cc918c08,c07bb331,...) at 0xc07c9109 = netisr_dispatch_src+0x89
netisr_dispatch(1,c2ef8a00,c24d8c00,c24d8c00,c2f16008,...) at 0xc07c93a0 = netisr_dispatch+0x20
ether_demux(c24d8c00,c2ef8a00,3,0,3,...) at 0xc07bb331 = ether_demux+0x161
ether_input(c24d8c00,c2ef8a00,c09d6ad0,c0b11780,c24d42d0,...) at 0xc07bb8af = ether_input+0x33f
vr_intr(c2499000,c24d42d0,0,109,dc13a3e2,...) at 0xc0690034 = vr_intr+0x4d4
intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) at 0xc06def8b = intr_event_execute_handlers+0x13b
ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea = ithread_loop+0x6a
fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc918d60, ebp = 0 ---

db> tr 100024
Tracing pid 12 tid 100024 td 0xc24d45a0
sched_switch(c24d45a0,0,109,dc1435c0,660,...) at 0xc07270bf = sched_switch+0x1df
mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a
ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f
fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 ---


db> cont
KDB: enter: Line break on console
[thread pid 0 tid 100022 ]
Stopped at      0xc0733b3a = kdb_enter+0x3a:    movl    $0,0xc0b124b4 = kdb_why
db> ps 
  pid  ppid  pgrp   uid   state   wmesg     wchan    cmd
 3300  2371  2371     0  R                           sshd
 3299  3298  3298     0  S+      sbwait   0xc2ddb58c ssh
 3298  3100  3298     0  S+      piperd   0xc2624620 scp
 3100  3098  3100     0  Ss+     pause    0xc2e95850 csh
 3098  2371  3098     0  Rs                          sshd
 2904  2371  2904     0  Ss      select   0xc379a924 sshd
 2476  2474  2476     0  Ss+     ttyin    0xc2e44070 csh
 2474  2371  2474     0  Rs                          sshd
 2440     1  2440     0  Rs+                         getty
 2433     1  2433     0  Rs                          cron
 2426     1  2426    62  Rs                          ftp-proxy
 2397     1  2397     0  Ss      select   0xc2d71964 inetd
 2371     1  2371     0  Rs                          sshd
 2349     1  2349     0  Rs                          bird
 2302     1  2302     0  Rs                          radvd
 2298  2292  2285 65534  S       piperd   0xc2624188 multilog
 2297  2291  2285     0  R                           openvpn
 2292  2287  2285     0  S       select   0xc2d713a4 supervise
 2291  2287  2285     0  S       select   0xc2d71a64 supervise
 2288     1  2285     0  S       piperd   0xc2624c40 readproctitle
 2287     1  2285     0  R                           svscan
 2244     1  2244     0  Rs                          ntpd
 2129     1  2129     0  Rs                          syslogd
   50     0     0     0  SL      mdwait   0xc2717000 [md1]
   40     0     0     0  SL      mdwait   0xc24ef000 [md0]
   22     0     0     0  RL                          [flowcleaner]
   21     0     0     0  SL      sdflush  0xc0b24d80 [softdepflush]
   20     0     0     0  SL      vlruwt   0xc25f52a8 [vnlru]
   19     0     0     0  SL      syncer   0xc0b14774 [syncer]
   18     0     0     0  RL                          [bufdaemon]
   17     0     0     0  RL                          [pagezero]
   16     0     0     0  SL      psleep   0xc0b2567c [vmdaemon]
   15     0     0     0  SL      psleep   0xc0b25644 [pagedaemon]
    9     0     0     0  SL      ccb_scan 0xc0b006d4 [xpt_thrd]
    8     0     0     0  RL                          [pfpurge]
    7     0     0     0  SL      waiting_ 0xc0b181d8 [sctp_iterator]
   14     0     0     0  SL      (threaded)          usb
100036                   D       -        0xc250ed34 [usbus1]
100035                   D       -        0xc250ed04 [usbus1]
100034                   D       -        0xc250ecd4 [usbus1]
100033                   D       -        0xc250eca4 [usbus1]
100032                   D       -        0xc2504b5c [usbus0]
100031                   D       -        0xc2504b2c [usbus0]
100030                   D       -        0xc2504afc [usbus0]
100029                   D       -        0xc2504acc [usbus0]
   13     0     0     0  SL      -        0xc0b0df64 [yarrow]
    6     0     0     0  SL      crypto_r 0xc0b2432c [crypto returns]
    5     0     0     0  SL      crypto_w 0xc0b24308 [crypto]
    4     0     0     0  SL      -        0xc0b0bc44 [g_down]
    3     0     0     0  SL      -        0xc0b0bc40 [g_up]
    2     0     0     0  SL      -        0xc0b0bc38 [g_event]
   12     0     0     0  WL      (threaded)          intr
100037                   I                           [swi0: uart uart]
100028                   I                           [irq12: ohci0 ehci0]
100027                   I                           [irq14: ata0]
100026                   I                           [irq15: vr2 ata1]
100025                   I                           [irq11: vr1]
100024                   I                           [irq10: vr0]
100021                   I                           [swi5: +]
100016                   I                           [swi2: cambio]
100015                   I                           [swi6: task queue]
100014                   I                           [swi6: Giant taskq]
100006                   I                           [swi1: netisr 0]
100005                   I                           [swi4: clock]
100004                   I                           [swi3: vm]
   11     0     0     0  RL                          [idle]
    1     0     1     0  SLs     wait     0xc2433d48 [init]
   10     0     0     0  SL      audit_wo 0xc0b246a0 [audit]
    0     0     0     0  RLs     (threaded)          kernel
100040                   D       -        0xc2527880 [dummynet]
100023                   D       -        0xc24d6580 [glxsb0 taskq]
100022                   Run     CPU 0               [thread taskq]
100020                   D       -        0xc24afd40 [acpi_task_2]
100019                   D       -        0xc24afd40 [acpi_task_1]
100018                   D       -        0xc24afd40 [acpi_task_0]
100017                   D       -        0xc24afd80 [kqueue taskq]
100010                   D       -        0xc24512c0 [firmware taskq]
100000                   D       sched    0xc0b0bd20 [swapper]
db>  tr 100024
Tracing pid 12 tid 100024 td 0xc24d45a0
sched_switch(c24d45a0,0,109,b67db121,673,...) at 0xc07270bf = sched_switch+0x1df
mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a
ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f
fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 ---
db> tr 100022
Tracing pid 0 tid 100022 td 0xc24d4b40
kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126
intr_event_handle(c2430980,c23f5c20,0,c25db2bc,4,...) at 0xc06dff2c = intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,c23f5c20,c25db2bc,0,c23f5c60,...) at 0xc09d4929 = intr_execute_handlers+0x49
atpic_handle_intr(4,c23f5c20) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc073cf01, esp = 0xc23f5c60, ebp = 0xc23f5c60 ---
sleepq_release(c25db2bc,0,0,0,c24afc80,...) at 0xc073cf01 = sleepq_release+0x31
wakeup(c25db2bc,0,0,0,0,...) at 0xc07109f1 = wakeup+0x41
taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e8e2 = taskqueue_run_locked+0x142
taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 = taskqueue_thread_loop+0x87
fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc23f5d60, ebp = 0 ---

... cont dance again ...

db> tr 100022
Tracing pid 0 tid 100022 td 0xc24d4b40
kdb_enter(c0a48fbe,c0a5d00a,0,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d4b40,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126
intr_event_handle(c2430980,c23f5c48,c23f5c14,c24afc80,4,...) at 0xc06dff2c = intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,c23f5c48,c24afc80,c25db2bc,c23f5cbc,...) at 0xc09d4929 = intr_execute_handlers+0x49
atpic_handle_intr(4,c23f5c48) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc073e7e3, esp = 0xc23f5c88, ebp = 0xc23f5cbc ---
taskqueue_run_locked(c24afc80,c24afc9c,0,c0a66f59,0,...) at 0xc073e7e3 = taskqueue_run_locked+0x43
taskqueue_thread_loop(c0b134ec,c23f5d28,0,0,0,...) at 0xc073ef27 = taskqueue_thread_loop+0x87
fork_exit(c073eea0,c0b134ec,c23f5d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc23f5d60, ebp = 0 ---

db> tr 100024
Tracing pid 12 tid 100024 td 0xc24d45a0
sched_switch(c24d45a0,0,109,4b19c082,684,...) at 0xc07270bf = sched_switch+0x1df
mi_switch(109,0,c0a6b180,52d,c24306f0,...) at 0xc071086a = mi_switch+0x11a
ithread_loop(c24de8c0,cc894d28,0,0,0,...) at 0xc06e06df = ithread_loop+0x15f
fork_exit(c06e0580,c24de8c0,cc894d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc894d60, ebp = 0 ---

db> tr 100025
Tracing pid 12 tid 100025 td 0xc24d42d0
kdb_enter(c0a48fbe,c0a5d00a,58,c2522280,0,...) at 0xc0733b3a = kdb_enter+0x3a
uart_intr(c2522200,c24d42d0,c0afdc90,c2430980,4,...) at 0xc0632206 = uart_intr+0x126
intr_event_handle(c2430980,cc918ac8,0,0,4,...) at 0xc06dff2c = intr_event_handle+0x5c
intr_execute_handlers(c0afdc90,cc918ac8,0,c2efa300,cc918b68,...) at 0xc09d4929 = intr_execute_handlers+0x49
atpic_handle_intr(4,cc918ac8) at 0xc09ee047 = atpic_handle_intr+0x67
Xatpic_intr4() at 0xc09cfa52 = Xatpic_intr4+0x22
--- interrupt, eip = 0xc083e822, esp = 0xcc918b08, ebp = 0xcc918b68 ---
ip_forward(c2efa300,0,12,0,0,...) at 0xc083e822 = ip_forward+0x3b2
ip_input(c2efa300,246,c24a1580,cc918bd0,c0622c21,...) at 0xc083ff8d = ip_input+0x6cd
netisr_dispatch_src(1,0,c2efa300,cc918c08,c07bb331,...) at 0xc07c9109 = netisr_dispatch_src+0x89
netisr_dispatch(1,c2efa300,c24d8c00,c24d8c00,c2f1a808,...) at 0xc07c93a0 = netisr_dispatch+0x20
ether_demux(c24d8c00,c2efa300,3,0,3,...) at 0xc07bb331 = ether_demux+0x161
ether_input(c24d8c00,c2efa300,c09d6ad0,c0b11780,c24d42d0,...) at 0xc07bb8af = ether_input+0x33f
vr_intr(c2499000,c24d42d0,0,109,48d02b08,...) at 0xc0690034 = vr_intr+0x4d4
intr_event_execute_handlers(c24337f8,c2430600,c0a6b180,52d,c2430670,...) at 0xc06def8b = intr_event_execute_handlers+0x13b
ithread_loop(c24de360,cc918d28,0,0,0,...) at 0xc06e05ea = ithread_loop+0x6a
fork_exit(c06e0580,c24de360,cc918d28) at 0xc06dc830 = fork_exit+0x90
fork_trampoline() at 0xc09cf964 = fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xcc918d60, ebp = 0 ---




More information about the freebsd-net mailing list