mpd has hung

Mikolaj Golub to.my.trociny at gmail.com
Fri Feb 19 13:41:41 UTC 2010


On Thu, 18 Feb 2010 17:32:37 +0200 Nikos Vassiliadis wrote:

> On 2/17/2010 3:26 PM, Alexander Shikoff wrote:
>> Hello All,
>>
>> I have mpd 5.3 running on 8.0-RC1 as a PPPoE server (currently only 5 clients).
>> Today the mpd process hung; I cannot kill it with a -9 signal, and I cannot
>> access its console via telnet.
>>
>> State of process in `top` output is STOP:
>> 73551 root          2  44    0 29588K  5692K STOP    6   0:32  0.00% mpd5
>>
>> # procstat -kk 73551
>>    PID    TID COMM             TDNAME           KSTACK
>> 73551 100233 mpd5             -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 flowtable_flush+0x51 if_detach+0x2f2 ng_iface_shutdown+0x1e ng_rmnode+0x167 ng_apply_item+0xef7 ng_snd_item+0x2ce ngc_send+0x1d2 sosend_generic+0x3f6 kern_sendit+0x13d sendit+0xdc sendto+0x4d syscall+0x1da Xfast_syscall+0xe1
>> 73551 100502 mpd5             -                mi_switch+0x16f thread_suspend_switch+0xc6 thread_single+0x1b6 exit1+0x72 sigexit+0x7c postsig+0x306 ast+0x279 doreti_ast+0x1f
>>
>> Is there a way to kill such a process without rebooting the whole system?
>> Thanks in advance!
>>
>> P.S. I'm ready to experiment with it before tonight, but right now I cannot
>> force the system to crash in order to get a crash dump.
>>
>
> It's probably too late now, but are you sure that nobody pressed
> CTRL-Z while in the mpd console?
>
> CTRL-Z will send SIGSTOP to the process, and the process will
> stop. While stopped, all processing stops (including receiving
> SIGKILL: you cannot kill it, and the signals are queued). You
> have to send SIGCONT for the process to continue.

Alexander and I were discussing this problem on another (Russian/Ukrainian-
speaking) mailing list, and the problem appears to be the following.

The mpd5 thread was detaching an ng interface, and while doing
flowtable_flush() it went to sleep in cv_wait(), waiting for the
flowclean_cycles variable to be updated. It should have been awakened by the
flowcleaner thread, but that thread got stuck in an endless loop, supposedly
in flowtable_clean_vnet()/flowtable_free_stale(), I think because some lists
(iface?) were left in an inconsistent state while if_detach was in progress.
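
To make the suspected mechanism concrete, below is a minimal userland sketch
of that handshake, written with pthreads purely for illustration (the names
mirror the kernel symbols, but this is my assumption about the logic, not the
actual sys/net/flowtable.c code). The flushing thread records the current
value of the generation counter and waits on the condition variable until the
cleaner completes a full pass and bumps it; if the cleaner never completes a
pass, the flusher sleeps forever:

#include <pthread.h>
#include <stdint.h>

/* Illustrative userland analogue of the flowclean handshake. */
static pthread_mutex_t flowclean_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flowclean_cv   = PTHREAD_COND_INITIALIZER;
static uint64_t        flowclean_cycles;  /* bumped after every full pass */

/* Hypothetical stand-in for one full cleaning pass over the tables. */
static void
clean_all_tables(void)
{
        /* walk the flowtables and free stale entries */
}

/* Roughly what the detaching thread does via flowtable_flush(). */
static void
flush_and_wait(void)
{
        pthread_mutex_lock(&flowclean_lock);
        uint64_t seen = flowclean_cycles;
        while (flowclean_cycles == seen) {
                /* Sleep until the cleaner announces a finished pass.
                 * If the cleaner spins inside its pass and never bumps
                 * the counter, we sleep here forever, matching the
                 * _cv_wait frame in the procstat output above. */
                pthread_cond_wait(&flowclean_cv, &flowclean_lock);
        }
        pthread_mutex_unlock(&flowclean_lock);
}

/* Roughly what the flowcleaner thread should do on each iteration. */
static void
cleaner_iteration(void)
{
        clean_all_tables();     /* if this never returns, no wakeup ever */
        pthread_mutex_lock(&flowclean_lock);
        flowclean_cycles++;
        pthread_cond_broadcast(&flowclean_cv);
        pthread_mutex_unlock(&flowclean_lock);
}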

BTW, the problem repeated after a reboot.

Below is some debugging info we have managed to get so far.

So, here are the stack traces of the threads of the stuck mpd5 process. The
first thread is sleeping in cv_wait() inside if_detach(); the second is in
exit1()/thread_single() (apparently the SIGKILL being processed), waiting for
the first thread to suspend, which it never does, and that is why the process
cannot be killed:

  PID    TID COMM             TDNAME           KSTACK                       
73551 100233 mpd5             -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 flowtable_flush+0x51 if_detach+0x2f2 ng_iface_shutdown+0x1e ng_rmnode+0x167 ng_apply_item+0xef7 ng_snd_item+0x2ce ngc_send+0x1d2 sosend_generic+0x3f6 kern_sendit+0x13d sendit+0xdc sendto+0x4d syscall+0x1da Xfast_syscall+0xe1 
73551 100502 mpd5             -                mi_switch+0x16f thread_suspend_switch+0xc6 thread_single+0x1b6 exit1+0x72 sigexit+0x7c postsig+0x306 ast+0x279 doreti_ast+0x1f 

procstat -k for flowcleaner:

  PID    TID COMM             TDNAME           KSTACK                       
   19 100088 flowcleaner      -                <running>                    

# top -S -H 10000 |grep 'flowcleaner\|73551'
   19 root        44    -     0K    16K CPU2    2  23.8H  0.00% flowcleaner
73551 root        44    0 29588K  5692K STOP    6   0:32  0.00% {mpd5}
73551 root        44    0 29588K  5692K STOP    6   0:01  0.00% {mpd5}

(kgdb) print flowclean_cycles
$4 = 563

And several hours later:

# top -S -H 10000 | grep 'flowcleaner\|73551'
   19 root        44    -     0K    16K CPU2    2  28.6H  0.00% flowcleaner
73551 root        44    0 29588K  5692K STOP    6   0:32  0.00% {mpd5}
73551 root        44    0 29588K  5692K STOP    6   0:01  0.00% {mpd5}

(kgdb) print flowclean_cycles
$1 = 563
 
So, flowcleaner was in an endless loop: it accumulated almost five hours of
CPU time between the two samples while flowclean_cycles did not change, so it
was spinning somewhere in flowtable_clean_vnet()/flowtable_free_stale()
rather than sleeping.
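
Why could such a sweep spin? One plausible scenario, which I have not
verified against the actual flowtable_free_stale() code and offer only as an
illustration (the struct below is a toy), is a linked list mutated underneath
the walker during if_detach so that the traversal never reaches the end:

#include <stdio.h>

struct flentry {                /* toy stand-in for a flowtable entry */
        struct flentry *next;
};

int
main(void)
{
        struct flentry a, b;
        a.next = &b;
        b.next = &a;            /* corruption: a cycle instead of NULL */

        /* A naive stale-entry sweep of the form
         *     for (fle = head; fle != NULL; fle = fle->next)
         * never terminates on such a list; bound it to demonstrate. */
        int steps = 0;
        for (struct flentry *fle = &a; fle != NULL && steps < 1000; fle = fle->next)
                steps++;
        printf("%d steps without reaching the end of a 2-entry list\n", steps);
        return 0;
}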

The mpd5 thread was in cv_wait() (according to procstat -k), holding
flowclean_lock:

(kgdb) print flowclean_lock
$1 = {lock_object = {lo_name = 0xffffffff804d4f3f "flowclean lock", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock = 18446742977425993728}
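
As I understand it, on FreeBSD the mtx_lock word of an owned sleep mutex is
the owning thread pointer with the two low bits reserved for flags
(contested/recursed), which is why masking them off recovers the owner. As a
worked example, decoding the value above in userland:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        /* the mtx_lock value printed by kgdb above */
        uint64_t mtx_lock = 18446742977425993728ULL;

        /* clear the low flag bits to get the owning struct thread pointer */
        printf("owner = 0x%" PRIx64 "\n", mtx_lock & ~(uint64_t)0x3);
        /* prints: owner = 0xffffff00c0687000 */
        return 0;
}

That owner is the struct thread examined by the next kgdb command: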

(kgdb) print *(struct thread *) (flowclean_lock->mtx_lock & ~0x3)
$5 = {td_lock = 0xffffffff8068bec0, td_proc = 0xffffff0012c07000, td_plist = {tqe_next = 0x0, tqe_prev = 0xffffff00c06483a0}, td_runq = {tqe_next = 0x0, 
    tqe_prev = 0xffffffff8068c128}, td_slpq = {tqe_next = 0x0, tqe_prev = 0xffffff00c0681240}, td_lockq = {tqe_next = 0x0, tqe_prev = 0xffffff80eba7f820}, 
  td_cpuset = 0xffffff0001830dc8, td_sel = 0xffffff0031639400, td_sleepqueue = 0xffffff00c0681240, td_turnstile = 0xffffff00c0677900, td_umtxq = 0xffffff0031aea300, 
  td_tid = 100408, td_sigqueue = {sq_signals = {__bits = {0, 0, 0, 0}}, sq_kill = {__bits = {0, 0, 0, 0}}, sq_list = {tqh_first = 0x0, tqh_last = 0xffffff00c06870a0}, 
    sq_proc = 0xffffff0012c07000, sq_flags = 1}, td_flags = 34820, td_inhibitors = 0, td_pflags = 0, td_dupfd = 0, td_sqqueue = 0, td_wchan = 0x0, td_wmesg = 0x0, 
  td_lastcpu = 0 '\0', td_oncpu = 0 '\0', td_owepreempt = 0 '\0', td_tsqueue = 0 '\0', td_locks = 30, td_rw_rlocks = 0, td_lk_slocks = 0, td_blocked = 0x0, 
  td_lockname = 0x0, td_contested = {lh_first = 0x0}, td_sleeplocks = 0x0, td_intr_nesting_level = 0, td_pinned = 0, td_ucred = 0xffffff01d5ea7500, td_estcpu = 0, 
  td_slptick = 0, td_ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, ru_maxrss = 5484, ru_ixrss = 22943676, ru_idrss = 75964776, 
    ru_isrss = 12339456, ru_minflt = 300, ru_majflt = 0, ru_nswap = 0, ru_inblock = 3, ru_oublock = 0, ru_msgsnd = 140, ru_msgrcv = 19, ru_nsignals = 0, 
    ru_nvcsw = 87104012, ru_nivcsw = 5062}, td_incruntime = 273488034786, td_runtime = 850522426220, td_pticks = 48190, td_sticks = 15544, td_iticks = 0, 
  td_uticks = 0, td_intrval = 0, td_oldsigmask = {__bits = {0, 0, 0, 0}}, td_sigmask = {__bits = {0, 0, 0, 0}}, td_generation = 87109074, td_sigstk = {ss_sp = 0x0, 
    ss_size = 0, ss_flags = 0}, td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0, td_name = "mpd5", '\0' <repeats 15 times>, td_fpop = 0x0, td_dbgflags = 0, 
  td_ng_outbound = 0, td_osd = {osd_nslots = 0, osd_slots = 0x0, osd_next = {le_next = 0x0, le_prev = 0x0}}, td_rqindex = 34 '"', td_base_pri = 136 '\210', 
  td_priority = 136 '\210', td_pri_class = 3 '\003', td_user_pri = 181 '\265', td_base_user_pri = 181 '\265', td_pcb = 0xffffff80ebb97d40, td_state = TDS_RUNNING, 
  td_retval = {0, 56}, td_slpcallout = {c_links = {sle = {sle_next = 0xffffffff806ac720}, tqe = {tqe_next = 0xffffffff806ac720, tqe_prev = 0xffffff80a5610f90}}, 
    c_time = 805103, c_arg = 0xffffff00c0687000, c_func = 0xffffffff80311050 <sleepq_timeout>, c_lock = 0x0, c_flags = 18, c_cpu = 0}, td_frame = 0xffffff80ebb97c80, 
  td_kstack_obj = 0xffffff00c04fbca8, td_kstack = 18446743527908524032, td_kstack_pages = 4, td_unused1 = 0x0, td_unused2 = 0, td_unused3 = 0, td_critnest = 1, 
  td_md = {md_spinlock_count = 1, md_saved_flags = 582}, td_sched = 0xffffff00c0687368, td_ar = 0x0, td_syscalls = 566, td_lprof = {{lh_first = 0x0}, {
      lh_first = 0x0}}, td_dtrace = 0x0, td_errno = 0, td_vnet = 0x0, td_vnet_lpush = 0x0}

(kgdb) print *((struct thread *) (flowclean_lock->mtx_lock & ~0x3))->td_proc
$6 = {p_list = {le_next = 0xffffff006eb2d000, le_prev = 0xffffff0012c06000}, p_threads = {tqh_first = 0xffffff00c0648390, tqh_last = 0xffffff00c0687010}, p_slock = {
    lock_object = {lo_name = 0xffffffff804c8a20 "process slock", lo_flags = 720896, lo_data = 0, lo_witness = 0x0}, mtx_lock = 4}, p_ucred = 0xffffff01d5ea7500, 
  p_fd = 0xffffff0012f96a00, p_fdtol = 0x0, p_stats = 0xffffff00128ab200, p_limit = 0xffffff0012ab6200, p_limco = {c_links = {sle = {sle_next = 0x0}, tqe = {
        tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_arg = 0x0, c_func = 0, c_lock = 0xffffff0012c070f8, c_flags = 0, c_cpu = 0}, p_sigacts = 0xffffff0004c09000, 
  p_flag = 268960896, p_state = PRS_NORMAL, p_pid = 67727, p_hash = {le_next = 0x0, le_prev = 0xffffff800021e478}, p_pglist = {le_next = 0x0, 
    le_prev = 0xffffff003163a110}, p_pptr = 0xffffff000183a000, p_sibling = {le_next = 0xffffff0004c54000, le_prev = 0xffffff000183a0f0}, p_children = {
    lh_first = 0x0}, p_mtx = {lock_object = {lo_name = 0xffffffff804c8a13 "process lock", lo_flags = 21168128, lo_data = 0, lo_witness = 0x0}, mtx_lock = 4}, 
  p_ksi = 0xffffff0004af5380, p_sigqueue = {sq_signals = {__bits = {0, 0, 0, 0}}, sq_kill = {__bits = {0, 0, 0, 0}}, sq_list = {tqh_first = 0x0, 
      tqh_last = 0xffffff0012c07140}, sq_proc = 0xffffff0012c07000, sq_flags = 1}, p_oppid = 0, p_vmspace = 0xffffff0012960dc8, p_swtick = 797875, p_realtimer = {
    it_interval = {tv_sec = 0, tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}, p_ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, 
      tv_usec = 0}, ru_maxrss = 0, ru_ixrss = 0, ru_idrss = 0, ru_isrss = 0, ru_minflt = 168, ru_majflt = 0, ru_nswap = 0, ru_inblock = 4, ru_oublock = 0, 
    ru_msgsnd = 47, ru_msgrcv = 6, ru_nsignals = 0, ru_nvcsw = 19, ru_nivcsw = 4}, p_rux = {rux_runtime = 577829035842, rux_uticks = 1, rux_sticks = 32657, 
    rux_iticks = 0, rux_uu = 23270, rux_su = 231217022, rux_tu = 231224103}, p_crux = {rux_runtime = 3633037, rux_uticks = 0, rux_sticks = 1, rux_iticks = 0, 
    rux_uu = 0, rux_su = 1453, rux_tu = 1453}, p_profthreads = 0, p_exitthreads = 0, p_traceflag = 0, p_tracevp = 0x0, p_tracecred = 0x0, 
  p_textvp = 0xffffff01ff63c938, p_lock = 0, p_sigiolst = {slh_first = 0x0}, p_sigparent = 20, p_sig = 0, p_code = 0, p_stops = 0, p_stype = 0, p_step = 0 '\0', 
  p_pfsflags = 0 '\0', p_nlminfo = 0x0, p_aioinfo = 0x0, p_singlethread = 0xffffff00c0648390, p_suspcount = 1, p_xthread = 0x0, p_boundary_count = 0, 
  p_pendingcnt = 0, p_itimers = 0x0, p_magic = 3203398350, p_osrel = 800107, p_comm = "mpd5", '\0' <repeats 15 times>, p_pgrp = 0xffffff003163a100, 
  p_sysent = 0xffffffff8064bfc0, p_args = 0xffffff0004b629c0, p_cpulimit = 9223372036854775807, p_nice = 0 '\0', p_fibnum = 0, p_xstat = 0, p_klist = {kl_list = {
      slh_first = 0x0}, kl_lock = 0xffffffff802aaac0 <knlist_mtx_lock>, kl_unlock = 0xffffffff802aaa90 <knlist_mtx_unlock>, 
    kl_assert_locked = 0xffffffff802a7de0 <knlist_mtx_assert_locked>, kl_assert_unlocked = 0xffffffff802a7df0 <knlist_mtx_assert_unlocked>, 
    kl_lockarg = 0xffffff0012c070f8}, p_numthreads = 2, p_md = {md_ldt = 0x0, md_ldt_sd = {sd_lolimit = 0, sd_lobase = 0, sd_type = 0, sd_dpl = 0, sd_p = 0, 
      sd_hilimit = 0, sd_xx0 = 0, sd_gran = 0, sd_hibase = 0, sd_xx1 = 0, sd_mbz = 0, sd_xx2 = 0}}, p_itcallout = {c_links = {sle = {sle_next = 0x0}, tqe = {
        tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_arg = 0x0, c_func = 0, c_lock = 0x0, c_flags = 16, c_cpu = 0}, p_acflag = 17, p_peers = 0x0, 
  p_leader = 0xffffff0012c07000, p_emuldata = 0x0, p_label = 0x0, p_sched = 0xffffff0012c07460, p_ktr = {stqh_first = 0x0, stqh_last = 0xffffff0012c07430}, 
  p_mqnotifier = {lh_first = 0x0}, p_dtrace = 0x0, p_pwait = {cv_description = 0xffffffff804c919f "ppwait", cv_waiters = 0}}

Unfortunately, there is no stack trace for flowcleaner. I have asked
Alexander to make the kernel panic on the next reboot and to provide a
backtrace of the flowcleaner thread from the crash dump, but I don't know if
he has managed to do this (this is a production host, which complicates
things).


-- 
Mikolaj Golub

