snapshots and innds

Steve Watt steve at Watt.COM
Sun May 22 21:02:41 GMT 2005


[ OK, there's a lot of text in here, but I have definitively found a
  deadlock between mksnap_ffs and msync(). ]

Waaay back on Dec 18, 17:39, I wrote:
} Subject: snapshots and innds
} I'm getting a strong hunch that snapshots and inn don't get along
} well, presumably having something to do with inn's extensive use
} of mmap().
} 
} Just for an example, my system panic()ed earlier today (different
} problem) and during the reboot, I'm stuck with an fsck_ufs on wchan
} "ufs" and innd on wchan "suspfs", and neither of them responding
} in any way.

And I have been seeing periodic hangs since December that all seem to
implicate innd (in msync()) contending with dump (in mksnap_ffs).

The system is 5.4-STABLE, updated last on the (PDT) morning of 2 May.

Finally, this morning, I got a kernel core dump that I can do useful
stuff with.  The system was mostly operating normally, except that
any attempt to access the /news partition (which has articles,
tradspool.map, overviews, and incoming/outgoing data) would get
stuck in "suspfs".

So I forced a dump from ddb.  The mount point does (as one would
expect) have MNTK_SUSPEND set.
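
For reference, this is roughly the shape of the gate that parks
writers (paraphrased from memory of the 5.x vn_start_write(); treat
the details as approximate, not a verbatim quote of vfs_vnops.c):

    /*
     * vn_start_write(), approximately: any thread entering the
     * filesystem as a writer sleeps on "suspfs" for as long as
     * the suspension flag is up.
     */
    while ((mp->mnt_kern_flag & MNTK_SUSPEND) != 0) {
            if (flags & V_NOWAIT)
                    return (EWOULDBLOCK);
            error = msleep(&mp->mnt_flag, NULL,
                (PUSER - 1) | (flags & PCATCH), "suspfs", 0);
            if (error)
                    return (error);
    }

That lines up with the innd trace below (wmesg "suspfs", priority
159), and explains why everything touching /news wedges until the
snapshot completes.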

I see mksnap_ffs sitting waiting for "ufs" (really vnode 0xc19af318),
which it got to via:

(kgdb) info stack
#0  sched_switch (td=0xc1ede780, newtd=0xc146f480, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc19af3c4, mtx=0xc095e4cc, priority=80, 
    wmesg=0xc08a3f13 "ufs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06505d6 in acquire (lkpp=0xd02df680, extflags=16777280, wanted=1536)
    at /usr/src/sys/kern/kern_lock.c:161
#6  0xc0650a14 in lockmgr (lkp=0xc19af3c4, flags=16842754, interlkp=0x0, td=0xc1ede780)
    at /usr/src/sys/kern/kern_lock.c:389
#7  0xc07bd6e3 in ufs_lock (ap=0xd02df6bc) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2007
#8  0xc07be380 in ufs_vnoperate (ap=0x0) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2828
#9  0xc06c0501 in vn_lock (vp=0xc19af318, flags=65538, td=0xc1ede780) at vnode_if.h:1013
#10 0xc06b4195 in vget (vp=0xc19af318, flags=65538, td=0x0)
    at /usr/src/sys/kern/vfs_subr.c:2028
#11 0xc07af408 in ffs_sync (mp=0xc15e5c00, waitfor=1, cred=0xc2953080, td=0xc1ede780)
    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1151
#12 0xc06c0840 in vfs_write_suspend (mp=0xc15e5c00) at /usr/src/sys/kern/vfs_vnops.c:1084
#13 0xc079db18 in ffs_snapshot (mp=0xc15e5c00, 
    snapfile=0xbfbfef1b <Address 0xbfbfef1b out of bounds>)
    at /usr/src/sys/ufs/ffs/ffs_snapshot.c:317
#14 0xc07ad5d8 in ffs_omount (mp=0xc15e5c00, path=0xc2a8c380 "/news", data=0x0, 
    td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:313
#15 0xc06af787 in vfs_domount (td=0xc1ede780, fstype=0xc1eea730 "ffs", 
    fspath=0xc2a8c380 "/news", fsflags=18944000, fsdata=0xbfbfe7d4, compat=1)
    at /usr/src/sys/kern/vfs_mount.c:861
#16 0xc06aef16 in mount (td=0x0, uap=0xd02dfd04) at /usr/src/sys/kern/vfs_mount.c:620
#17 0xc0828553 in syscall (frame=
   [ snip ]

And inn is sitting waiting for the suspended filesystem:
(kgdb) info stack
#0  sched_switch (td=0xc1c16c00, newtd=0xc1ede780, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc15e5c6c, mtx=0x0, priority=159, 
    wmesg=0xc08a5c44 "suspfs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06c06b8 in vn_start_write (vp=0x0, mpp=0xc15e5c00, flags=1)
    at /usr/src/sys/kern/vfs_vnops.c:995
#6  0xc07daeb4 in vnode_pager_putpages (object=0xc1990b58, m=0x0, count=4096, sync=12, 
    rtvals=0x0) at /usr/src/sys/vm/vnode_pager.c:969
#7  0xc07d2f42 in vm_pageout_flush (mc=0xd022bac0, count=1, flags=12) at vm_pager.h:147
#8  0xc07cd84b in vm_object_page_collect_flush (object=0xc1990b58, p=0xc10ddc40, 
    curgeneration=1, pagerflags=12) at /usr/src/sys/vm/vm_object.c:958
#9  0xc07cd2a3 in vm_object_page_clean (object=0xc1990b58, start=628, end=Unhandled dwarf expression opcode 0x93
)
    at /usr/src/sys/vm/vm_object.c:761
#10 0xc07cda77 in vm_object_sync (object=0xc1990b58, offset=2572288, size=4096, 
    syncio=0, invalidate=0) at /usr/src/sys/vm/vm_object.c:1026
#11 0xc07c9267 in vm_map_sync (map=0xc147a384, start=676794368, end=676798464, syncio=0, 
    invalidate=0) at /usr/src/sys/vm/vm_map.c:2149
#12 0xc07cb4e1 in msync (td=0x0, uap=0x0) at /usr/src/sys/vm/vm_mmap.c:465
#13 0xc0828553 in syscall (frame=
   [ snip ]

(Interesting kgdb buglet in frame #9 of that trace, by the way.)

Based on a bit more snooping, it's a classic deadlock.  The ffs_sync()
in the snapshot path is trying to lock the vnode at 0xc19af318.  That
vnode is already locked exclusively by innd, and innd is waiting on
the suspended filesystem, so it will never release the vnode.
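
Laid out side by side, the cycle looks like this (addresses from the
two traces above):

  dump/mksnap_ffs:  vfs_write_suspend(mp 0xc15e5c00)  [suspension flagged]
                      -> ffs_sync() -> vget()/vn_lock(vp 0xc19af318)
                         [sleeps on "ufs"]

  innd/msync:       vn_lock(vp 0xc19af318)  [holds it exclusively]
                      -> vnode_pager_putpages() -> vn_start_write(mp 0xc15e5c00)
                         [sleeps on "suspfs"]

Each thread holds exactly what the other is waiting for, so neither
can make progress.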

[ chasing down the owner of the vnode ]

(kgdb) print vp
$7 = (struct vnode *) 0xc19af318
(kgdb) print *vp
$8 = {v_interlock = {mtx_object = {lo_class = 0xc08f945c, 
      lo_name = 0xc08a5792 "vnode interlock", lo_type = 0xc08a5792 "vnode interlock", 
      lo_flags = 196608, lo_list = {tqe_next = 0xc19af210, tqe_prev = 0xc19af430}, 
      lo_witness = 0x0}, mtx_lock = 4, mtx_recurse = 0}, v_iflag = 1536, v_usecount = 3, 
  v_numoutput = 0, v_vxthread = 0x0, v_holdcnt = 4, v_cleanblkhd = {tqh_first = 0x0, 
    tqh_last = 0xc19af350}, v_cleanblkroot = 0x0, v_cleanbufcnt = 0, v_dirtyblkhd = {
    tqh_first = 0xc5b36570, tqh_last = 0xc5bb2634}, v_dirtyblkroot = 0xc5b36570, 
  v_dirtybufcnt = 4, v_vflag = 8, v_writecount = 1, v_object = 0xc1990b58, 
  v_lastw = 152, v_cstart = 152, v_lasta = 2275200, v_clen = 7, v_un = {
    vu_mountedhere = 0x0, vu_socket = 0x0, vu_spec = {vu_cdev = 0x0, vu_specnext = {
        sle_next = 0x0}}, vu_fifoinfo = 0x0}, v_freelist = {tqe_next = 0x0, 
    tqe_prev = 0x0}, v_nmntvnodes = {tqe_next = 0xc1a67420, tqe_prev = 0xc19af4b0}, 
  v_synclist = {le_next = 0xc1901528, le_prev = 0xc19fd6c8}, v_type = VREG, 
  v_tag = 0xc08a3f13 "ufs", v_data = 0xc199108c, v_lock = {lk_interlock = 0xc095e4cc, 
    lk_flags = 18875456, lk_sharecount = 0, lk_waitcount = 1, lk_exclusivecount = 1, 
    lk_prio = 80, lk_wmesg = 0xc08a3f13 "ufs", lk_timo = 6, lk_lockholder = 0xc1c16c00, 
    lk_newlock = 0x0}, v_vnlock = 0xc19af3c4, v_op = 0xc160c500, v_mount = 0xc15e5c00, 
  v_cache_src = {lh_first = 0x0}, v_cache_dst = {tqh_first = 0xc1986e9c, 
    tqh_last = 0xc1986eac}, v_id = 1413, v_dd = 0xc19af318, v_ddid = 0, 
  v_pollinfo = 0x0, v_label = 0x0, v_cachedfs = 1041, v_cachedid = 141320, 
  v_bsize = 16384}
(kgdb) print vp->v_lock.lk_lockholder
$9 = (struct thread *) 0xc1c16c00
(kgdb) print *vp->v_lock.lk_lockholder
$10 = {td_proc = 0xc1c4e1c4, td_ksegrp = 0xc1a1d5b0, td_plist = {tqe_next = 0x0, 
    tqe_prev = 0xc1c4e1d4}, td_kglist = {tqe_next = 0x0, tqe_prev = 0xc1a1d5bc}, 
  td_slpq = {tqe_next = 0xc2b55c00, tqe_prev = 0xc18b1b00}, td_lockq = {tqe_next = 0x0, 
    tqe_prev = 0x0}, td_runq = {tqe_next = 0x0, tqe_prev = 0xc08fb5ac}, td_selq = {
    tqh_first = 0x0, tqh_last = 0xc1c16c30}, td_sleepqueue = 0x0, 
  td_turnstile = 0xc1824a80, td_tid = 100188, td_flags = 0, td_inhibitors = 2, 
  td_pflags = 0, td_dupfd = 0, td_wchan = 0xc15e5c6c, td_wmesg = 0xc08a5c44 "suspfs", 
  td_lastcpu = 0 '\0', td_oncpu = 255 'ÿ', td_locks = 5360, td_blocked = 0x0, 
  td_ithd = 0x0, td_lockname = 0x0, td_contested = {lh_first = 0x0}, 
  td_sleeplocks = 0x0, td_intr_nesting_level = 0, td_pinned = 0, td_mailbox = 0x0, 
  td_ucred = 0xc241cb00, td_standin = 0x0, td_prticks = 0, td_upcall = 0x0, 
  td_sticks = 42112, td_uuticks = 0, td_usticks = 0, td_intrval = 0, td_oldsigmask = {
    __bits = {0, 0, 0, 0}}, td_sigmask = {__bits = {0, 0, 0, 0}}, td_siglist = {
    __bits = {0, 0, 0, 0}}, td_umtx = {tqe_next = 0x0, tqe_prev = 0x0}, 
  td_generation = 2863987, td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 4}, 
  td_kflags = 0, td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0, 
  td_base_pri = 159 '\237', td_priority = 159 '\237', td_pcb = 0xd022bd90, 
  td_state = TDS_INHIBITED, td_retval = {0, 1}, td_slpcallout = {c_links = {sle = {
        sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xc5ad0718}}, 
    c_time = 112874447, c_arg = 0xc1c16c00, c_func = 0, c_flags = 8}, 
  td_frame = 0xd022bd38, td_kstack_obj = 0xc1b0d108, td_kstack = 3491930112, 
  td_kstack_pages = 2, td_altkstack_obj = 0x0, td_altkstack = 0, td_altkstack_pages = 0, 
  td_critnest = 1, td_md = {md_savecrit = 582}, td_sched = 0xc1c16d50}
(kgdb) print *vp->v_lock.lk_lockholder->td_proc 
$11 = {p_list = {le_next = 0xc1c1354c, le_prev = 0xc195e8d4}, p_ksegrps = {
    tqh_first = 0xc1a1d5b0, tqh_last = 0xc1a1d5b4}, p_threads = {tqh_first = 0xc1c16c00, 
    tqh_last = 0xc1c16c08}, p_suspended = {tqh_first = 0x0, tqh_last = 0xc1c4e1dc}, 
  p_ucred = 0xc241cb00, p_fd = 0xc1fe3300, p_fdtol = 0x0, p_stats = 0xc1fd9500, 
  p_limit = 0xc1fd7700, p_unused1 = 0x0, p_sigacts = 0xc1a85000, p_flag = 1, 
  p_sflag = 1, p_state = PRS_NORMAL, p_pid = 97648, p_hash = {le_next = 0x0, 
    le_prev = 0xc2cd3ca8}, p_pglist = {le_next = 0xc1c4e54c, le_prev = 0xc2a00c48}, 
  p_pptr = 0xc146ee20, p_sibling = {le_next = 0xc1c1354c, le_prev = 0xc195e934}, 
  p_children = {lh_first = 0xc1c4e54c}, p_mtx = {mtx_object = {lo_class = 0xc08f945c, 
      lo_name = 0xc089e844 "process lock", lo_type = 0xc089e844 "process lock", 
      lo_flags = 4390912, lo_list = {tqe_next = 0xc1c4e3f4, tqe_prev = 0xc1c4e07c}, 
      lo_witness = 0x0}, mtx_lock = 4, mtx_recurse = 0}, p_oppid = 0, 
  p_vmspace = 0xc147a384, p_swtime = 10816, p_realtimer = {it_interval = {tv_sec = 0, 
      tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}, p_runtime = {sec = 457, 
    frac = 14633034536159281822}, p_uu = 128520517, p_su = 329272740, p_iu = 1, 
  p_uticks = 16437, p_sticks = 42112, p_iticks = 0, p_profthreads = 0, 
  p_maxthrwaits = 0, p_traceflag = 0, p_tracevp = 0x0, p_tracecred = 0x0, 
  p_textvp = 0xc19aa210, p_siglist = {__bits = {0, 0, 0, 0}}, p_lock = 0 '\0', 
  p_sigiolst = {slh_first = 0x0}, p_sigparent = 20, p_sig = 0, p_code = 0, p_stops = 0, 
  p_stype = 0, p_step = 0 '\0', p_pfsflags = 0 '\0', p_nlminfo = 0x0, p_aioinfo = 0x0, 
  p_singlethread = 0x0, p_suspcount = 0, p_xthread = 0x0, p_boundary_count = 0, 
  p_magic = 3203398350, p_comm = "innd\000tart\000ic\000\000\000\000\000\000\000", 
  p_pgrp = 0xc2a00c40, p_sysent = 0xc093a320, p_args = 0xc2a09380, 
  p_cpulimit = 9223372036854775807, p_nice = 0 '\0', p_xstat = 0, p_klist = {
    kl_lock = 0xc1c4e230, kl_list = {slh_first = 0x0}}, p_numthreads = 1, 
  p_numksegrps = 1, p_md = {md_ldt = 0xc1c569a0}, p_itcallout = {c_links = {sle = {
        sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, 
    c_arg = 0x0, c_func = 0, c_flags = 8}, p_unused2 = 0x0, p_acflag = 1, p_ru = 0x0, 
  p_peers = 0x0, p_leader = 0xc1c4e1c4, p_emuldata = 0x0, p_label = 0x0, 
  p_sched = 0xc1c4e388}

OK, so why is innd holding a locked vnode while it's sleeping on the
suspend?  That's pretty clear -- vm_object_sync() does a vn_lock() at
vm_object.c line 1022, and the vp there happens to be the vnode in
question (0xc19af318).
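
For context, the relevant stretch of vm_object_sync() looks roughly
like this (paraphrased from memory, not a verbatim quote of
vm_object.c):

    /*
     * vm_object_sync(), approximately: the vnode lock is taken
     * before the pageout starts, and nothing on this path calls
     * vn_start_write() first.  If a suspension begins in that
     * window, this thread holds the vnode lock while
     * vnode_pager_putpages() sleeps on "suspfs".
     */
    if (object->type == OBJT_VNODE &&
        (object->flags & OBJ_MIGHTBEDIRTY) != 0) {
            vp = object->handle;
            vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread); /* ~line 1022 */
            flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
            vm_object_page_clean(object, OFF_TO_IDX(offset),
                OFF_TO_IDX(offset + size + PAGE_MASK), flags);
            VOP_UNLOCK(vp, 0, curthread);
    }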

Now for the really hard question: fixes.  The obvious (but not so
elegant) fix is for the snapshot creation code to run under Giant,
which would hold vm_object_sync() out at the correct place.

But that's *adding* Giant, which seems like a bad idea.
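
A less heavyweight alternative (just a sketch of the idea, untested)
would be to have vm_object_sync() announce itself as a writer before
taking the vnode lock, so a pending suspension is honored while no
vnode lock is held:

    /*
     * Hypothetical reordering in vm_object_sync(): enter the
     * filesystem as a writer first, so we park on "suspfs"
     * before we own the vnode lock rather than after.
     */
    vp = object->handle;
    (void) vn_start_write(vp, &mp, V_WAIT);
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);
    /* ... vm_object_page_clean() as before ... */
    VOP_UNLOCK(vp, 0, curthread);
    vn_end_write(mp);

With that ordering, ffs_sync() can get the vnode, the snapshot
completes, and the msync() proceeds once the suspension is lifted.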

Also, it seems like this bug should be easy to reproduce by having one
process do a lot of msync() while another creates snapshots; a sketch
of such a test is below.  I don't have a box that's scratch enough to
try running it on, though.
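
For whoever does have a scratch box, something along these lines ought
to exercise it (untested sketch; the path is whatever filesystem you
snapshot):

    /*
     * msync-hammer.c: dirty and flush one page in a tight loop
     * while snapshots are created repeatedly (e.g. a shell loop
     * running mksnap_ffs, or dump -L) on the same filesystem.
     */
    #include <sys/mman.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fd;
            char *p;

            fd = open("/scratch/msync.test", O_RDWR | O_CREAT, 0644);
            if (fd < 0)
                    err(1, "open");
            if (ftruncate(fd, 1 << 20) < 0)
                    err(1, "ftruncate");
            p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");
            for (;;) {
                    p[0]++;                         /* dirty the first page */
                    if (msync(p, getpagesize(), MS_SYNC) < 0)
                            err(1, "msync");
            }
    }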

-- 
Steve Watt KD6GGD  PP-ASEL-IA          ICBM: 121W 56' 57.8" / 37N 20' 14.9"
 Internet: steve @ Watt.COM                         Whois: SW32
   Free time?  There's no such thing.  It just comes in varying prices...

