kern/94261: [ufs2] deadlock between ffs_mksnap and msync

Steve Watt steve+fbsd-gnats at Watt.COM
Wed Mar 8 23:20:05 PST 2006


>Number:         94261
>Category:       kern
>Synopsis:       [ufs2] deadlock between ffs_mksnap and msync
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Mar 09 07:20:03 GMT 2006
>Closed-Date:
>Last-Modified:
>Originator:     Steve Watt
>Release:        FreeBSD 5.5-PRERELEASE i386
>Organization:
Watt Consultants
>Environment:
System: FreeBSD wattres.Watt.COM 5.5-PRERELEASE FreeBSD 5.5-PRERELEASE #11: Wed Mar 1 10:54:09 PST 2006 root at wattres.Watt.COM:/usr/obj/usr/src/sys/WATTRES i386


	
>Description:

Per a thread on -hackers back in May '05 titled "snapshots and innds",
I'm finally getting around (sorry) to filing the PR.

I got a kernel core dump that I can do useful stuff with.  The system
was mostly operating normally, except that any attempt to access the
/news partition (which has articles, tradspool.map, overviews, and
incoming/outgoing data) would get stuck in "suspfs".

So I forced a dump from ddb.  The mount point does (as one would
expect) have MNTK_SUSPEND set.
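
(That can be confirmed from the dump with a kgdb expression along
these lines; the mount address is the mp= value that shows up in the
traces below, and MNTK_SUSPEND lives in mnt_kern_flag.)

(kgdb) print/x ((struct mount *)0xc15e5c00)->mnt_kern_flag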

I see mksnap_ffs sitting waiting for "ufs" (really vnode 0xc19af318),
which it got to via:

(kgdb) info stack
#0  sched_switch (td=0xc1ede780, newtd=0xc146f480, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc19af3c4, mtx=0xc095e4cc, priority=80, 
    wmesg=0xc08a3f13 "ufs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06505d6 in acquire (lkpp=0xd02df680, extflags=16777280, wanted=1536)
    at /usr/src/sys/kern/kern_lock.c:161
#6  0xc0650a14 in lockmgr (lkp=0xc19af3c4, flags=16842754, interlkp=0x0, td=0xc1ede780)
    at /usr/src/sys/kern/kern_lock.c:389
#7  0xc07bd6e3 in ufs_lock (ap=0xd02df6bc) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2007
#8  0xc07be380 in ufs_vnoperate (ap=0x0) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2828
#9  0xc06c0501 in vn_lock (vp=0xc19af318, flags=65538, td=0xc1ede780) at vnode_if.h:1013
#10 0xc06b4195 in vget (vp=0xc19af318, flags=65538, td=0x0)
    at /usr/src/sys/kern/vfs_subr.c:2028
#11 0xc07af408 in ffs_sync (mp=0xc15e5c00, waitfor=1, cred=0xc2953080, td=0xc1ede780)
    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1151
#12 0xc06c0840 in vfs_write_suspend (mp=0xc15e5c00) at /usr/src/sys/kern/vfs_vnops.c:1084
#13 0xc079db18 in ffs_snapshot (mp=0xc15e5c00, 
    snapfile=0xbfbfef1b <Address 0xbfbfef1b out of bounds>)
    at /usr/src/sys/ufs/ffs/ffs_snapshot.c:317
#14 0xc07ad5d8 in ffs_omount (mp=0xc15e5c00, path=0xc2a8c380 "/news", data=0x0, 
    td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:313
#15 0xc06af787 in vfs_domount (td=0xc1ede780, fstype=0xc1eea730 "ffs", 
    fspath=0xc2a8c380 "/news", fsflags=18944000, fsdata=0xbfbfe7d4, compat=1)
    at /usr/src/sys/kern/vfs_mount.c:861
#16 0xc06aef16 in mount (td=0x0, uap=0xd02dfd04) at /usr/src/sys/kern/vfs_mount.c:620
#17 0xc0828553 in syscall (frame=
   [ snip ]

And inn is sitting waiting for the suspended filesystem:
(kgdb) info stack
#0  sched_switch (td=0xc1c16c00, newtd=0xc1ede780, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc15e5c6c, mtx=0x0, priority=159, 
    wmesg=0xc08a5c44 "suspfs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06c06b8 in vn_start_write (vp=0x0, mpp=0xc15e5c00, flags=1)
    at /usr/src/sys/kern/vfs_vnops.c:995
#6  0xc07daeb4 in vnode_pager_putpages (object=0xc1990b58, m=0x0, count=4096, sync=12, 
    rtvals=0x0) at /usr/src/sys/vm/vnode_pager.c:969
#7  0xc07d2f42 in vm_pageout_flush (mc=0xd022bac0, count=1, flags=12) at vm_pager.h:147
#8  0xc07cd84b in vm_object_page_collect_flush (object=0xc1990b58, p=0xc10ddc40, 
    curgeneration=1, pagerflags=12) at /usr/src/sys/vm/vm_object.c:958
#9  0xc07cd2a3 in vm_object_page_clean (object=0xc1990b58, start=628, end=Unhandled dwarf 
expression opcode 0x93
)
    at /usr/src/sys/vm/vm_object.c:761
#10 0xc07cda77 in vm_object_sync (object=0xc1990b58, offset=2572288, size=4096, 
    syncio=0, invalidate=0) at /usr/src/sys/vm/vm_object.c:1026
#11 0xc07c9267 in vm_map_sync (map=0xc147a384, start=676794368, end=676798464, syncio=0, 
    invalidate=0) at /usr/src/sys/vm/vm_map.c:2149
#12 0xc07cb4e1 in msync (td=0x0, uap=0x0) at /usr/src/sys/vm/vm_mmap.c:465
#13 0xc0828553 in syscall (frame=
   [ snip ]

(Interesting kgdb buglet on line 9 of that trace, by the way.)

Based on a bit more snooping, it's a straightforward deadlock.  The
ffs_sync() call (made on behalf of vfs_write_suspend() in the snapshot
path) is trying to lock the vnode at 0xc19af318.  That vnode is already
locked by innd, and innd is waiting for the suspended filesystem, so it
will never release the vnode lock.

[chasing down the owner of the vnode]

(kgdb) print vp
$7 = (struct vnode *) 0xc19af318
(kgdb) print *vp
$8 = {v_interlock = {mtx_object = {lo_class = 0xc08f945c, 
      lo_name = 0xc08a5792 "vnode interlock", lo_type = 0xc08a5792 "vnode interlock", 
      lo_flags = 196608, lo_list = {tqe_next = 0xc19af210, tqe_prev = 0xc19af430}, 
      lo_witness = 0x0}, mtx_lock = 4, mtx_recurse = 0}, v_iflag = 1536, v_usecount = 3, 
  v_numoutput = 0, v_vxthread = 0x0, v_holdcnt = 4, v_cleanblkhd = {tqh_first = 0x0, 
    tqh_last = 0xc19af350}, v_cleanblkroot = 0x0, v_cleanbufcnt = 0, v_dirtyblkhd = {
    tqh_first = 0xc5b36570, tqh_last = 0xc5bb2634}, v_dirtyblkroot = 0xc5b36570, 
  v_dirtybufcnt = 4, v_vflag = 8, v_writecount = 1, v_object = 0xc1990b58, 
  v_lastw = 152, v_cstart = 152, v_lasta = 2275200, v_clen = 7, v_un = {
    vu_mountedhere = 0x0, vu_socket = 0x0, vu_spec = {vu_cdev = 0x0, vu_specnext = {
        sle_next = 0x0}}, vu_fifoinfo = 0x0}, v_freelist = {tqe_next = 0x0, 
    tqe_prev = 0x0}, v_nmntvnodes = {tqe_next = 0xc1a67420, tqe_prev = 0xc19af4b0}, 
  v_synclist = {le_next = 0xc1901528, le_prev = 0xc19fd6c8}, v_type = VREG, 
  v_tag = 0xc08a3f13 "ufs", v_data = 0xc199108c, v_lock = {lk_interlock = 0xc095e4cc, 
    lk_flags = 18875456, lk_sharecount = 0, lk_waitcount = 1, lk_exclusivecount = 1, 
    lk_prio = 80, lk_wmesg = 0xc08a3f13 "ufs", lk_timo = 6, lk_lockholder = 0xc1c16c00, 
    lk_newlock = 0x0}, v_vnlock = 0xc19af3c4, v_op = 0xc160c500, v_mount = 0xc15e5c00, 
  v_cache_src = {lh_first = 0x0}, v_cache_dst = {tqh_first = 0xc1986e9c, 
    tqh_last = 0xc1986eac}, v_id = 1413, v_dd = 0xc19af318, v_ddid = 0, 
  v_pollinfo = 0x0, v_label = 0x0, v_cachedfs = 1041, v_cachedid = 141320, 
  v_bsize = 16384}
(kgdb) print vp->v_lock.lk_lockholder
$9 = (struct thread *) 0xc1c16c00
(kgdb) print *vp->v_lock.lk_lockholder
$10 = {td_proc = 0xc1c4e1c4, td_ksegrp = 0xc1a1d5b0, td_plist = {tqe_next = 0x0, 
    tqe_prev = 0xc1c4e1d4}, td_kglist = {tqe_next = 0x0, tqe_prev = 0xc1a1d5bc}, 
  td_slpq = {tqe_next = 0xc2b55c00, tqe_prev = 0xc18b1b00}, td_lockq = {tqe_next = 0x0, 
    tqe_prev = 0x0}, td_runq = {tqe_next = 0x0, tqe_prev = 0xc08fb5ac}, td_selq = {
    tqh_first = 0x0, tqh_last = 0xc1c16c30}, td_sleepqueue = 0x0, 
  td_turnstile = 0xc1824a80, td_tid = 100188, td_flags = 0, td_inhibitors = 2, 
  td_pflags = 0, td_dupfd = 0, td_wchan = 0xc15e5c6c, td_wmesg = 0xc08a5c44 "suspfs", 
  td_lastcpu = 0 '\0', td_oncpu = 255 'ÿ', td_locks = 5360, td_blocked = 0x0, 
  td_ithd = 0x0, td_lockname = 0x0, td_contested = {lh_first = 0x0}, 
  td_sleeplocks = 0x0, td_intr_nesting_level = 0, td_pinned = 0, td_mailbox = 0x0, 
  td_ucred = 0xc241cb00, td_standin = 0x0, td_prticks = 0, td_upcall = 0x0, 
  td_sticks = 42112, td_uuticks = 0, td_usticks = 0, td_intrval = 0, td_oldsigmask = {
    __bits = {0, 0, 0, 0}}, td_sigmask = {__bits = {0, 0, 0, 0}}, td_siglist = {
    __bits = {0, 0, 0, 0}}, td_umtx = {tqe_next = 0x0, tqe_prev = 0x0}, 
  td_generation = 2863987, td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 4}, 
  td_kflags = 0, td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0, 
  td_base_pri = 159 '\237', td_priority = 159 '\237', td_pcb = 0xd022bd90, 
  td_state = TDS_INHIBITED, td_retval = {0, 1}, td_slpcallout = {c_links = {sle = {
        sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xc5ad0718}}, 
    c_time = 112874447, c_arg = 0xc1c16c00, c_func = 0, c_flags = 8}, 
  td_frame = 0xd022bd38, td_kstack_obj = 0xc1b0d108, td_kstack = 3491930112, 
  td_kstack_pages = 2, td_altkstack_obj = 0x0, td_altkstack = 0, td_altkstack_pages = 0, 
  td_critnest = 1, td_md = {md_savecrit = 582}, td_sched = 0xc1c16d50}
(kgdb) print *vp->v_lock.lk_lockholder->td_proc 
$11 = {p_list = {le_next = 0xc1c1354c, le_prev = 0xc195e8d4}, p_ksegrps = {
    tqh_first = 0xc1a1d5b0, tqh_last = 0xc1a1d5b4}, p_threads = {tqh_first = 0xc1c16c00, 
    tqh_last = 0xc1c16c08}, p_suspended = {tqh_first = 0x0, tqh_last = 0xc1c4e1dc}, 
  p_ucred = 0xc241cb00, p_fd = 0xc1fe3300, p_fdtol = 0x0, p_stats = 0xc1fd9500, 
  p_limit = 0xc1fd7700, p_unused1 = 0x0, p_sigacts = 0xc1a85000, p_flag = 1, 
  p_sflag = 1, p_state = PRS_NORMAL, p_pid = 97648, p_hash = {le_next = 0x0, 
    le_prev = 0xc2cd3ca8}, p_pglist = {le_next = 0xc1c4e54c, le_prev = 0xc2a00c48}, 
  p_pptr = 0xc146ee20, p_sibling = {le_next = 0xc1c1354c, le_prev = 0xc195e934}, 
  p_children = {lh_first = 0xc1c4e54c}, p_mtx = {mtx_object = {lo_class = 0xc08f945c, 
      lo_name = 0xc089e844 "process lock", lo_type = 0xc089e844 "process lock", 
      lo_flags = 4390912, lo_list = {tqe_next = 0xc1c4e3f4, tqe_prev = 0xc1c4e07c}, 
      lo_witness = 0x0}, mtx_lock = 4, mtx_recurse = 0}, p_oppid = 0, 
  p_vmspace = 0xc147a384, p_swtime = 10816, p_realtimer = {it_interval = {tv_sec = 0, 
      tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}, p_runtime = {sec = 457, 
    frac = 14633034536159281822}, p_uu = 128520517, p_su = 329272740, p_iu = 1, 
  p_uticks = 16437, p_sticks = 42112, p_iticks = 0, p_profthreads = 0, 
  p_maxthrwaits = 0, p_traceflag = 0, p_tracevp = 0x0, p_tracecred = 0x0, 
  p_textvp = 0xc19aa210, p_siglist = {__bits = {0, 0, 0, 0}}, p_lock = 0 '\0', 
  p_sigiolst = {slh_first = 0x0}, p_sigparent = 20, p_sig = 0, p_code = 0, p_stops = 0, 
  p_stype = 0, p_step = 0 '\0', p_pfsflags = 0 '\0', p_nlminfo = 0x0, p_aioinfo = 0x0, 
  p_singlethread = 0x0, p_suspcount = 0, p_xthread = 0x0, p_boundary_count = 0, 
  p_magic = 3203398350, p_comm = "innd\000tart\000ic\000\000\000\000\000\000\000", 
  p_pgrp = 0xc2a00c40, p_sysent = 0xc093a320, p_args = 0xc2a09380, 
  p_cpulimit = 9223372036854775807, p_nice = 0 '\0', p_xstat = 0, p_klist = {
    kl_lock = 0xc1c4e230, kl_list = {slh_first = 0x0}}, p_numthreads = 1, 
  p_numksegrps = 1, p_md = {md_ldt = 0xc1c569a0}, p_itcallout = {c_links = {sle = {
        sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, 
    c_arg = 0x0, c_func = 0, c_flags = 8}, p_unused2 = 0x0, p_acflag = 1, p_ru = 0x0, 
  p_peers = 0x0, p_leader = 0xc1c4e1c4, p_emuldata = 0x0, p_label = 0x0, 
  p_sched = 0xc1c4e388}

OK, so why is innd holding a locked vnode while it sleeps waiting for
the suspension to finish?  That's pretty clear -- vm_object_sync(), at
line 1022, does a vn_lock() on exactly the vnode in question
(0xc19af318) before the code ever reaches the vn_start_write() down in
vnode_pager_putpages().
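
To make the ordering concrete, here is a heavily abbreviated
paraphrase of that path as I read the 5.x sources (the function names
are real, the elisions are mine):

/* Paraphrase only; abbreviated from vm_object.c / vnode_pager.c. */

/* vm_object_sync(), around the line cited above: */
vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);  /* vnode lock taken */
vm_object_page_clean(object, ...);                /* pushes dirty pages */
VOP_UNLOCK(vp, 0, curthread);

/* ... the page clean eventually reaches vnode_pager_putpages(): */
(void) vn_start_write(vp, &mp, V_WAIT);   /* sleeps in "suspfs" while the
                                             vnode lock above is still held */
VOP_PUTPAGES(vp, ...);
vn_finished_write(mp);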

Now for the really hard question:  Fixes.  The obvious (but not so
elegant) fix is for the snapshot creation code to run under Giant,
which will hold vm_object_sync out at the correct place.

But that's *adding* Giant, which seems like a bad idea.

Also, it seems like this bug should be easy to reproduce by having one
process do a lot of msync() on a mapped file while another creates
snapshots of the same filesystem.  I don't have a box that's scratch
enough to try running that on, though.
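
For what it's worth, here's a rough sketch of the kind of stress test
I have in mind (untested; the paths and the mksnap_ffs invocation are
just placeholders): one process dirties and msync()s a mapped file on
the target filesystem in a tight loop, while the parent repeatedly
creates and removes snapshots.

#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Placeholders -- adjust for the filesystem under test. */
#define DATAFILE "/news/msync-test.dat"
#define SNAPCMD  "mksnap_ffs /news /news/.snap/stress && rm -f /news/.snap/stress"

int
main(void)
{
        size_t len = 1 << 20;           /* 1 MB mapping */
        char *p;
        int fd;

        fd = open(DATAFILE, O_RDWR | O_CREAT, 0644);
        if (fd == -1 || ftruncate(fd, (off_t)len) == -1)
                err(1, "setting up %s", DATAFILE);
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err(1, "mmap");

        if (fork() == 0) {
                /* Child: dirty the pages and msync() them forever. */
                for (;;) {
                        memset(p, arc4random() & 0xff, len);
                        if (msync(p, len, MS_SYNC) == -1)
                                warn("msync");
                }
        }
        /* Parent: create and destroy snapshots in a loop. */
        for (;;) {
                if (system(SNAPCMD) != 0)
                        warnx("snapshot command failed");
                sleep(1);
        }
}

The SNAPCMD string is a guess at the right mksnap_ffs syntax for this
box; the point is only the concurrency pattern.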


>How-To-Repeat:

This is somewhat difficult to repeat.  Running innd on a system that
gets backed up daily causes the deadlock roughly once every few weeks.

>Fix:

Kirk wrote:
Excellent detective work on your part. The invariant that is being
broken here is that you are never supposed to hold a vnode locked
when you call vn_start_write. The call to vn_start_write should
be done in vm_object_sync before acquiring the vnode lock rather
than later in vnode_pager_putpages. Of course, moving the
vn_start_write out of vnode_pager_putpages means that we have to
track down every other caller of vnode_pager_putpages to make sure
that they have also done the vn_start_write call as well.

Jeff Roberson has come up with a much cleaner way of dealing with
the suspension code that I believe he is using in the -current tree.
It puts a hook in the ufs_lock code that tracks the number of locks
held in each filesystem. To do a suspend, it blocks all new lock
requests on that filesystem by any thread that does not already
hold a lock and waits for all the existing locks to be released.
This obviates the need for the vn_start_write calls sprinkled all
through the system. I have copied Jeff on this email so that he
can comment further on this issue as he is much more up to speed
on it at the moment than I am.
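
To make the first suggestion concrete, the reordering in
vm_object_sync() would presumably look something like this (a rough
sketch against the 5.x interfaces, not a tested patch; error handling
omitted):

/* Sketch of the suggested ordering only; not actual committed code. */
struct mount *mp;

/* Arrange write permission *before* taking the vnode lock. */
(void) vn_start_write(vp, &mp, V_WAIT);
vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);
vm_object_page_clean(object, ...);
VOP_UNLOCK(vp, 0, curthread);
vn_finished_write(mp);

/*
 * vnode_pager_putpages() would then drop its own vn_start_write()
 * call and trust that every caller has already done this, which is
 * where the "grep" comes in.
 */

And to illustrate the shape of the lock-counting scheme described
above (purely a userland analogy using pthreads, with made-up names;
it is not the actual ufs_lock code): a per-filesystem gate counts
outstanding vnode locks, and a suspender blocks new lock requests
while waiting for the count to drain.

#include <pthread.h>

/* Hypothetical per-mount gate; names are made up for illustration. */
struct susp_gate {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int             nlocks;      /* outstanding "vnode locks" */
        int             suspending;  /* a suspension is in progress */
};

#define SUSP_GATE_INITIALIZER \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

void
gate_lock_enter(struct susp_gate *g)
{
        pthread_mutex_lock(&g->mtx);
        while (g->suspending)           /* new lockers wait out the suspension */
                pthread_cond_wait(&g->cv, &g->mtx);
        g->nlocks++;
        pthread_mutex_unlock(&g->mtx);
}

void
gate_lock_exit(struct susp_gate *g)
{
        pthread_mutex_lock(&g->mtx);
        if (--g->nlocks == 0)
                pthread_cond_broadcast(&g->cv); /* wake a waiting suspender */
        pthread_mutex_unlock(&g->mtx);
}

void
gate_suspend(struct susp_gate *g)
{
        pthread_mutex_lock(&g->mtx);
        g->suspending = 1;              /* block new lock requests ... */
        while (g->nlocks > 0)           /* ... and drain the existing ones */
                pthread_cond_wait(&g->cv, &g->mtx);
        pthread_mutex_unlock(&g->mtx);
}

void
gate_resume(struct susp_gate *g)
{
        pthread_mutex_lock(&g->mtx);
        g->suspending = 0;
        pthread_cond_broadcast(&g->cv); /* let blocked lockers through */
        pthread_mutex_unlock(&g->mtx);
}

The real scheme, as described above, also exempts threads that already
hold a lock on that filesystem; the analogy skips that recursion case.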

Then I wrote:
I'm trying to decide which is a simpler hack-around -- do the Giant
thing, or pull the lock up a level and grep.  Opinions?

To which Kirk replied:
I would definitely advocate the "pull the lock up a level and grep"
option, as that is the "correct" solution. If for whatever reason
Jeff decides to stick with the vn_start_write approach, it will
be a solved problem rather than one that needs fixing. Plus we
really don't want to regress on Giant if we can help it.


Naturally, I haven't found the time to get my pull-up working, thanks
to a bad combination of job changes and the minor detail that the
vnode code is a bit ... shall we say ... complex?


>Release-Note:
>Audit-Trail:
>Unformatted:

