snapshots and innds

Kirk McKusick mckusick at mckusick.com
Mon May 23 05:17:48 GMT 2005


Excellent detective work on your part. The invariant that is being
broken here is that you are never supposed to hold a vnode locked
when you call vn_start_write. The call to vn_start_write should
be done in vm_object_sync before acquiring the vnode lock rather
than later in vnode_pager_putpages. Of course, moving the
vn_start_write out of vnode_pager_putpages means that we have to
track down every other caller of vnode_pager_putpages to make sure
that they have done the vn_start_write call as well.
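
Roughly, the reordering would look like this in vm_object_sync() (a
sketch against the 5.x-era interfaces, with object locking, pager
flags, and error handling omitted -- not an actual patch):

    if (object->type == OBJT_VNODE &&
        (object->flags & OBJ_MIGHTBEDIRTY) != 0) {
        struct vnode *vp = object->handle;
        struct mount *mp;
        int flags = (syncio || invalidate) ? OBJPC_SYNC : 0;

        /*
         * Take the write gate first, while no vnode lock is held,
         * so an in-progress suspension is noticed here rather than
         * deep inside vnode_pager_putpages().
         */
        (void) vn_start_write(vp, &mp, V_WAIT);
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, curthread);
        vm_object_page_clean(object, OFF_TO_IDX(offset),
            OFF_TO_IDX(offset + size + PAGE_MASK), flags);
        VOP_UNLOCK(vp, 0, curthread);
        vn_finished_write(mp);
    }

vnode_pager_putpages() would then rely on its callers having already
made the vn_start_write call.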

Jeff Roberson has come up with a much cleaner way of dealing with
the suspension code that I believe he is using in the -current tree.
It puts a hook in the ufs_lock code that tracks the number of locks
held in each filesystem. To do a suspend, it blocks all new lock
requests on that filesystem by any thread that does not already
hold a lock and waits for all the existing locks to be released.
This obviates the need for the vn_start_write calls sprinkled all
through the system. I have copied Jeff on this email so that he
can comment further on this issue as he is much more up to speed
on it at the moment than I am.
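
In rough outline, the idea is something like the following (the names
mnt_lockcnt, mnt_suspending, and td_fslocks are invented for this
sketch; this is just the shape of the approach, not Jeff's code):

    /*
     * Hypothetical per-filesystem lock accounting, maintained from
     * the filesystem's lock VOP.
     */
    static void
    fs_lock_enter(struct mount *mp, struct thread *td)
    {
        mtx_lock(&mp->mnt_mtx);
        /* New lockers wait out a suspension; existing holders proceed. */
        while (mp->mnt_suspending && td->td_fslocks == 0)
            msleep(&mp->mnt_suspending, &mp->mnt_mtx, PVFS, "susplk", 0);
        mp->mnt_lockcnt++;
        td->td_fslocks++;
        mtx_unlock(&mp->mnt_mtx);
    }

    static void
    fs_lock_exit(struct mount *mp, struct thread *td)
    {
        mtx_lock(&mp->mnt_mtx);
        td->td_fslocks--;
        if (--mp->mnt_lockcnt == 0)
            wakeup(&mp->mnt_lockcnt);    /* let a pending suspend finish */
        mtx_unlock(&mp->mnt_mtx);
    }

    static void
    fs_suspend(struct mount *mp)
    {
        mtx_lock(&mp->mnt_mtx);
        mp->mnt_suspending = 1;          /* block new first-time lockers */
        while (mp->mnt_lockcnt > 0)      /* drain everyone already in */
            msleep(&mp->mnt_lockcnt, &mp->mnt_mtx, PVFS, "drain", 0);
        mtx_unlock(&mp->mnt_mtx);
    }

Resuming would clear mnt_suspending and wakeup() the blocked lockers.
A thread in msync() either takes its vnode lock before the suspension
starts (and the suspend waits for it to finish), or blocks before it
holds anything, so it never sleeps on "suspfs" with a vnode locked.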

	Kirk McKusick

=-=-=-=-=-=-=

From: steve at Watt.COM (Steve Watt)
Date: Sun, 22 May 2005 14:02:39 -0700
In-Reply-To: steve at wattres.Watt.COM (Steve Watt)
       "snapshots and innds" (Dec 18, 17:39)
To: freebsd-hackers at freebsd.org
Subject: Re: snapshots and innds
Cc: mckusick at mckusick.com
X-Archived: 1116795759.948751269 at wattres.Watt.COM
X-ASK-Info: Whitelist match [from steve at watt.com] (2005/05/22 14:03:00)

[ OK, there's a lot of text in here, but I have definitively found a
  deadlock between ffs_mksnap and msync(). ]

Waaay back on Dec 18, 17:39, I wrote:
} Subject: snapshots and innds
} I'm getting a strong hunch that snapshots and inn don't get along
} well, presumably having something to do with inn's extensive use
} of mmap().
} 
} Just for an example, my system panic()ed earlier today (different
} problem) and during the reboot, I'm stuck with an fsck_ufs on wchan
} "ufs" and innd on wchan "suspfs", and neither of them responding
} in any way.

And I have been seeing hangs periodically since December that all
seem to implicate innd (in msync()) arguing with dump (via mksnap_ffs).

The system is 5.4-STABLE, updated last on the (PDT) morning of 2 May.

Finally, this morning, I got a kernel core dump that I can do useful
stuff with.  The system was mostly operating normally, except that
any attempt to access the /news partition (which has articles,
tradspool.map, overviews, and incoming/outgoing data) would get
stuck in "suspfs".

So I forced a dump from ddb.  The mount point does (as one would
expect) have MNTK_SUSPEND set.

I see mksnap_ffs sitting waiting for "ufs" (really vnode 0xc19af318),
which it got to via:

(kgdb) info stack
#0  sched_switch (td=0xc1ede780, newtd=0xc146f480, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc19af3c4, mtx=0xc095e4cc, priority=80, 
    wmesg=0xc08a3f13 "ufs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06505d6 in acquire (lkpp=0xd02df680, extflags=16777280, wanted=1536)
    at /usr/src/sys/kern/kern_lock.c:161
#6  0xc0650a14 in lockmgr (lkp=0xc19af3c4, flags=16842754, interlkp=0x0, td=0xc1ede780)
    at /usr/src/sys/kern/kern_lock.c:389
#7  0xc07bd6e3 in ufs_lock (ap=0xd02df6bc) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2007
#8  0xc07be380 in ufs_vnoperate (ap=0x0) at /usr/src/sys/ufs/ufs/ufs_vnops.c:2828
#9  0xc06c0501 in vn_lock (vp=0xc19af318, flags=65538, td=0xc1ede780) at vnode_if.h:1013
#10 0xc06b4195 in vget (vp=0xc19af318, flags=65538, td=0x0)
    at /usr/src/sys/kern/vfs_subr.c:2028
#11 0xc07af408 in ffs_sync (mp=0xc15e5c00, waitfor=1, cred=0xc2953080, td=0xc1ede780)
    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1151
#12 0xc06c0840 in vfs_write_suspend (mp=0xc15e5c00) at /usr/src/sys/kern/vfs_vnops.c:1084
#13 0xc079db18 in ffs_snapshot (mp=0xc15e5c00, 
    snapfile=0xbfbfef1b <Address 0xbfbfef1b out of bounds>)
    at /usr/src/sys/ufs/ffs/ffs_snapshot.c:317
#14 0xc07ad5d8 in ffs_omount (mp=0xc15e5c00, path=0xc2a8c380 "/news", data=0x0, 
    td=0xc1ede780) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:313
#15 0xc06af787 in vfs_domount (td=0xc1ede780, fstype=0xc1eea730 "ffs", 
    fspath=0xc2a8c380 "/news", fsflags=18944000, fsdata=0xbfbfe7d4, compat=1)
    at /usr/src/sys/kern/vfs_mount.c:861
#16 0xc06aef16 in mount (td=0x0, uap=0xd02dfd04) at /usr/src/sys/kern/vfs_mount.c:620
#17 0xc0828553 in syscall (frame=
   [ snip ]

And inn is sitting waiting for the suspended filesystem:
(kgdb) info stack
#0  sched_switch (td=0xc1c16c00, newtd=0xc1ede780, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:882
#1  0xc0662ad0 in mi_switch (flags=1, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:355
#2  0xc067a9e4 in sleepq_switch (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:406
#3  0xc067ab9e in sleepq_wait (wchan=0x0) at /usr/src/sys/kern/subr_sleepqueue.c:518
#4  0xc06627b6 in msleep (ident=0xc15e5c6c, mtx=0x0, priority=159, 
    wmesg=0xc08a5c44 "suspfs", timo=0) at /usr/src/sys/kern/kern_synch.c:228
#5  0xc06c06b8 in vn_start_write (vp=0x0, mpp=0xc15e5c00, flags=1)
    at /usr/src/sys/kern/vfs_vnops.c:995
#6  0xc07daeb4 in vnode_pager_putpages (object=0xc1990b58, m=0x0, count=4096, sync=12, 
    rtvals=0x0) at /usr/src/sys/vm/vnode_pager.c:969
#7  0xc07d2f42 in vm_pageout_flush (mc=0xd022bac0, count=1, flags=12) at vm_pager.h:147
#8  0xc07cd84b in vm_object_page_collect_flush (object=0xc1990b58, p=0xc10ddc40, 
    curgeneration=1, pagerflags=12) at /usr/src/sys/vm/vm_object.c:958
#9  0xc07cd2a3 in vm_object_page_clean (object=0xc1990b58, start=628, end=Unhandled dwarf expression opcode 0x93
)
    at /usr/src/sys/vm/vm_object.c:761
#10 0xc07cda77 in vm_object_sync (object=0xc1990b58, offset=2572288, size=4096, 
    syncio=0, invalidate=0) at /usr/src/sys/vm/vm_object.c:1026
#11 0xc07c9267 in vm_map_sync (map=0xc147a384, start=676794368, end=676798464, syncio=0, 
    invalidate=0) at /usr/src/sys/vm/vm_map.c:2149
#12 0xc07cb4e1 in msync (td=0x0, uap=0x0) at /usr/src/sys/vm/vm_mmap.c:465
#13 0xc0828553 in syscall (frame=
   [ snip ]

(Interesting kgdb buglet on line 9 of that trace, by the way.)

Based on a bit more snooping, it's a basic deadlock.  The ffs_sync()
run for the snapshot is trying to lock the vnode at 0xc19af318.  That
vnode is already locked by innd, and innd is waiting for the suspended
filesystem, so it won't release the vnode.
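
Laid side by side, the two stacks above come down to:

    mksnap_ffs (ffs_snapshot -> vfs_write_suspend -> ffs_sync):
        has marked /news suspended (MNTK_SUSPEND)
        wants the lock on vnode 0xc19af318          <- asleep on "ufs"

    innd (msync -> vm_object_sync -> ... -> vnode_pager_putpages):
        holds the lock on vnode 0xc19af318
        wants vn_start_write() on the suspended fs  <- asleep on "suspfs"

Neither can make progress until the other gives up what it holds.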

[ chasing down the owner of the vnode ]

(kgdb) print vp
$7 = (struct vnode *) 0xc19af318
(kgdb) print *vp
$8 = {v_interlock = {mtx_object = {lo_class = 0xc08f945c, 
      lo_name = 0xc08a5792 "vnode interlock", lo_type = 0xc08a5792 "vnode interlock", 
      lo_flags = 196608, lo_list = {tqe_next = 0xc19af210, tqe_prev = 0xc19af430}, 
      lo_witness = 0x0}, mtx_lock = 4, mtx_recurse = 0}, v_iflag = 1536, v_usecount = 3, 
  v_numoutput = 0, v_vxthread = 0x0, v_holdcnt = 4, v_cleanblkhd = {tqh_first = 0x0, 
    tqh_last = 0xc19af350}, v_cleanblkroot = 0x0, v_cleanbufcnt = 0, v_dirtyblkhd = {
    tqh_first = 0xc5b36570, tqh_last = 0xc5bb2634}, v_dirtyblkroot = 0xc5b36570, 
  v_dirtybufcnt = 4, v_vflag = 8, v_writecount = 1, v_object = 0xc1990b58, 
  v_lastw = 152, v_cstart = 152, v_lasta = 2275200, v_clen = 7, v_un = {
    vu_mountedhere = 0x0, vu_socket = 0x0, vu_spec = {vu_cdev = 0x0, vu_specnext = {
        sle_next = 0x0}}, vu_fifoinfo = 0x0}, v_freelist = {tqe_next = 0x0, 
    tqe_prev = 0x0}, v_nmntvnodes = {tqe_next = 0xc1a67420, tqe_prev = 0xc19af4b0}, 
  v_synclist = {le_next = 0xc1901528, le_prev = 0xc19fd6c8}, v_type = VREG, 
  v_tag = 0xc08a3f13 "ufs", v_data = 0xc199108c, v_lock = {lk_interlock = 0xc095e4cc, 
    lk_flags = 18875456, lk_sharecount = 0, lk_waitcount = 1, lk_exclusivecount = 1, 
    lk_prio = 80, lk_wmesg = 0xc08a3f13 "ufs", lk_timo = 6, lk_lockholder = 0xc1c16c00, 
    lk_newlock = 0x0}, v_vnlock = 0xc19af3c4, v_op = 0xc160c500, v_mount = 0xc15e5c00, 
  v_cache_src = {lh_first = 0x0}, v_cache_dst = {tqh_first = 0xc1986e9c, 
    tqh_last = 0xc1986eac}, v_id = 1413, v_dd = 0xc19af318, v_ddid = 0, 
  v_pollinfo = 0x0, v_label = 0x0, v_cachedfs = 1041, v_cachedid = 141320, 
  v_bsize = 16384}
(kgdb) print vp->v_lock.lk_lockholder
$9 = (struct thread *) 0xc1c16c00
(kgdb) print *vp->v_lock.lk_lockholder
$10 = {td_proc = 0xc1c4e1c4, td_ksegrp = 0xc1a1d5b0, td_plist = {tqe_next = 0x0, 
    tqe_prev = 0xc1c4e1d4}, td_kglist = {tqe_next = 0x0, tqe_prev = 0xc1a1d5bc}, 
  td_slpq = {tqe_next = 0xc2b55c00, tqe_prev = 0xc18b1b00}, td_lockq = {tqe_next = 0x0, 
    tqe_prev = 0x0}, td_runq = {tqe_next = 0x0, tqe_prev = 0xc08fb5ac}, td_selq = {
    tqh_first = 0x0, tqh_last = 0xc1c16c30}, td_sleepqueue = 0x0, 
  td_turnstile = 0xc1824a80, td_tid = 100188, td_flags = 0, td_inhibitors = 2, 
  td_pflags = 0, td_dupfd = 0, td_wchan = 0xc15e5c6c, td_wmesg = 0xc08a5c44 "suspfs", 
  td_lastcpu = 0 '\0', td_oncpu = 255 'ÿ', td_locks = 5360, td_blocked = 0x0, 
  td_ithd = 0x0, td_lockname = 0x0, td_contested = {lh_first = 0x0}, 
  td_sleeplocks = 0x0, td_intr_nesting_level = 0, td_pinned = 0, td_mailbox = 0x0, 
  td_ucred = 0xc241cb00, td_standin = 0x0, td_prticks = 0, td_upcall = 0x0, 
  td_sticks = 42112, td_uuticks = 0, td_usticks = 0, td_intrval = 0, td_oldsigmask = {
    __bits = {0, 0, 0, 0}}, td_sigmask = {__bits = {0, 0, 0, 0}}, td_siglist = {
    __bits = {0, 0, 0, 0}}, td_umtx = {tqe_next = 0x0, tqe_prev = 0x0}, 
  td_generation = 2863987, td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 4}, 
  td_kflags = 0, td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0, 
  td_base_pri = 159 '\237', td_priority = 159 '\237', td_pcb = 0xd022bd90, 
  td_state = TDS_INHIBITED, td_retval = {0, 1}, td_slpcallout = {c_links = {sle = {
        sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xc5ad0718}}, 
    c_time = 112874447, c_arg = 0xc1c16c00, c_func = 0, c_flags = 8}, 
  td_frame = 0xd022bd38, td_kstack_obj = 0xc1b0d108, td_kstack = 3491930112, 
  td_kstack_pages = 2, td_altkstack_obj = 0x0, td_altkstack = 0, td_altkstack_pages = 0, 
  td_critnest = 1, td_md = {md_savecrit = 582}, td_sched = 0xc1c16d50}
(kgdb) print *vp->v_lock.lk_lockholder->td_proc 
$11 = {p_list = {le_next = 0xc1c1354c, le_prev = 0xc195e8d4}, p_ksegrps = {
    tqh_first = 0xc1a1d5b0, tqh_last = 0xc1a1d5b4}, p_threads = {tqh_first = 0xc1c16c00, 
    tqh_last = 0xc1c16c08}, p_suspended = {tqh_first = 0x0, tqh_last = 0xc1c4e1dc}, 
  p_ucred = 0xc241cb00, p_fd = 0xc1fe3300, p_fdtol = 0x0, p_stats = 0xc1fd9500, 
  p_limit = 0xc1fd7700, p_unused1 = 0x0, p_sigacts = 0xc1a85000, p_flag = 1, 
  p_sflag = 1, p_state = PRS_NORMAL, p_pid = 97648, p_hash = {le_next = 0x0, 
    le_prev = 0xc2cd3ca8}, p_pglist = {le_next = 0xc1c4e54c, le_prev = 0xc2a00c48}, 
  p_pptr = 0xc146ee20, p_sibling = {le_next = 0xc1c1354c, le_prev = 0xc195e934}, 
  p_children = {lh_first = 0xc1c4e54c}, p_mtx = {mtx_object = {lo_class = 0xc08f945c, 
      lo_name = 0xc089e844 "process lock", lo_type = 0xc089e844 "process lock", 
      lo_flags = 4390912, lo_list = {tqe_next = 0xc1c4e3f4, tqe_prev = 0xc1c4e07c}, 
      lo_witness = 0x0}, mtx_lock = 4, mtx_recurse = 0}, p_oppid = 0, 
  p_vmspace = 0xc147a384, p_swtime = 10816, p_realtimer = {it_interval = {tv_sec = 0, 
      tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}, p_runtime = {sec = 457, 
    frac = 14633034536159281822}, p_uu = 128520517, p_su = 329272740, p_iu = 1, 
  p_uticks = 16437, p_sticks = 42112, p_iticks = 0, p_profthreads = 0, 
  p_maxthrwaits = 0, p_traceflag = 0, p_tracevp = 0x0, p_tracecred = 0x0, 
  p_textvp = 0xc19aa210, p_siglist = {__bits = {0, 0, 0, 0}}, p_lock = 0 '\0', 
  p_sigiolst = {slh_first = 0x0}, p_sigparent = 20, p_sig = 0, p_code = 0, p_stops = 0, 
  p_stype = 0, p_step = 0 '\0', p_pfsflags = 0 '\0', p_nlminfo = 0x0, p_aioinfo = 0x0, 
  p_singlethread = 0x0, p_suspcount = 0, p_xthread = 0x0, p_boundary_count = 0, 
  p_magic = 3203398350, p_comm = "innd\000tart\000ic\000\000\000\000\000\000\000", 
  p_pgrp = 0xc2a00c40, p_sysent = 0xc093a320, p_args = 0xc2a09380, 
  p_cpulimit = 9223372036854775807, p_nice = 0 '\0', p_xstat = 0, p_klist = {
    kl_lock = 0xc1c4e230, kl_list = {slh_first = 0x0}}, p_numthreads = 1, 
  p_numksegrps = 1, p_md = {md_ldt = 0xc1c569a0}, p_itcallout = {c_links = {sle = {
        sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, 
    c_arg = 0x0, c_func = 0, c_flags = 8}, p_unused2 = 0x0, p_acflag = 1, p_ru = 0x0, 
  p_peers = 0x0, p_leader = 0xc1c4e1c4, p_emuldata = 0x0, p_label = 0x0, 
  p_sched = 0xc1c4e388}

OK, so why is inn holding a locked vnode while it's sleeping for the
suspend?  That's pretty clear -- vm_object_sync(), at line 1022, does a
vn_lock(vp), and vp happens to be the vnode in question (0xc19af318).

Now for the really hard question:  Fixes.  The obvious (but not so
elegant) fix is for the snapshot creation code to run under Giant,
which will hold vm_object_sync out at the correct place.

But that's *adding* Giant, which seems like a bad idea.

Also, it seems like this bug should be easy to reproduce by having
one process do a lot of msync() while another creates snapshots.  I
don't have a box that's scratch enough to try running that on,
though.
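
Something along these lines ought to cover the msync() half of it
(untested sketch; the path is a placeholder for a file on whatever
filesystem is being snapshotted), with dump -L or mksnap_ffs run in a
loop against that filesystem from another shell:

    /*
     * Dirty a mapped file and msync() it in a tight loop, to collide
     * with snapshot creation on the same filesystem.
     */
    #include <sys/mman.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char *path = "/news/msync-test";  /* placeholder path */
        size_t len = 1024 * 1024;
        char *p;
        int fd;

        if ((fd = open(path, O_RDWR | O_CREAT, 0644)) == -1)
            err(1, "open %s", path);
        if (ftruncate(fd, len) == -1)
            err(1, "ftruncate");
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err(1, "mmap");
        for (;;) {
            /* Dirty a random page, then push it out synchronously. */
            p[arc4random() % len] = 1;
            if (msync(p, len, MS_SYNC) == -1)
                err(1, "msync");
        }
    }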

-- 
Steve Watt KD6GGD  PP-ASEL-IA          ICBM: 121W 56' 57.8" / 37N 20' 14.9"
 Internet: steve @ Watt.COM                         Whois: SW32
   Free time?  There's no such thing.  It just comes in varying prices...

