race conditions for destroying and opening a dev

Matthew Fleming mdf356 at gmail.com
Thu Sep 16 19:45:50 UTC 2010


On Thu, Sep 16, 2010 at 12:00 PM, Matthew Jacob <mj at feral.com> wrote:
>
> Has anyone seen this scenario before? I am seeing it in RELENG_7, but the
> code in question exists through to head.
>
> Thread 1:
>
> (kgdb) where
> #0  sched_switch (td=0xffffff003a04ea80, newtd=0xffffff00210b4000,
> flags=Variable "flags" is not available.
> ) at ../../../kern/sched_ule.c:1944
> #1  0xffffffff803b6091 in mi_switch (flags=1, newtd=0x0) at
> ../../../kern/kern_synch.c:450
> #2  0xffffffff80402399 in sleepq_switch (wchan=0xffffff8413b50b60) at
> ../../../kern/subr_sleepqueue.c:497
> #3  0xffffffff80402e8c in sleepq_timedwait (wchan=0xffffff8413b50b60) at
> ../../../kern/subr_sleepqueue.c:615
> #4  0xffffffff803b682d in _sleep (ident=0xffffff8413b50b60,
> lock=0xffffffff80b0ee00, priority=76, wmesg=0xffffffff806583bb "devdrn",
> timo=100) at ../../../kern/kern_synch.c:228
> #5  0xffffffff8037640c in destroy_devl (dev=0xffffff003aaf0000) at
> ../../../kern/kern_conf.c:874
> #6  0xffffffff80376759 in destroy_dev (dev=0xffffff003aaf0000) at
> ../../../kern/kern_conf.c:916
> #7  0xffffffff8034c939 in g_dev_orphan (cp=0xffffff003a544800) at
> ../../../geom/geom_dev.c:438
> #8  0xffffffff803506a0 in g_run_events () at ../../../geom/geom_event.c:164
> #9  0xffffffff80351f1c in g_event_procbody () at
> ../../../geom/geom_kern.c:141
> #10 0xffffffff8038a73a in fork_exit (callout=0xffffffff80351eb0
> <g_event_procbody at ../../../geom/geom_kern.c:132>, arg=0x0,
> frame=0xffffff8413b50c80) at ../../../kern/kern_fork.c:829
> #11 0xffffffff805a747e in fork_trampoline () at
> ../../../amd64/amd64/exception.S:564
> #12 0x0000000000000000 in ?? ()
>
> This thread is waiting for the threadcount to drain, i.e., for the
> last close of the device ("da16" in this case) to occur.
>
> Thread 2:
>
> (kgdb) where
> #0  sched_switch (td=0xffffff009bb4ca80, newtd=0xffffff003af43380,
> flags=Variable "flags" is not available.
> ) at ../../../kern/sched_ule.c:1944
> #1  0xffffffff803b6091 in mi_switch (flags=1, newtd=0x0) at
> ../../../kern/kern_synch.c:450
> #2  0xffffffff80402399 in sleepq_switch (wchan=0xffffffff80b0e040) at
> ../../../kern/subr_sleepqueue.c:497
> #3  0xffffffff80402f84 in sleepq_wait (wchan=0xffffffff80b0e040) at
> ../../../kern/subr_sleepqueue.c:580
> #4  0xffffffff803b5385 in _sx_xlock_hard (sx=0xffffffff80b0e040,
> tid=18446742976810240640, opts=Variable "opts" is not available.
> ) at ../../../kern/kern_sx.c:562
> #5  0xffffffff803b5731 in _sx_xlock (sx=0xffffffff80b0e040, opts=0,
> file=0xffffffff80652d27 "../../../geom/geom_dev.c", line=196) at sx.h:154
> #6  0xffffffff8034d1bc in g_dev_open (dev=0xffffff003aaf0000, flags=1,
> fmt=Variable "fmt" is not available.
> ) at ../../../geom/geom_dev.c:196
> #7  0xffffffff80333741 in devfs_open (ap=0xffffff841dea88b0) at
> ../../../fs/devfs/devfs_vnops.c:902
> #8  0xffffffff80601daf in VOP_OPEN_APV (vop=0xffffffff8089fb80,
> a=0xffffff841dea88b0) at vnode_if.c:371
> #9  0xffffffff80467246 in vn_open_cred (ndp=0xffffff841dea8a00,
> flagp=0xffffff841dea894c, cmode=Variable "cmode" is not available.
> ) at vnode_if.h:199
> #10 0xffffffff80463770 in kern_open (td=0xffffff009bb4ca80, path=0x5114a0
> <Address 0x5114a0 out of bounds>, pathseg=Variable "pathseg" is not
> available.
> ) at ../../../kern/vfs_syscalls.c:1054
> #11 0xffffffff805c599e in syscall (frame=0xffffff841dea8c80) at
> ../../../amd64/amd64/trap.c:911
> #12 0xffffffff805a723b in Xfast_syscall () at
> ../../../amd64/amd64/exception.S:349
> #13 0x00000008009a219c in ?? ()
>
> This thread was opening the device and had bumped the refcount, but
> then wedged on the GEOM topology lock.
>
> The refcount field is protected by devmtx.
>
> Anyone seen this?
>
> I'm half inclined either to set CDP_SCHED_DTR when destroy_dev is
> called, or to make dev_refthread check CDP_ACTIVE; I'm leaning toward
> the latter.
>
> Any thoughts on this?

We had a similar bug at Isilon, but in our case it was in
cam/scsi/scsi_pass.c::passcleanup() calling destroy_dev().  We
switched it to destroy_dev_sched() to fix the si_threadcount deadlock.

Cheers,
matthew
