solaris assert: avl_is_empty(&dn->dn_dbufs) panic
Justin T. Gibbs
gibbs at scsiguy.com
Sun Aug 23 21:22:21 UTC 2015
Hi,
I'll need a little time to fully reload the context for these changes. However, reintroducing a blocking loop is not the right fix - it was a hack in the original code. :-) My hunch is that removing the assert is safe, but it would be nice to have a core dump to better understand why the list isn't empty.
--
Justin
> On Aug 21, 2015, at 12:38 PM, Xin Li <delphij at delphij.net> wrote:
>
> Hi,
>
> A quick glance at the changes suggests that Justin's changeset may be
> related. The reasoning is here:
>
> https://reviews.csiden.org/r/131/
>
> Related Illumos ticket:
>
> https://www.illumos.org/issues/5056
>
> In dnode_evict_dbufs(), remove multiple passes over dn->dn_dbufs.
> This is possible now that objset eviction is asynchronously
> completed in a different context once dbuf eviction completes.
>
> In the case of objset eviction, any dbufs held by children will
> be evicted via dbuf_rele_and_unlock() once their refcounts go
> to zero. Even when objset eviction is not active, the ordering
> of the avl tree guarantees that children will be released before
> parents, allowing the parent's refcounts to naturally drop to
> zero before they are inspected in this single loop.
>
> ====
>
> So, upon return from dnode_evict_dbufs(), there could still be some
> DB_EVICTING buffers on the AVL tree pending release, which would break
> the invariant.
>
> Should we restore the loop where we yield briefly with the lock
> released, then reacquire and recheck?
>
> Cheers,
>
> On 08/21/15 10:48, Don Lewis wrote:
>> On 21 Aug, Don Lewis wrote:
>>> On 21 Aug, Don Lewis wrote:
>>>> I just started getting this panic:
>>>>
>>>> solaris assert: avl_is_empty(&dn->dn_dbufs), file:
>>>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c,
>>>> line 495
>>>>
>>>> System info:
>>>> FreeBSD zipper.catspoiler.org 11.0-CURRENT FreeBSD 11.0-CURRENT #25 r286923: Wed Aug 19 09:28:53 PDT 2015 dl at zipper.catspoiler.org:/usr/obj/usr/src/sys/GENERIC amd64
>>>>
>>>> My zfs pool has one mirrored vdev. Scrub doesn't find any problems.
>>>>
>>>> %zpool status
>>>> pool: zroot
>>>> state: ONLINE
>>>> scan: scrub repaired 0 in 2h58m with 0 errors on Fri Aug 21 00:44:52 2015
>>>> config:
>>>>
>>>> NAME        STATE     READ WRITE CKSUM
>>>> zroot       ONLINE       0     0     0
>>>>   mirror-0  ONLINE       0     0     0
>>>>     ada0p3  ONLINE       0     0     0
>>>>     ada1p3  ONLINE       0     0     0
>>>>
>>>> This panic is reproducible and happens every time I use poudriere to
>>>> build ports using my 9.3-RELEASE amd64 jail and occurs at the end of the
>>>> poudriere run when it is unmounting filesystems.
>>>>
>>>> [00:10:43] ====>> Stopping 4 builders
>>>> 93amd64-default-job-01: removed
>>>> 93amd64-default-job-01-n: removed
>>>> 93amd64-default-job-02: removed
>>>> 93amd64-default-job-02-n: removed
>>>> 93amd64-default-job-03: removed
>>>> 93amd64-default-job-03-n: removed
>>>> 93amd64-default-job-04: removed
>>>> 93amd64-default-job-04-n: removed
>>>> [00:10:46] ====>> Creating pkgng repository
>>>> Creating repository in /tmp/packages: 100%
>>>> Packing files for repository: 100%
>>>> [00:10:55] ====>> Committing packages to repository
>>>> [00:10:55] ====>> Removing old packages
>>>> [00:10:55] ====>> Built ports: devel/py-pymtbl net/sie-nmsg net/p5-Net-Nmsg net/axa
>>>> [93amd64-default] [2015-08-21_00h47m41s] [committing:] Queued: 4 Built: 4 Failed: 0 Skipped: 0 Ignored: 0 Tobuild: 0 Time: 00:10:53
>>>> [00:10:55] ====>> Logs: /var/poudriere/data/logs/bulk/93amd64-default/2015-08-21_00h47m41s
>>>> [00:10:55] ====>> Cleaning up
>>>> 93amd64-default: removed
>>>> 93amd64-default-n: removed
>>>> [00:10:55] ====>> Umounting file systems
>>>> Write failed: Broken pipe
>>>>
>>>> Prior to that, I ran poudriere a number of times with a 10.2-STABLE
>>>> amd64 jail without incident.
>>>>
>>>> I've kicked off a bunch of poudriere runs for other jails and
>>>> will check on it in the morning.
>>>
>>> Died the same way after building ports on the first jail,
>>> 10.1-RELEASE amd64.
>>>
>>> Since there have been some zfs commits since r286923, I upgraded to
>>> r286998 this morning and tried again with no better luck. I got the
>>> same panic again.
>>>
>>> This machine has mirrored swap, and even though I've done what
>>> gmirror(8) says to do in order to capture crash dumps, I've had no luck
>>> with that. The dump is getting written, but savecore is unable to find
>>> it.
>>
>> Not sure what is happening with savecore during boot, but I was able to
>> run it manually and collect the crash dump.
>>
>>
>> Unread portion of the kernel message buffer:
>> panic: solaris assert: avl_is_empty(&dn->dn_dbufs), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c, line: 495
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0859e4e4e0
>> vpanic() at vpanic+0x189/frame 0xfffffe0859e4e560
>> panic() at panic+0x43/frame 0xfffffe0859e4e5c0
>> assfail() at assfail+0x1a/frame 0xfffffe0859e4e5d0
>> dnode_sync() at dnode_sync+0x6c8/frame 0xfffffe0859e4e6b0
>> dmu_objset_sync_dnodes() at dmu_objset_sync_dnodes+0x2b/frame 0xfffffe0859e4e6e0
>> dmu_objset_sync() at dmu_objset_sync+0x29e/frame 0xfffffe0859e4e7b0
>> dsl_pool_sync() at dsl_pool_sync+0x348/frame 0xfffffe0859e4e820
>> spa_sync() at spa_sync+0x442/frame 0xfffffe0859e4e910
>> txg_sync_thread() at txg_sync_thread+0x23d/frame 0xfffffe0859e4e9f0
>> fork_exit() at fork_exit+0x84/frame 0xfffffe0859e4ea30
>> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0859e4ea30
>> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>> KDB: enter: panic
>>
>>
>>
>> (kgdb) bt
>> #0 doadump (textdump=0) at pcpu.h:221
>> #1 0xffffffff8037bb86 in db_fncall (dummy1=<value optimized out>,
>> dummy2=<value optimized out>, dummy3=<value optimized out>,
>> dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:568
>> #2 0xffffffff8037b941 in db_command (cmd_table=0x0)
>> at /usr/src/sys/ddb/db_command.c:440
>> #3 0xffffffff8037b5d4 in db_command_loop ()
>> at /usr/src/sys/ddb/db_command.c:493
>> #4 0xffffffff8037e18b in db_trap (type=<value optimized out>, code=0)
>> at /usr/src/sys/ddb/db_main.c:251
>> #5 0xffffffff80a5b294 in kdb_trap (type=3, code=0, tf=<value optimized out>)
>> at /usr/src/sys/kern/subr_kdb.c:654
>> #6 0xffffffff80e6a4b1 in trap (frame=0xfffffe0859e4e410)
>> at /usr/src/sys/amd64/amd64/trap.c:540
>> #7 0xffffffff80e49f22 in calltrap ()
>> at /usr/src/sys/amd64/amd64/exception.S:235
>> #8 0xffffffff80a5a96e in kdb_enter (why=0xffffffff81379010 "panic",
>> msg=0xffffffff80a60b60 "UH\211\ufffdAWAVATSH\203\ufffdPI\211\ufffdA\211\ufffdH\213\004%Py\ufffd\201H\211E\ufffd\201<%\ufffd\210\ufffd\201") at cpufunc.h:63
>> #9 0xffffffff80a1e2c9 in vpanic (fmt=<value optimized out>,
>> ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:619
>> #10 0xffffffff80a1e333 in panic (fmt=0xffffffff81aafa90 "\004")
>> at /usr/src/sys/kern/kern_shutdown.c:557
>> #11 0xffffffff8240922a in assfail (a=<value optimized out>,
>> f=<value optimized out>, l=<value optimized out>)
>> at /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:81
>> #12 0xffffffff820d4f78 in dnode_sync (dn=0xfffff8040b72d3d0,
>> tx=0xfffff8001598ec00)
>> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c:495
>> #13 0xffffffff820c922b in dmu_objset_sync_dnodes (list=0xfffff80007712b90,
>> newlist=<value optimized out>, tx=<value optimized out>)
>> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1045
>> #14 0xffffffff820c8ede in dmu_objset_sync (os=0xfffff80007712800,
>> pio=<value optimized out>, tx=0xfffff8001598ec00)
>> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1163
>> #15 0xffffffff820e8e78 in dsl_pool_sync (dp=0xfffff80015676000, txg=2660975)
>> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:536
>> #16 0xffffffff8210dca2 in spa_sync (spa=0xfffffe00089c6000, txg=2660975)
>> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:6641
>> #17 0xffffffff8211843d in txg_sync_thread (arg=0xfffff80015676000)
>> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:517
>> #18 0xffffffff809e47c4 in fork_exit (
>> callout=0xffffffff82118200 <txg_sync_thread>, arg=0xfffff80015676000,
>> frame=0xfffffe0859e4ea40) at /usr/src/sys/kern/kern_fork.c:1006
>> #19 0xffffffff80e4a45e in fork_trampoline ()
>> at /usr/src/sys/amd64/amd64/exception.S:610
>> #20 0x0000000000000000 in ?? ()
>> Current language: auto; currently minimal
>>
>> _______________________________________________
>> freebsd-fs at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>>
>
>
> --
> Xin LI <delphij at delphij.net> https://www.delphij.net/
> FreeBSD - The Power to Serve! Live free or die
>