solaris assert: avl_is_empty(&dn->dn_dbufs) panic

Justin T. Gibbs gibbs at scsiguy.com
Sun Aug 23 21:22:21 UTC 2015


Hi,

I'll need a little time to fully reload the context for these changes. However, reintroducing a blocking loop is not the right fix - it was a hack in the original code. :-) My hunch is that removing the assert is safe, but it would be nice to have a core dump to better understand why the list isn't empty.
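
For anyone without the source handy, the failing check sits right
after the eviction call; paraphrasing dnode_sync_free() in
dnode_sync.c around line 495 (a from-memory sketch, not an exact
excerpt):

	dnode_undirty_dbufs(&dn->dn_dirty_records[txgoff]);
	dnode_evict_dbufs(dn);
	ASSERT(avl_is_empty(&dn->dn_dbufs));	/* this is what fires */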

--
Justin

> On Aug 21, 2015, at 12:38 PM, Xin Li <delphij at delphij.net> wrote:
> 
> Hi,
> 
> A quick glance at the changes suggests that Justin's changeset may be
> related.  The reasoning is here:
> 
> https://reviews.csiden.org/r/131/
> 
> Related Illumos ticket:
> 
> https://www.illumos.org/issues/5056
> 
> In dnode_evict_dbufs(), remove multiple passes over dn->dn_dbufs.
> This is possible now that objset eviction is asynchronously
> completed in a different context once dbuf eviction completes.
> 
> In the case of objset eviction, any dbufs held by children will
> be evicted via dbuf_rele_and_unlock() once their refcounts go
> to zero.  Even when objset eviction is not active, the ordering
> of the avl tree guarantees that children will be released before
> parents, allowing the parent's refcounts to naturally drop to
> zero before they are inspected in this single loop.
> 
> ====
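> 
> To make the single-pass argument concrete, here is a toy userland
> model (my own sketch, emphatically not the ZFS code): buffers are
> visited in an order where every child precedes its parent, so each
> parent's hold count has already drained to zero by the time the
> pass reaches it.
> 
> /*
>  * Toy model of the dn_dbufs ordering argument; plain C, not ZFS.
>  * Each buffer holds a reference on its parent; visiting children
>  * before parents lets one pass release everything.
>  */
> #include <assert.h>
> #include <stdio.h>
> 
> struct buf {
> 	struct buf *parent;
> 	int holds;		/* one hold per live child */
> };
> 
> static void
> release(struct buf *b)
> {
> 	assert(b->holds == 0);	/* all children already released */
> 	if (b->parent != NULL)
> 		b->parent->holds--;	/* drop our hold on the parent */
> }
> 
> int
> main(void)
> {
> 	struct buf root = { NULL, 2 };
> 	struct buf mid = { &root, 1 };
> 	struct buf leaf1 = { &mid, 0 };
> 	struct buf leaf2 = { &root, 0 };
> 	/* Children sort before parents, as in the dn_dbufs AVL tree. */
> 	struct buf *pass[] = { &leaf1, &leaf2, &mid, &root };
> 
> 	for (int i = 0; i < 4; i++)
> 		release(pass[i]);
> 	printf("single ordered pass released every buffer\n");
> 	return (0);
> }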
> 
> So, upon return from dnode_evict_dbufs(), there could still be some
> DB_EVICTING buffers on the AVL tree pending release, which would
> break the invariant.
> 
> Should we restore the loop where we yield briefly with the lock
> released, then reacquire and recheck?
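> 
> Something like the following shape, say; dnode_evict_one_pass() is
> an invented placeholder for one sweep of the existing eviction
> logic, not a real function:
> 
> 	/* Hypothetical sketch only, not a proposed patch. */
> 	mutex_enter(&dn->dn_dbufs_mtx);
> 	while (!avl_is_empty(&dn->dn_dbufs)) {
> 		dnode_evict_one_pass(dn);	/* invented helper */
> 		/*
> 		 * Yield with the lock dropped so that pending
> 		 * dbuf_rele_and_unlock() calls can drain.
> 		 */
> 		mutex_exit(&dn->dn_dbufs_mtx);
> 		delay(1);
> 		mutex_enter(&dn->dn_dbufs_mtx);
> 	}
> 	mutex_exit(&dn->dn_dbufs_mtx);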
> 
> Cheers,
> 
> On 08/21/15 10:48, Don Lewis wrote:
>> On 21 Aug, Don Lewis wrote:
>>> On 21 Aug, Don Lewis wrote:
>>>> I just started getting this panic:
>>>> 
>>>> solaris assert: avl_is_empty(&dn->dn_dbufs), file:
>>>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c,
>>>> line: 495
>>>> 
>>>> System info:
>>>> FreeBSD zipper.catspoiler.org 11.0-CURRENT FreeBSD 11.0-CURRENT #25 r286923: Wed Aug 19 09:28:53 PDT 2015     dl at zipper.catspoiler.org:/usr/obj/usr/src/sys/GENERIC  amd64
>>>> 
>>>> My zfs pool has one mirrored vdev.  Scrub doesn't find any problems.
>>>> 
>>>> %zpool status
>>>>  pool: zroot
>>>> state: ONLINE
>>>>  scan: scrub repaired 0 in 2h58m with 0 errors on Fri Aug 21 00:44:52 2015
>>>> config:
>>>> 
>>>> 	NAME        STATE     READ WRITE CKSUM
>>>> 	zroot       ONLINE       0     0     0
>>>> 	  mirror-0  ONLINE       0     0     0
>>>> 	    ada0p3  ONLINE       0     0     0
>>>> 	    ada1p3  ONLINE       0     0     0
>>>> 
>>>> This panic is reproducible and happens every time I use poudriere
>>>> to build ports in my 9.3-RELEASE amd64 jail.  It occurs at the end
>>>> of the poudriere run, while the file systems are being unmounted.
>>>> 
>>>> [00:10:43] ====>> Stopping 4 builders
>>>> 93amd64-default-job-01: removed
>>>> 93amd64-default-job-01-n: removed
>>>> 93amd64-default-job-02: removed
>>>> 93amd64-default-job-02-n: removed
>>>> 93amd64-default-job-03: removed
>>>> 93amd64-default-job-03-n: removed
>>>> 93amd64-default-job-04: removed
>>>> 93amd64-default-job-04-n: removed
>>>> [00:10:46] ====>> Creating pkgng repository
>>>> Creating repository in /tmp/packages: 100%
>>>> Packing files for repository: 100%
>>>> [00:10:55] ====>> Committing packages to repository
>>>> [00:10:55] ====>> Removing old packages
>>>> [00:10:55] ====>> Built ports: devel/py-pymtbl net/sie-nmsg net/p5-Net-Nmsg net/axa
>>>> [93amd64-default] [2015-08-21_00h47m41s] [committing:] Queued: 4  Built: 4  Failed: 0  Skipped: 0  Ignored: 0  Tobuild: 0   Time: 00:10:53
>>>> [00:10:55] ====>> Logs: /var/poudriere/data/logs/bulk/93amd64-default/2015-08-21_00h47m41s
>>>> [00:10:55] ====>> Cleaning up
>>>> 93amd64-default: removed
>>>> 93amd64-default-n: removed
>>>> [00:10:55] ====>> Umounting file systems
>>>> Write failed: Broken pipe
>>>> 
>>>> Prior to that, I ran poudriere a number of times with a 10.2-STABLE
>>>> amd64 jail without incident.
>>>> 
>>>> I've kicked off a bunch of poudriere runs for other jails and
>>>> will check on it in the morning.
>>> 
>>> Died the same way after building ports on the first jail,
>>> 10.1-RELEASE amd64.
>>> 
>>> Since there have been some zfs commits since r286923, I upgraded to
>>> r286998 this morning and tried again with no better luck.  I got the
>>> same panic again.
>>> 
>>> This machine has mirrored swap, and even though I've done what
>>> gmirror(8) says to do in order to capture crash dumps, I've had no luck
>>> with that.  The dump is getting written, but savecore is unable to find
>>> it.
>> 
>> Not sure what is happening with savecore during boot, but I was able to
>> run it manually and collect the crash dump.
>> 
>> 
>> Unread portion of the kernel message buffer:
>> panic: solaris assert: avl_is_empty(&dn->dn_dbufs), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c, line: 495
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0859e4e4e0
>> vpanic() at vpanic+0x189/frame 0xfffffe0859e4e560
>> panic() at panic+0x43/frame 0xfffffe0859e4e5c0
>> assfail() at assfail+0x1a/frame 0xfffffe0859e4e5d0
>> dnode_sync() at dnode_sync+0x6c8/frame 0xfffffe0859e4e6b0
>> dmu_objset_sync_dnodes() at dmu_objset_sync_dnodes+0x2b/frame 0xfffffe0859e4e6e0
>> dmu_objset_sync() at dmu_objset_sync+0x29e/frame 0xfffffe0859e4e7b0
>> dsl_pool_sync() at dsl_pool_sync+0x348/frame 0xfffffe0859e4e820
>> spa_sync() at spa_sync+0x442/frame 0xfffffe0859e4e910
>> txg_sync_thread() at txg_sync_thread+0x23d/frame 0xfffffe0859e4e9f0
>> fork_exit() at fork_exit+0x84/frame 0xfffffe0859e4ea30
>> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0859e4ea30
>> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>> KDB: enter: panic
>> 
>> 
>> 
>> (kgdb) bt
>> #0  doadump (textdump=0) at pcpu.h:221
>> #1  0xffffffff8037bb86 in db_fncall (dummy1=<value optimized out>, 
>>    dummy2=<value optimized out>, dummy3=<value optimized out>, 
>>    dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:568
>> #2  0xffffffff8037b941 in db_command (cmd_table=0x0)
>>    at /usr/src/sys/ddb/db_command.c:440
>> #3  0xffffffff8037b5d4 in db_command_loop ()
>>    at /usr/src/sys/ddb/db_command.c:493
>> #4  0xffffffff8037e18b in db_trap (type=<value optimized out>, code=0)
>>    at /usr/src/sys/ddb/db_main.c:251
>> #5  0xffffffff80a5b294 in kdb_trap (type=3, code=0, tf=<value optimized out>)
>>    at /usr/src/sys/kern/subr_kdb.c:654
>> #6  0xffffffff80e6a4b1 in trap (frame=0xfffffe0859e4e410)
>>    at /usr/src/sys/amd64/amd64/trap.c:540
>> #7  0xffffffff80e49f22 in calltrap ()
>>    at /usr/src/sys/amd64/amd64/exception.S:235
>> #8  0xffffffff80a5a96e in kdb_enter (why=0xffffffff81379010 "panic", 
>>    msg=0xffffffff80a60b60 "...") at cpufunc.h:63
>> #9  0xffffffff80a1e2c9 in vpanic (fmt=<value optimized out>, 
>>    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:619
>> #10 0xffffffff80a1e333 in panic (fmt=0xffffffff81aafa90 "\004")
>>    at /usr/src/sys/kern/kern_shutdown.c:557
>> #11 0xffffffff8240922a in assfail (a=<value optimized out>, 
>>    f=<value optimized out>, l=<value optimized out>)
>>    at /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:81
>> #12 0xffffffff820d4f78 in dnode_sync (dn=0xfffff8040b72d3d0, 
>>    tx=0xfffff8001598ec00)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c:495
>> #13 0xffffffff820c922b in dmu_objset_sync_dnodes (list=0xfffff80007712b90, 
>>    newlist=<value optimized out>, tx=<value optimized out>)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1045
>> #14 0xffffffff820c8ede in dmu_objset_sync (os=0xfffff80007712800, 
>>    pio=<value optimized out>, tx=0xfffff8001598ec00)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1163
>> #15 0xffffffff820e8e78 in dsl_pool_sync (dp=0xfffff80015676000, txg=2660975)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:536
>> #16 0xffffffff8210dca2 in spa_sync (spa=0xfffffe00089c6000, txg=2660975)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:6641
>> #17 0xffffffff8211843d in txg_sync_thread (arg=0xfffff80015676000)
>>    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:517
>> #18 0xffffffff809e47c4 in fork_exit (
>>    callout=0xffffffff82118200 <txg_sync_thread>, arg=0xfffff80015676000, 
>>    frame=0xfffffe0859e4ea40) at /usr/src/sys/kern/kern_fork.c:1006
>> #19 0xffffffff80e4a45e in fork_trampoline ()
>>    at /usr/src/sys/amd64/amd64/exception.S:610
>> #20 0x0000000000000000 in ?? ()
>> 
> 
> 
> -- 
> Xin LI <delphij at delphij.net>    https://www.delphij.net/
> FreeBSD - The Power to Serve!           Live free or die
> 


