svn commit: r332860 - head/sys/kern

Tue Apr 24 17:40:07 UTC 2018

On Tue, Apr 24, 2018 at 01:24:30PM -0400, Jonathan T. Looney wrote:
> On Mon, Apr 23, 2018 at 6:04 PM, John Baldwin <jhb at freebsd.org> wrote:
> >
> > I think this is actually a key question.  In my experience to date I have
> not
> > encountered a large number of post-panic assertion failures.  Given that
> > we already break all locks and disable assertions for locks I'd be curious
> > which assertions are actually failing.  My inclination given my
> experiences
> > to date would be to explicitly ignore those as we do for locking if it is
> > constrained set rather than blacklisting all of them.  However, I would be
> > most interested in seeing some examples of assertions that are failing.
> 
> The latest example (the one that prompted me to finally commit this) is in
> lockmgr_sunlock_try(): 'panic: Assertion (*xp & ~LK_EXCLUSIVE_SPINNERS) ==
> LK_SHARERS_LOCK(1) failed at /usr/src/sys/kern/kern_lock.c:541'
> 
> I don't see any obvious recent changes that would have caused this, so this
> is probably a case where a change to another file suddenly made us trip
> over this assert.
> 
> And, that really illustrates my overall point:

Mine too. :)

Why is anything trying to acquire a lockmgr lock after a panic? What is
the stack? I suspect that CAM is completing non-dump CCBs after a panic,
which can cause deadlocks if the completion handler needs to perform a
TLB shootdown after destroying a mapping, for example. In fact, I had
forgotten that Isilon has some CAM patches which attempt to address this
because of the problems that such deadlocks had caused. I will work on
getting these reviewed and upstreamed.

> most assertions in
> general-use code have limited value after a panic.
>
> We expect developers to write high-quality assertions so we can catch bugs.
> This requires that they understand how their code will be used. However,
> once we've panic'd, many assumptions behind code change and the assertions
> are no longer valid. (And, sometimes, it is difficult for a developer to
> predict how these things will change in a panic situation.) We can either
> play whack-a-mole to modify assertions as we trip over them in our
> post-panic work, or we can switch to an opt-in model where we only check
> assertions which the developer actually intends to run post-panic.
> 
> Playing whack-a-mole seems like a Sisyphean task which will burn out
> developers and/or frustrate people who run INVARIANTS kernels. Switching to
> an opt-in model seems like the better long-term strategy.
> 
> Having said all of that, I am cognizant of at least two things:
> 1) Mark Johnston has done a lot of work in coredumps and thinks there are
> post-panic assertions that have value.
> 2) Until we have both agreement to switch our post-panic assertion paradigm
> and also infrastructure to allow developers to opt in, it probably is not
> wise to disable all assertions by default.
> 
> So, I will follow Mark's suggestions: I will change the default. I will
> also change the code so we print a limited number of failed assertions.

Thanks.

> However, I think that changing the post-panic assertion paradigm is an
> important conversation to have. We want people to run our INVARIANTS
> kernels. And, we want to get high-quality reports from those. I think we
> could better serve those goals by changing the post-panic assertion
> paradigm.
> 
> Jonathan