svn commit: r332860 - head/sys/kern

Mon Apr 23 15:18:27 UTC 2018

Hi Mark,

Let me start by saying that I appreciate your well-reasoned response. (I
think) I understand your reasoning, I appreciate your well-explained
argument, and I respect your opinion. I just wanted to make that clear up
front.

On Sun, Apr 22, 2018 at 1:11 PM, Mark Johnston <markj at freebsd.org> wrote:
>
> > All too often, my ability to debug assertion violations is hindered
because
> > the system trips over yet another assertion while dumping the core. If
we
> > skip the assertion, nothing bad happens. (The post-panic debugging code
> > already needs to deal with systems that are inconsistent, and it does a
> > pretty good job at it.)
>
> I think we make a decent effort to fix such problems as they arise, but
> you are declaring defeat on behalf of everyone. Did you make some effort
> to fix or report these issues before resorting to the more drastic
> measure taken here?

We try to report or fix them as they arise. However, you don't know there
is a problem until you actually run into it. And, you don't run into the
problem until you can't get a core dump due to the assertion.

(And, with elusive problems, it isn't always easy to duplicate them. So,
fixing the assertion is sometimes "too late".)

> > On the other hand, I really am not sure what you are worried might
happen
> > if we skip checking assertions after we've already panic'd. As far as I
can
> > tell, the likely worst case is that we hit a true panic of some kind. In
> > that case, we're no worse off than before.
> >
> > I think the one obvious exception is when we're purposely trying to
> > validate the post-panic debugging code. In that case, you can change the
> > sysctl/tunable to enable troubleshooting.
>
> What about a user whose test system panics and fails to dump? With
> assertions enabled, a developer has a better chance of spotting the
> problem. Now we need at least one extra round trip to the user to
> diagnose the problem, which may not be readily reproducible in the first
> place.

That's true. However, this is equally true in the other direction: Prior to
this change, when a user tripped over an assertion and was unable to get a
coredump due to a second post-panic assertion, it took (at least) another
round-trip to get a coredump.

First, without the capability to ignore assertions after a panic
(introduced by this commit), you would need to fix the actual assertion to
enable the user to get a coredump. At minimum, I think this change has
value in that case. This change gives you a mechanism to get a coredump
without requiring that you fix the assertion and get the user to recompile
with the patch.

But, moreover, if we change the default back to panic'ing on a second
assertion, we will hamper our ability to get usable reports about elusive
bugs. If we leave the default "as is", it won't take an extra round-trip to
tell the user how to get a coredump. If we change the default (or, perhaps
more correctly, "restore the prior default"), we will still need a second
round-trip to get coredumps. That makes it tough to chase elusive bugs.

> > I would honestly appreciate someone explaining the dangers in disabling
a
> > response to assertion violations after we've already panic'd and are
simply
> > trying to troubleshoot, because they are not obvious to me. But, I could
> > simply be missing them.
>
> The assertions help identify code that is being executed during a dump
> when it shouldn't be. In general we rely on users to opt in to running
> INVARIANTS kernels because developers don't catch every single bug. With
> this change it's harder to be confident in the kernel dump code. (Or in
> any post-panic debugging code for that matter.)

I can appreciate that. I am generally skeptical of the value of assertions
in general-use code after a panic, since we already know the system is in
an inconsistent/unexpected state. And, it is hard to predict all the
various ways it could possibly be broken. However, I do recognize that
there is code which specifically is written to run post-panic, and which
has assertions which SHOULD be true, even after a panic.

> I dislike the change and would prefer the default to be inverted. At the
> very least I think we should print the assertion message rather than
> returning silently from kassert_panic().

I still think this change has value (as described above). I can understand
the argument for changing the default. In fact, after thinking about your
email, I'm leaning towards doing that. But, I want to ponder it more today.

If we leave the default alone, I agree we should print the assertion
message (albeit with some rate limit).

Jonathan