cvs commit: src/sys/alpha/alpha support.s src/sys/i386/i386 swtch.s src/sys/kern kern_shutdown.c src/sys/sys systm.h

Tue Jan 20 18:43:13 PST 2004

On Tue, 20 Jan 2004, Bill Paul wrote:

> [abuse of __FILE__ and __LINE__ in panic()]

> > >   Ideally a traceback should be printed too, any takers ?
> >
> > This can be obtained by running a debugger on the panic dump.  Line
> > numbers and source file names can also be printed by the debugger
> > if the executable has at least line numbers in its debugging info.
>
> That's fine for developers who always run their systems with crash
> dumps enabled and posess sufficient debugger fu to analyze them. What
> about ordinary users who encounter a kernel panic due to a trap
> and need to report it?

This is close to the best possible method for ordinary users.  I rarely
use it.  (I prefer to debug dynamic kernels).  But to the extent that
automatic crash dumps work, ordinary users always have them and you
just have to extract the info from them (the users).

> It would save wear and tear on everyone's
> nerves (_ESPECIALLY_ mine) if they could just send in a traceback
> rather than developers being forced to do the usual "go back and
> do nm ${KERNEL} and tell us what you see" exchange.

This is much more to ask of ordinary users.  The traceback is only
easy to provide if it is printed to a serial of other external console,
or recorded on disk, or someone fixes backtrace() and the message
buffer manages to live across panic reboots.  Otherwise the user has
to watch the panic and transcribe the data.  This is too much to ask
for, especially for verbose panic messages with lots of normally-useless
data like the the trap_fatal() register dump and traces with stack
frames for recursive panics (and now even line numbers for recursive
panics).  If the data can be recorded on disk, then panic dumps should
work and already contain the data (except the implementation of
backtrace() is of such quality that it doesn't write the data to the
message buffer).

> [...]
>
> > Verbose panic messages, and lots of code to print out values of variables
> > just before panicing, are another mistake.  Short panic messages were
> > good enough when debuggers were primitive or nonexistent.  Verbose panic
> > messages are even less needed now.
>
> Ok, hang on just a minute.
>
> Let's be clear here. Adding linenum/sourcefile info to all panic()
> calls is kind of pointless, I agree. (If you're too lazy to grep
> the source for the panic message, you're just no damn good.) However,
> there are times when printing the contents of some variables in
> the panic string is _INCREDIBLY_ _USEFUL_.

Those times are mostly when you have narrowed down the cause of a new
panic and know what to print.

> Terse panic messages
> are annoying -- for that matter so are terse error messages in general.
> Error messages are meant to a) inform the user that they need to
> perform some action to correct a problem and b) alert a software
> developer to a potential bug and, hopefully, how to fix it. It is
> not enough to note that an error occured: you have to explain
> _WHY_ it occured.

Panic messages are special because the user is not expected to understand
or debug them.

> Really bad panic message:
>
> 	panic("mbuf is corrupt (but I'm not telling you exactly "
> 	    "what's corrupt about it because god forbid I should "
> 	    "make life easier for you)");

Not really bad panic message:

	/*
	 * This message intentionally kept short so that it can be
	 * reported easily (especially if it must be transcribed
	 * manually).
	 */
	panic("mbuf_frobnicate: mbuf is corrupt");

> Good panic message:
>
> 	panic("mbuf is corrupt: mbuf %p has m_len of %d and m_flags of 0x%x "
> 	    "but m_data is NULL", m, m->m_len, m->m_flags);
>
> By showing the contents of the suspicious variable/structure, you give
> a developer some clue as to what was going on in the system when the
> variable/structure was trashed. Maybe m_len is the size of a protocol
> header that points to a particular protocol module being suspect.
> Maybe m_flags has M_DONGS set which points to code that Alfred wrote.
> It may not make sense to dump out every single piece of available state,
> but not providing any state at all is just damn rude.

You could get lucky, but especially for memory corruption problems it
is unlikely that the memory contents points to the cause of the corruption
that easily.

> Could you determine this from the crash dump? Sure -- but do you have
> any idea how much of a pain in the ass that can be, especially when
> the crash occurs on a system that isn't yours? By default, crash
> dumps are as large as physical memory, and most systems now have at
> least 128MB of RAM. The average user does not know how to analyze
> a crash dump, which means even if they get one, they won't be able to
> do anything with it, except send it to me. That means I'll have people
> throwing 128MB files at me, which I can't even analyze unless I happen
> to have a system handy running _exactly_ the same kernel as they are.

I have defense against the large files (a 56K connection :-).  You can
write a script to run gdb on their crash dump (if they have one) much
more easily than you can tell them how to run gdb.

> This is an amazing amount of trouble to go through to determine a
> couple of pieces of info that could just as easily have been printed
> on the console, where the user could copy them down and include them
> in an e-mail.

.. if the couple of pieces of info are relevant and aren't hidden in
many irrelevant lines of debugging info.

> Anyway, to reiterate: I don't think the lineno/sourcefile additions
> to panic() are worth the bloat. Tracebacks on the other hand, would
> be worth their weight in gold bikesheds.

I agree about tracebacks.  We already have them, but not in a form
useful to almost anyone without a serial console.

Bruce