System, diagnose thyself: auto-documentation for crashes
John Baldwin
jhb at freebsd.org
Fri Aug 29 19:12:46 UTC 2008
On Friday 29 August 2008 11:13:57 am Kirk Strauser wrote:
> I was having flaky system problems that were driving me to
> distraction. Yesterday, I finally got a panic message with an
> instruction pointer, used addr2line to see that the failure was in
> uma_zfree_internal, searched Google, and learned that it was probably
> due to bad RAM. Half an hour later, memtest86 found the defective
> stick and the problem was solved.
>
> This led me to thinking, though: the OS already had all the
> information needed to figure out where the problem was. If there had
> been an explanation inside that function definition, FreeBSD could
> have automatically gone to the file, searched for that explanation,
> and told me why my system had probably crashed.
>
> I propose that we:
>
> 1) Settle on a standard comment format for metainformation. There are
> already standards like Doxygen if we don't want to roll our own.
>
> 2) Write a program that takes an instruction pointer and outputs the
> comment for the associated function.
>
> 3) Modify /etc/rc.d/savecore to run the program from #2.
>
> For instance, suppose the comments in sys/vm/uma_core.c looked like:
>
> /*
>  * Frees an item to an INTERNAL zone or allocates a free bucket
>  *
>  * Arguments:
>  *      zone   The zone to free to
>  *      item   The item we're freeing
>  *      udata  User supplied data for the dtor
>  *      skip   Skip dtors and finis
>  *
>  * Failure:
>  *      Failures in this function are commonly due to defective RAM.
>  */
> static void
> uma_zfree_internal(uma_zone_t zone, void *item, void *udata,
>     enum zfreeskip skip, int flags)
> {
>         ...
> }
>
> If I'd seen that failure message in my syslog, I would have avoided a
> few days of teeth gnashing. What do you think? I think something
> like this could be extremely useful. Benefits:
>
> - There would be zero impact on performance because it would only
> touch comments and not any running code whatsoever.
> - It would require minimal work.
> - It could be done incrementally. Document known common failure
> points and add others with time.
> - It wouldn't affect any other systems.
See /usr/sbin/crashinfo for a start. I have patches to enable it
from /etc/rc.d/savecore after saving a crash dump (still need to test them
though).
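
For illustration only, here is a rough sketch of what step 2 of the quoted
proposal might look like. It is not how crashinfo works. It assumes a symbol
table already parsed into sorted (address, name) pairs (for example from
"nm -n" output), source that follows the style(9) habit of putting the
function name at the start of a line, and the "Failure:" comment heading
from the quoted example. The names function_for_address and failure_note,
and the addresses in SYMBOLS, are made up for this sketch.

```python
import bisect
import re

def function_for_address(symbols, ip):
    """Return the name of the function containing ip, or None.

    symbols is a list of (start_address, name) tuples sorted by address.
    This input format is an assumption made for the sketch.
    """
    starts = [addr for addr, _ in symbols]
    # Index of the last symbol whose start address is <= ip.
    i = bisect.bisect_right(starts, ip) - 1
    return symbols[i][1] if i >= 0 else None

def failure_note(source, func):
    """Return the 'Failure:' text from the block comment directly above
    the definition of func in source, or None if there is none."""
    # style(9) puts the function name at the start of its own line.
    m = re.search(r"^%s\(" % re.escape(func), source, re.MULTILINE)
    if m is None:
        return None
    end = source.rfind("*/", 0, m.start())
    if end < 0:
        return None
    start = source.rfind("/*", 0, end)
    if start < 0:
        return None
    # Only the return-type line may sit between the comment and the name;
    # otherwise the comment belongs to something else.
    between = source[end + 2:m.start()]
    if any(ch in between for ch in "{};"):
        return None
    note = []
    in_section = False
    for raw in source[start + 2:end].splitlines():
        line = raw.strip().lstrip("*").strip()
        if line == "Failure:":
            in_section = True
        elif in_section:
            # Stop at a blank comment line or the next "Heading:".
            if not line or line.endswith(":"):
                break
            note.append(line)
    return " ".join(note) if note else None

# Hypothetical example data: the addresses are invented.
SYMBOLS = [
    (0xC0800000, "uma_zalloc_internal"),
    (0xC0800400, "uma_zfree_internal"),
    (0xC0800900, "zone_drain"),
]

SOURCE = """\
/*
 * Frees an item to an INTERNAL zone or allocates a free bucket
 *
 * Failure:
 *      Failures in this function are commonly due to defective RAM.
 */
static void
uma_zfree_internal(uma_zone_t zone, void *item, void *udata,
    enum zfreeskip skip, int flags)
{
}
"""

if __name__ == "__main__":
    name = function_for_address(SYMBOLS, 0xC0800432)
    print(name, "->", failure_note(SOURCE, name))
```

A real tool would still need to find the right source file for the symbol
(addr2line reports it when debug information is available) and to cope with
functions that have no such comment, which this sketch reports as None.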
--
John Baldwin