Random panics with 5.3-REL, SMP
Robert Watson
rwatson at freebsd.org
Wed Nov 24 07:37:02 PST 2004
On Tue, 23 Nov 2004, Hogan Whittall wrote:
> I'm still getting random panics, however. Doesn't appear to be related
> to anything in particular and seems to usually happen after being up for
> 1-2 days. I've attempted to get a coredump but the last panic wedged
> while dumping to disk. I'm going to be out of town for a week and won't
> have access to the box, but if anyone has experienced something like
> this before and knows of a fix, please let me know. Here are the specs
> of the machine:
"Random panics" is a little vague as a starting point, but here are some
thoughts to look at when back from your vacation:
- Using a serial console to the box, you can reliably gather information
without the core dump mechanism working.
- "Random panics" could mean "A lot of seemingly different panics
happening with relative frequency", or it might mean "A few similar
panics, happening at random intervals". It would be useful to clarify
which it is. Recognizing that you may not be familiar with the intimate
details of kernel failure modes, the way one might classify failures as
"similar" is by the nature of the panic and the stack trace leading up
to the panic. Panics usually fall into two forms: an
explicit call to panic() by code that has detected a failure of a kernel
invariant ("this should never happen"), or a page fault ("the kernel
touched some memory it shouldn't have"). Panics typically print a fault
description, such as a pointer dereferenced, or the nature of the
invariant test that triggered. The same message might indicate the
same problem occurring. A stack trace can be generated using the "trace"
command in DDB, and is a subset of the information you might get by
pointing gdb at a core. If the stack traces look similar (especially
with regard to the functions close to the frame where the panic took
place), the failure mode might be regarded as similar also.
Regardless, when reporting panics, the panic line or header of the fault
report are excellent starting points.
- In terms of debugging information, it would be very useful if you could
hook up a serial console, and when a panic occurs, send the output of
"show pcpu" and "show trace". On an SMP box with an SMP kernel, run
"show pcpu" for each cpu, and trace the active threads on each. The
output of "ps" is usually pretty valuable, as it will show what the
system was doing, and if many threads are waiting for something, it
will show what they are waiting for. With file system related panics
or hangs, the output of "show lockedvnods" is often also very useful, as
it will show what file system objects were being actively used, and by
what threads. If running with WITNESS (see below), "show locks" can be
very helpful, as it will assist in understanding and debugging the
synchronization state of the kernel.
- If a bug leads to an eventual panic, the problem caused by the bug will
sometimes be better described if you have some of the kernel debugging
features enabled, for example INVARIANTS and/or WITNESS. Depending on
the performance impact you can tolerate on the box, you might want to
try some features, then others. Features like INVARIANTS may also help
catch the problem earlier, making the problem easier to diagnose.
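
As a rough sketch of the debugging features mentioned above, a custom
kernel configuration might add lines like the following. The option
names are the ones I believe are used in 5.x kernel configs; check them
against the NOTES files for your release before relying on them, and
expect a noticeable performance cost from WITNESS in particular.

```
# Assumed additions to a copy of GENERIC; verify names against NOTES.
makeoptions     DEBUG=-g            # build the kernel with debug symbols
options         KDB                 # kernel debugger framework
options         DDB                 # interactive debugger ("trace", "ps", "show pcpu")
options         BREAK_TO_DEBUGGER   # a serial break drops into the debugger
options         INVARIANTS          # run-time consistency checks
options         INVARIANT_SUPPORT   # support code required by INVARIANTS
options         WITNESS             # lock order verification; enables "show locks"
options         WITNESS_SKIPSPIN    # skip spin mutexes to reduce WITNESS overhead
```

If the performance hit is a concern, one approach is to start with DDB
and INVARIANTS only, and add WITNESS later if the panics look
lock-related.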
I've found the single most useful tool in debugging failure modes is a
serial console, as it provides ready scroll-back to earlier console
output, a fairly reliable ability to enter the debugger using a break, as
well as functionality like remote DDB, logging of DDB output, etc. I've
heard people report very similar benefits and experiences with firewire
debugging, but since I don't really live in the world of firewire, I'll
point at serial ports :-).
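
For reference, a minimal way to direct the console to the first serial
port is a loader setting along these lines. This is a sketch that
assumes the stock sio(4) hints are still in place and that the other
end of the null-modem cable runs a terminal program at a matching
speed; the handbook's serial console chapter has the details.

```
# /boot/loader.conf (assumed setup: console on the first serial port)
console="comconsole"
```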
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org Principal Research Scientist, McAfee Research