minidumps are unsafe on amd64

Fri Jan 25 13:04:02 PST 2008

On Fri, Jan 25, 2008 at 08:10:47PM +0000, Robert Watson wrote:
> 
> On Fri, 25 Jan 2008, Scott Long wrote:
> 
Hmm.  Somehow I didn't get your email, only this one.

>> Is this a case where you are manually triggering a dump on a system that 
>> is otherwise running fine?

Not quite.  We were testing how dumps work on a GEOM mirror,
before setting this up on production boxes, when we found that
dumping destroys the mirror itself.  Our testing involves setting
debug.kdb.panic=1 to trigger a panic and a dump.  The test
system has four dual-core CPUs.

>> I thought that crashes already disabled 
>> interrupts and made an attempt to stop other CPUs.  That's why there is 
>> dump-specific code in every storage driver in the first place; it 
>> implements polled i/o so that crashdump i/o can take place with interrupts 
>> disabled.  If it's a case where interrupts aren't actually getting 
>> disabled, then that's one thing. If it's a case where you're trying to fix 
>> something that isn't broken, then I'm very cautious about the added 
>> complexity that you're proposing.
> 
> Unfortunately, we don't really do this today -- we do stop the other CPUs 
> when we enter the debugger, but we restart them when we leave, and the dump 
> code runs outside of the debugger context.  I ran into this problem when 
> working on textdumps, as common storage drivers attempt to acquire locks in 
> their dump path.  Instead of writing out DDB output incrementally 
> block-at-a-time, I have to buffer it all and then generate it at the normal 
> dump point after leaving the debugger.
> 
Yes.  As for interrupts, we don't really disable them any more either:

: static void
: boot(int howto)
: {
: [...]
: 	/* XXX This doesn't disable interrupts any more.  Reconsider? */
: 	splhigh();
: 
: 	if ((howto & (RB_HALT|RB_DUMP)) == RB_DUMP && !cold && !dumping)
: 		doadump();

> In terms of generally improving robustness of the debugging environment, 
> I've been pondering the following:
> 
> - Dump routines run from the KDB context, so that they get the protections
>   associated with running in the debugger.  In particular, they need a more
>   reliable assumption that the rest of the kernel is halted.  I'm a bit
>   surprised we haven't been bitten by this more in the past...
> 
> - A more SMP-safe passage into the debugger, especially from panic().  We
>   should disable interrupts immediately on panic() to prevent preemption on
>   the panicking CPU by an interrupt.  We should write any state to pass into
>   the debugger into a per-CPU buffer to be picked up after kdb_trap() has
>   popped us into the debugger.  The panic message should be printed by KDB,
>   and not using printf(), which is prone to preemption especially on serial
>   consoles.
> 
> - Dump routines pass through a bounds checking block write call.  Right now
>   they directly invoke di->dumper(), and the caller is responsible for not
>   asking for blocks outside the swap partition.  A wrapper on the order of
>   dump_blockwrite() should do the bounds checking to add robustness
>   (obviously, callers should also place their blocks correctly).
> 
> I'm almost certainly not the right person to look at making dumper routines 
> work in KDB, but I can look at improving the reliability of getting into 
> KDB, as well as passing data into it more reliably.  I'm happy to let 
> someone else pick this up and run with it, though, as it will be a ways 
> down on my TODO list for a bit.
> 
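To illustrate the last point, here is a minimal user-space sketch of
such a bounds-checking wrapper.  It is not the kernel code: the struct
is a simplified, hypothetical stand-in for struct dumperinfo, the
dumper signature is reduced, fake_dumper() only models a disk in
memory, and returning ENOSPC for an out-of-range write is my choice
for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified dumper callback (not the kernel's dumper_t). */
typedef int dumper_fn(void *priv, const void *virt, int64_t offset,
    size_t length);

/* Hypothetical stand-in for struct dumperinfo. */
struct dumperinfo_sketch {
	dumper_fn	*dumper;	/* driver's polled write routine */
	void		*priv;		/* driver private data */
	int64_t		 mediaoffset;	/* first byte of the dump area */
	int64_t		 mediasize;	/* size of the dump area, in bytes */
};

/*
 * Refuse any write that is not entirely inside
 * [mediaoffset, mediaoffset + mediasize); otherwise call the driver.
 * Assumes mediaoffset and mediasize are non-negative.
 */
static int
dump_blockwrite(struct dumperinfo_sketch *di, const void *virt,
    int64_t offset, size_t length)
{
	if (offset < di->mediaoffset)
		return (ENOSPC);
	if ((uint64_t)length > (uint64_t)di->mediasize)
		return (ENOSPC);
	/* Written this way so the comparison cannot overflow. */
	if (offset - di->mediaoffset > di->mediasize - (int64_t)length)
		return (ENOSPC);
	return (di->dumper(di->priv, virt, offset, length));
}

/* In-memory "disk" used only to exercise the wrapper. */
static unsigned char fake_disk[4096];

static int
fake_dumper(void *priv, const void *virt, int64_t offset, size_t length)
{
	(void)priv;
	/* The model disk itself must never be asked to write out of range. */
	assert(offset >= 0 && (uint64_t)offset + length <= sizeof(fake_disk));
	memcpy(fake_disk + offset, virt, length);
	return (0);
}
```

With mediaoffset = 1024 and mediasize = 2048, a 512-byte write at
offset 2560 is accepted (it ends exactly at the end of the dump area),
while the same write at offset 512 or 2561 is rejected before the
driver is ever called.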
While experimenting, I also found that to safely "call doadump"
from the debugger, and then be able to "continue", one needs this:

%%%
Index: kern_shutdown.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_shutdown.c,v
retrieving revision 1.188
diff -u -p -r1.188 kern_shutdown.c
--- kern_shutdown.c	19 Jan 2008 17:36:22 -0000	1.188
+++ kern_shutdown.c	25 Jan 2008 20:54:34 -0000
@@ -249,6 +249,7 @@ doadump(void)
 	else
 #endif
 		dumpsys(&dumper);
+	dumping--;
 }
 
 static int
%%%

Some code (ata(4) in my case) behaves slightly differently
while "dumping" is set, e.g. it does NOT acquire locks, so the
flag has to be cleared again before the system is continued.
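A toy user-space model of the effect, for illustration only.  It
assumes doadump() increments "dumping" on entry (which is what the
"dumping--" above undoes); driver_io() and the lock counter are
hypothetical and merely stand in for a driver that skips locking
while a dump is in progress.

```c
/* Models the kernel's global "dumping" counter. */
static int dumping;

/* Counts how often the hypothetical driver took its lock. */
static int locks_taken;

/* Hypothetical driver I/O path: no locking while dumping is set. */
static void
driver_io(void)
{
	if (!dumping)
		locks_taken++;	/* normal path: lock, do I/O, unlock */
	/* dump path: polled I/O without locks */
}

/* Model of doadump() without the patch: the flag stays set. */
static void
doadump_unpatched(void)
{
	dumping++;
	driver_io();		/* the dump itself */
}

/* Model of doadump() with the patch: the flag is cleared on return. */
static void
doadump_patched(void)
{
	dumping++;
	driver_io();		/* the dump itself */
	dumping--;
}
```

In this model, I/O issued after doadump_unpatched() returns still
takes the lockless dump path, because "dumping" is still 1.  After
doadump_patched() the counter is back at 0 and the driver locks
normally again.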


Cheers,
-- 
Ruslan Ermilov
ru at FreeBSD.org
FreeBSD committer