Kernel dumps [was Re: possible changes from Panzura]

Thu Jul 11 14:28:04 UTC 2013

On 7/11/13 6:09 AM, Kevin Day wrote:
>>
>> Those sound useful.   Just out of curiosity, however, since we're on the topic of kernel dumps:  Has anyone even looked into the notion of an emergency fall-back network stack to enable remote kernel panic (or system hang) debugging, the way OS X lets you do?  I can't tell you the number of times I've NMI'd a Mac and connected to it remotely in a scenario where everything was totally wedged and just a couple of minutes in kgdb (or now lldb) quickly showed that everything was waiting on a specific lock and the problem became manifestly clear.
>>
>> The feature also lets you scrape a panic'd machine with automation, running some kgdb scripts against it to glean useful information for later analysis vs having to have someone schlep the dump image manually to triage.  It's going to be damn hard to live without this now, and if someone else isn't working on it, that's good to know too!

I could imagine that we could stash away a vimage stack just for this 
purpose.
yould set it up on boot and leave it detached until you need it.

you just need to switch the interfaces over to the new stack on panic 
and put them into 'poll' mode.

Or maybe you'd need more (like pre-allocating mbufs for it to use).

Just an idea.

>
> At a previous employer, we had a system where on a panic it had a totally separate stack capable of just IP/UDP/TFTP and would save its core via TFTP to a server. This isn’t as nice as full remote debugging, but it was a whole lot easier to develop. The caveats I remember were:
>
> 1) We didn’t want to implement ARP, so you had to write the mac address of the “dump server” to the kernel via sysctl before crashing.
> 2) We also didn’t want to have to deal with routing tables, so you had to manually specify what interface to blast packets out to, also via sysctl.
> 3) After a panic we didn’t want to rely on interrupt processing working, so it polled the network interface and blocked whenever it needed to. Since this was an embedded system, it wasn’t too big of a deal - only one network driver had to be hacked to support this. Basically a flag that would switch to “disable normal processing, switch to polled fifos for input and output” until reboot.
> 4) The whole system used only preallocated buffers and its own stack (carved out from memory on boot) so even if the kernel’s malloc was trashed, we could still dump.
>
> I’m not sure this really would scratch your itch, but I believe this took me no more than a day or two to implement. Parts #1 and #2 would be pretty easy, but I’m not sure how generic the kernel could support an emergency network mode that doesn’t require interrupts for every network card out there. Maybe that isn’t as important to you as it was to us.
>
> The whole exercise is much easier if you don’t use TFTP but a custom protocol that doesn’t require the crashing system to receive any packets, if it can just blast away at some random host oblivious if it’s working or not, it’s a lot less code to write.
>
>
> _______________________________________________
> freebsd-hackers at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"
>
>