debugging a hung kernel

Mon May 28 10:29:36 UTC 2007

Redirecting from @net to @stable. Please
remove @net from future mails.

On Monday 28 May 2007 11:54, Robert Watson wrote:
> On Mon, 28 May 2007, Julian Elischer wrote:
> > Nikos Vassiliadis wrote:
> >> On Monday 28 May 2007 10:57, Julian Elischer wrote:
> >>> Nikos Vassiliadis wrote:
> >>>> On Tuesday 22 May 2007 10:06, I wrote:
> >>>>> Hello everybody,
> >>>>>
> >>>>>  I just managed to lock my box and I want to report it
> >>>
> >>> define "lock"?
> >>>
> >>> Does it still respond to <CTL><ALT><ESC> on the keyboard?
> >>
> >> No, but I was trying to break to the debugger with
> >> <Control><PrintScreen> myself. I assume that it is
> >> equivalent to the combination you wrote, or not?
> >>
> >>> (Assuming you have the debugger in your kernel?).
> >>
> >> Yes, I have included my kernel configuration, see below.
> >>
> >>> Does it still ping?
> >>
> >> No, ARP does not work either.
> >
> > nasty.. do you have IPMI? sometimes that allows you to generate an NMI
> > that could theoretically be made to drop to the debugger.

I have a Dell PowerEdge 750 sitting at work, which
I think has IPMI.

I'll be able to try a few things next week, since
I will be off work for this week.

> >
> > I've not had success with that but I have heard others have.
>
> An increasing number of server motherboards have an NMI button on the
> motherboard, possibly exposed outside the case, but generally not.
>
> I've not tested it in over a year, but a few years ago I added an
> MP_WATCHDOG kernel option that causes one of the CPUs in an SMP system
> to become a dedicated watchdog CPU, checking to see if the OS is alive
> enough to process timer ticks.  If a counter isn't updated, it
> generates an NMI to the debugger from the watchdog CPU.  The idea here
> is that, as the number of CPUs increases, the cost of dedicating a CPU
> for debugging stuff gets lower. However, there have been quite a few
> scheduler changes in the last few years, and it's possible that the
> watchdog no longer properly excludes other work from being scheduled,
> and that further work is required.  In particular, I believe it relies
> on 4BSD's "pull" scheduling model and a lack of per-CPU workers, so the
> mechanism may require some rethinking.
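
To make sure I understand the idea, here is how I read the counter
check described above, written as a small user-space simulation. It
is only a sketch and not the MP_WATCHDOG code: the names Heartbeat,
Watchdog, simulate and max_stalled_checks are made up for
illustration, and the place where the real mechanism would raise an
NMI is reduced to returning True.

```python
# User-space simulation only; this is NOT FreeBSD kernel code.
# All names below are hypothetical and chosen for illustration.


class Heartbeat:
    """Counter that the 'OS' increments for every timer tick it processes."""

    def __init__(self):
        self.ticks = 0

    def tick(self):
        self.ticks += 1


class Watchdog:
    """Plays the role of the dedicated watchdog CPU."""

    def __init__(self, heartbeat, max_stalled_checks):
        self.heartbeat = heartbeat
        self.max_stalled_checks = max_stalled_checks
        self.last_seen = heartbeat.ticks
        self.stalled_checks = 0

    def check(self):
        """Return True once the counter has been stuck for
        max_stalled_checks consecutive checks. This is the point where
        the real mechanism would send an NMI to enter the debugger."""
        if self.heartbeat.ticks != self.last_seen:
            # The counter moved, so the system is still processing ticks.
            self.last_seen = self.heartbeat.ticks
            self.stalled_checks = 0
            return False
        self.stalled_checks += 1
        return self.stalled_checks >= self.max_stalled_checks


def simulate(alive_pattern, max_stalled_checks=3):
    """alive_pattern holds one boolean per watchdog check; True means a
    tick was processed since the previous check. Returns the index of
    the check at which the watchdog fires, or None if it never fires."""
    heartbeat = Heartbeat()
    watchdog = Watchdog(heartbeat, max_stalled_checks)
    for index, alive in enumerate(alive_pattern):
        if alive:
            heartbeat.tick()
        if watchdog.check():
            return index
    return None
```

For example, simulate([True, True, False, False, False, False]) models
a system that hangs after two ticks. Requiring several stalled checks
in a row, instead of firing on the first one, is my own choice here to
avoid a false alarm from a single late tick.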

Unfortunately, I do not have an SMP system available. Is there a
mechanism which I can use to schedule a break to the debugger
after n seconds or events?

I am looking into whether ichwd(4) can help, though that needs
investigation since I have not used watchdog facilities before.

Thanks Julian & Robert.