Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write

Fri May 14 14:16:49 UTC 2010

> > Hmm.  You could try changing the code to not do a nested panic in that
> > case.  You would update subr_turnstile.c to just return if panicstr is
> > not NULL rather than calling panic.  However, there is still a good
> > chance you will end up deadlocking in that case.  I have another patch I
> > can send you next week that prevents blocking on mutexes during a panic
> > which may also help.
>
> It would be instructive to know exactly why we were in turnstile(9) but 
> it's likely due to mtx contention.
>

> AIX has some code at the beginning of all the locking operations to avoid 
> taking locks if we were running code out of kdb, though getting that worked 
> out was slightly tricky with our variant of mtx_assert(9).  I seem to recall
> there was also some "lockbusting" code that forcibly reset all owned locks 
> to have no owner, at least in some paths.

> Given that the system is single-cpu and should be single-threaded when 
> dumping, this seems to me to be something worth working through to get 
> more reliable dumps.  Except for mtx_assert(9) I can't think of a reason 
> to take locks once we start dumping or when in the debugger.
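
  To make the two ideas in the quoted text concrete (skipping lock
acquisition once a panic is in progress, and "lockbusting"), here is a
minimal user-space sketch. It is not FreeBSD or AIX kernel code: the
toy_lock structure and the toy_lock_* functions are hypothetical names made
up for illustration, and panicstr is only a local stand-in for the kernel
variable of the same name mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: non-NULL once a panic has started. */
static const char *panicstr = NULL;

/* Toy lock, purely illustrative; not a real kernel mutex. */
struct toy_lock {
    const char *name;
    int owner;              /* 0 = unowned, otherwise an owner id */
};

/*
 * Returns true if the caller may proceed. During a panic the lock is
 * skipped entirely, so the dump path can never block on it.
 */
static bool
toy_lock_acquire(struct toy_lock *lk, int self)
{
    assert(lk != NULL);
    if (panicstr != NULL)
        return true;        /* panicking: do not take or wait for locks */
    if (lk->owner != 0 && lk->owner != self)
        return false;       /* contended; a real kernel would block here */
    lk->owner = self;
    return true;
}

/* Release is a no-op during a panic, mirroring the skipped acquire. */
static void
toy_lock_release(struct toy_lock *lk, int self)
{
    assert(lk != NULL);
    if (panicstr != NULL)
        return;
    if (lk->owner == self)
        lk->owner = 0;
}

/* "Lockbusting": forcibly mark every lock in the array as unowned. */
static void
toy_lock_bust(struct toy_lock *locks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        locks[i].owner = 0;
}
```

  With panicstr set, toy_lock_acquire() succeeds even on a lock held by
another owner and leaves that owner untouched. That is only safe under the
single-threaded assumption stated in the quoted text.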

  As an aside, this is a quad-core, single-package CPU (an X3363). On both
this box and a similar one with an X5470, console messages continue to
print out after "the system has been halted - press any key to reboot".
In particular, the shutdown makes a bunch of the "behind the scenes"
management stuff like the virtual keyboard and monitor appear. Plugging or
unplugging USB devices will go through the whole deal of detecting and
making their service available.

  I know the other CPUs are considered to still be running (hence the
"halting other CPUs" when you press a key to reboot), but this is the
first time I've seen device detection, attachment, etc. show up on the
console after a shutdown.

  Is this behavior to be expected, or is it as unexpected as it was to
me? Systems are Dell PowerEdge R300s, 8-STABLE amd64.

> As an aside, with terribly corrupted locks I've seen double panics when the 
> attempt to print the lock name faulted in strlen(9) called for printf(9), 
> due to a bad lockname pointer.  We have been able to get enough info off 
> these crashes to debug them, but it's useful to remember that the system 
> may be in a very unstable state depending on why it panics.

  True. In these crashes, the system is doing essentially nothing except
the one application (which, unfortunately, I don't have the source code
for). The second crash happened right after booting the system, logging in,
and firing off the application. It left an identical footprint (other than
the 0x10 byte offset due to a recompiled kernel) from the first one, where
the system had been up for 13+ hours.

  So, in this case I don't think there was a bunch of corruption piling up
which triggered the fault, but instead the one simple operation and right
away - splat!

  As I mentioned in the original posting, I'd be glad to give a developer
complete access to the system via the remote console (Dell DRAC 5 web
interface) and to the underlying FreeBSD if it'll help pin down the
problem.

  Another thing I could do (it would take a couple of days until I could
get someone to the site) is to test this using a bge port instead of the
bce one. That might help pin it down to either something in the
bce-specific code path, or somewhere else in the stack.

	Thanks,
        Terry Kennedy             http://www.tmk.com
        terry at tmk.com             New York, NY USA