NULL td passed to propagate_priority() when using xmms...

Tue Nov 4 00:51:10 PST 2003

On Mon, 3 Nov 2003, John Baldwin wrote:

> On 01-Nov-2003 Soren Schmidt wrote:
> > It seems Sean Chittenden wrote:
> >> Howdy.  I'm not sure if this is a ULE bug or a KSE bug, or both, but,
> >> for those interested (this is using ule 1.67, rebuilding world now),
> >> here's my stack.  I couldn't figure out where td was being set to
> >> NULL.  :( Oh!  Where is TD_SET_LOCK defined?  egrep -r didn't turn up
> >> anything.  -sc
> >
> > Its not ULE, I'm running 4BSD and has gotten this on boot for over a
> > week now, rendering -current totally useless...
>
> Having a kernel panic with INVARIANTS on would really help narrow down
> where the bug is.

I found something that causes this bug fairly reliably:
- configure ddb so that db_print_backtrace() is called on panics.
- break the fd driver so that the panic() in fdstrategy() is called on
  floppy accesses.
- attempt to access a floppy so that fdstrategy() is called.
- db_print_backtrace() then does bad things.  It never completes here,
  though it works in other contexts.  Usually it prints only the first
  line or two.  Then quite often ddb is called for a null pointer panic
  in propagate_priority().

More details about the null pointer panic:  This seems to have nothing
to do with scheduling.  propagate_priority() is not called with a null
td of course, but it sometimes follows a null m:

%%%
		/*
		 * Pick up the mutex that td is blocked on.
		 */
		m = td->td_blocked;
		MPASS(m != NULL);

		/*
		 * Check if the thread needs to be moved up on
		 * the blocked chain
		 */
		if (td == TAILQ_FIRST(&m->mtx_blocked)) {
			continue;
		}
%%%

I don't have invariants enabled, so MPASS(m != NULL) doesn't do anything,
but m is null so attempting to load m->mtx_blocked causes a panic.

For the backtrace context, propagate_priority() gets called for attempting
to aquire a lock in softclock().  Tasks like the softclock task get
scheduled despite the system being in panic().  ps seemed to show that the
user process doing the floppy access no longer existed.  I don't know how
that could happen, since the panic() is done in the context of the that
process.

More details about bugs in db_print_backtrace(): Maybe the stack is
messed up.  Attempting to access invalid stack offsets can cause
problems.  My version of db_print_backtrace() has extra code to attempt
not to access invalid offsets, but there is normally no problem since
ddb's trap handler fixes up the problem.  But backtrace() bogusly calls
db_print_backtrace() in non-ddb context and then the longjmp in the trap
handler goes to hyperspace if anywhere.

Bugs tripped over while debugging this: Putting a breakpoint in fdopen()
didn't work, because fd.c:fdopen() conflicts with kern_descrip.c:fdopen().
This was broken in fd.c 1.259.  There are hundreds of similar conflicts
in GENERIC, some for obviously broken things like the same malloc type
being static in several files.

Bruce