skipping locks, mutex_owned, usb

Tue Aug 23 12:09:19 UTC 2011

Yes, the subject sounds quite hairy, so please let me try to explain it.
First, let's consider one concrete function:

static int
ukbd_poll(keyboard_t *kbd, int on)
{
        struct ukbd_softc *sc = kbd->kb_data;

        if (!mtx_owned(&Giant)) {
                /* XXX cludge */
                int retval;
                mtx_lock(&Giant);
                retval = ukbd_poll(kbd, on);
                mtx_unlock(&Giant);
                return (retval);
        }

        if (on) {
                sc->sc_flags |= UKBD_FLAG_POLLING;
                sc->sc_poll_thread = curthread;
        } else {
                sc->sc_flags &= ~UKBD_FLAG_POLLING;
                ukbd_start_timer(sc);   /* start timer */
        }
        return (0);
}

This "XXX cludge" [sic] pattern is scattered around a few functions in the ukbd
code and perhaps other usb code:
func()
{
	if (!mtx_owned(&Giant)) {
		mtx_lock(&Giant);
                func();
                mtx_unlock(&Giant);
		return;
	}

	// etc ...
}

I have a few question directly and indirectly related to this pattern.

I.  [Why] do we need this pattern?
Can the code be re-written in a smarter (if not to say proper) way?

II.  ukbd_poll() in particular can be called from the kdb context (kdb_trap ->
db_trap -> db_command_loop -> etc).  What would happen if the Giant is held by a
thread on some other CPU (which would be hard-stopped by kdp_trap)?  Won't we get
a lockup here?
So shouldn't this code actually check for kdb_active?

III.  With my stop_scheduler_on_panic patch ukbd_poll() produces infinite chains
of 'infinite' recursion because mtx_owned() always returns false.  This is because
I skip all lock/unlock/etc operations in the post-panic context.  I think that
it's a good philosophical question: what operations like mtx_owned(),
mtx_recursed(), mtx_trylock() 'should' return when we actually act as if no locks
exist at all?

My first knee-jerk reaction was to change mtx_owned() to return true in this
"lock-less" context, but _hypothetically_ there could exist some symmetric code
that does something like:
func()
{
	if (mtx_owned(&Giant)) {
		mtx_unlock(&Giant);
                func();
                mtx_lock(&Giant);
		return;
	}

	// etc ...

What do you think about this problem?
Should we re-write ukbd_poll() (and possibly usb code) or should mutex_owned() be
adjusted?

That question III actually brings another thought: perhaps the whole of idea of
skipping locks in a certain context is not a correct direction?
Perhaps, instead we should identify all the code that could be legitimately
executed in the after-panic and/or kdb contexts and make that could explicitly
aware of its execution context.  That is, instead of trying to make _lock,
_unlock, _owned, _trylock, etc do the right thing auto-magically, we should try to
make the calling code check panicstr and kdb_active and do the right thing on that
level (which would include not invoking those lock-related operations or other
inappropriate operations).

Thank you very much in advance for your insights and help!
-- 
Andriy Gapon