timeout(9), mutexes, and races

Wed Jun 18 21:29:45 PDT 2003

The other day, I had a panic with my 5.1-RELEASE kernel when I
removed my Cardbus NIC (3Com 3c575B Fast Etherlink XL, using the
xl driver). The traceback indicated a pretty uninteresting race
between a timeout routine (xl_stats_update) and the card being
detached. xl_stats_update was being called after the device's
softc had been freed.

I'm not sure exactly what the problem is, but the following
caught my eye in kern_timeout.c:

                                mtx_unlock_spin(&callout_lock); 
                                if (!(c_flags & CALLOUT_MPSAFE))
                                        mtx_lock(&Giant);

The timeout(9) callouts never have the CALLOUT_MPSAFE flag set,
so we always try to acquire Giant here. But there's a gap between
dropping callout_lock and acquiring Giant where we can be
preempted (mtx_lock is specifically documented as allowing this),
so the cardbus interrupt could be serviced at this point. The
detach would then remove the callout entry, but the callout
function would still be called here once Giant is finally
acquired.
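
To make the suspected sequence concrete, here is a rough user-space
model of it. This is only a sketch, not the kernel code: every name
in it (softc_model, callout_model, dispatch_model, detach_model,
run_scenario) is made up for illustration, it is single-threaded,
and it assumes the dispatcher has already taken the entry off its
queue before dropping the lock, so a cancel in the gap has nothing
left to remove. The "free" is modeled as a flag so the stale call
can be counted instead of crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real kernel types. */
struct softc_model {
    bool freed;       /* set instead of calling free(), keeps the model defined */
    int  stats_runs;  /* handler runs that saw a live softc */
};

struct callout_model {
    void (*func)(void *);
    void *arg;
    bool  pending;    /* assumption: true while the entry is still queued */
};

static int stale_calls;  /* handler runs that saw a "freed" softc */

static void stats_update_model(void *arg)
{
    struct softc_model *sc = arg;

    if (sc->freed)
        stale_calls++;     /* in a kernel this would be the use-after-free */
    else
        sc->stats_runs++;
}

/* Detach path: cancel the callout if still queued, then release the softc. */
static bool detach_model(struct callout_model *c, struct softc_model *sc)
{
    bool cancelled = c->pending;

    c->pending = false;
    sc->freed = true;
    return cancelled;
}

/* One dispatch; optionally lets the detach run inside the gap. */
static void dispatch_model(struct callout_model *c, struct softc_model *sc,
                           bool detach_in_window)
{
    assert(c->pending);

    /* "callout_lock held": snapshot the handler and dequeue the entry. */
    void (*func)(void *) = c->func;
    void *arg = c->arg;
    c->pending = false;

    /* "callout_lock dropped, Giant not yet held": the gap. */
    if (detach_in_window)
        (void)detach_model(c, sc);  /* nothing left to cancel */

    /* "Giant acquired": the snapshot is called regardless. */
    func(arg);
}

/* Returns how many times the handler ran against a "freed" softc. */
int run_scenario(bool detach_in_window)
{
    struct softc_model sc = { false, 0 };
    struct callout_model c = { stats_update_model, &sc, true };

    stale_calls = 0;
    dispatch_model(&c, &sc, detach_in_window);
    return stale_calls;
}
```

In this model, run_scenario(false) reports no stale calls, while
run_scenario(true) reports one, which is the shape of what the
traceback suggested.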

Would the solution be to try to detect this condition (callout
removed in an intervening thread) somehow? In the new callout
interface, clients are responsible for allocating the callout
struct, so it may not even exist by the time we get to check
it. The situation seems to be even worse for CALLOUT_MPSAFE
entries: checking before the mutex has been locked wouldn't
help, and if the mutex isn't Giant, we have no way of knowing
which mutex it would be...
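
One way the detection idea might look, purely as a thought
experiment: keep the "currently being dispatched" record inside the
callout subsystem rather than in the client's struct, and let the
cancel routine compare pointers against it. The dispatcher then
never has to look inside a struct that may be gone. Everything below
is hypothetical (curr_running, stop_model, dispatch_begin,
dispatch_end, stop_at are names I invented), it is a single-threaded
user-space model, and it says nothing about which mutex an MPSAFE
handler would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a client-allocated callout. */
struct callout_model {
    void (*func)(void *);
    void *arg;
    bool  pending;    /* assumption: true while the entry is still queued */
};

/* Owned by the subsystem; a real kernel would guard it with callout_lock. */
static struct callout_model *curr_running;

enum stop_result { STOP_CANCELLED, STOP_IN_FLIGHT, STOP_IDLE };
enum phase { BEFORE_DISPATCH, IN_WINDOW, AFTER_DISPATCH };

/* Cancel routine: tells the caller whether the handler may still run. */
static enum stop_result stop_model(struct callout_model *c)
{
    if (c->pending) {
        c->pending = false;
        return STOP_CANCELLED;   /* removed before dispatch: safe to free */
    }
    if (curr_running == c)       /* pointer comparison only, no dereference */
        return STOP_IN_FLIGHT;   /* caller must wait before freeing */
    return STOP_IDLE;            /* not queued and not running */
}

/* Dispatcher side: record the entry before the lock would be dropped. */
static void dispatch_begin(struct callout_model *c)
{
    assert(c->pending);
    c->pending = false;
    curr_running = c;
}

/* Dispatcher side: clear the record once the handler has returned. */
static void dispatch_end(void)
{
    curr_running = NULL;
}

static void noop_handler(void *arg)
{
    (void)arg;
}

/* Runs one dispatch and issues the cancel at the requested point. */
enum stop_result stop_at(enum phase when)
{
    struct callout_model c = { noop_handler, NULL, true };
    enum stop_result r = STOP_IDLE;

    curr_running = NULL;
    if (when == BEFORE_DISPATCH)
        return stop_model(&c);

    dispatch_begin(&c);
    if (when == IN_WINDOW)
        r = stop_model(&c);      /* the detach arriving inside the gap */
    c.func(c.arg);
    dispatch_end();

    if (when == AFTER_DISPATCH)
        r = stop_model(&c);
    return r;
}
```

A cancel that lands in the gap comes back as STOP_IN_FLIGHT here, so
a detach routine would at least know it cannot free the softc yet.
How it should then wait for the handler to finish is left open.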

Or is there another way to solve this? Or am I completely
missing something and seeing the wrong problem? :)

Any ideas would be appreciated.

Eric