net80211 race conditions seen in -HEAD

PseudoCylon moonlightakkiy at
Wed Jan 25 14:56:03 UTC 2012

> Hi,
> I've noticed some kernel panics in net80211/ath in -HEAD. It in all
> instances boils down to a now-invalid ieee80211_node - either it's
> partially allocated/copied, or it's been recently freed.
> This became increasingly obvious when doing DFS CAC, as the kernel was now
> changing the channel quite frequently on me whilst simulating/processing
> radar events. I've since found I can mostly reproduce it in the lab (when
> surrounded by ridiculous levels of RX interference traffic, triggering all
> kinds of events) whilst creating/destroying VAPs.
> Now that I have debugging code in place (which as a side effect makes it
> very difficult now to cause a crash, let alone tickle the race condition)
> it's glaringly obvious what's going on.
> There are five contexts in which stuff can occur, at least in the
> net80211/ath case:
> * the swi (ie ath_intr(), ath_beacon_proc);
> * the ath taskqueue;
> * the net80211 taskqueue;
> * the ioctl() context, coming up from a userland process;
> * a callout running in the clock thread.
> Now, callouts should _hopefully_ be grabbing and releasing locks correctly.
> We've found a few spots where they weren't (leading to quite silly state
> races and crashes.)
> I'm going to ignore the obvious possible problems with multiple concurrent
> processes doing ioctl()s. I'm simply going to operate on the principle that
> the multiple-ioctl() path is fine.
> It seems that -obtaining- references to vap->iv_bss isn't locked. So in
> (say) ieee80211_sta_join1() the iv_bss node can be dereferenced and freed.
> If this is going on concurrently with (say) something going on in the
> net80211 taskqueue (eg a newstate call) then I _think_ it's possible for
> the ath_newstate() code to get a reference to vap->iv_bss simultaneously
> with it being freed in ieee80211_sta_join1() (or similar.) So the
> ath_newstate() code will be assigned a 'ni' that has just been freed.
> I've seen another crash in the net80211_ht code where it _looks_ like the
> bss node wasn't entirely set up - bsschan was 0xffff - so the kernel panicked
> hard there.
> This likely explains a lot of the "weird stuff" people have been reporting.
> I also think the bgscan race is related to this - I can't help but wonder
> if the bgscan callout/event is also coinciding with wpa_supplicant doing
> stuff, and a race condition ends up leaving the vap w/ the sta power save
> flag set.
> I don't yet have a solution to all of this - I just wanted to brain dump
> what I've seen thus far.

Here is my brain dump.

A while ago the USB wifi drivers had a similar issue (a race in the
802.11 stack). It's worth checking this rev.
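
To make the iv_bss race in the quoted message concrete, here is a minimal
userland sketch in C of one possible way to close it: read the bss pointer
and bump its reference count under the same lock that guards replacing it.
This is only an illustration under assumptions, not net80211 code. The
struct and function names (struct node, struct vap, vap_bss_ref(),
vap_bss_replace(), vap_node_release()) are hypothetical stand-ins, and a
pthread mutex stands in for whatever kernel lock would really be used.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for ieee80211_node; not the real structure. */
struct node {
	int refcnt;		/* protected by the owning vap's lock */
	int chan;
};

/* Hypothetical stand-in for ieee80211vap. */
struct vap {
	pthread_mutex_t lock;	/* guards bss and every node refcnt */
	struct node *bss;	/* the vap itself holds one reference */
};

static int nodes_freed;		/* demo counter: how many nodes were freed */

/* Allocate a node carrying a single reference (owned by the caller). */
static struct node *
node_alloc(int chan)
{
	struct node *ni = calloc(1, sizeof(*ni));

	if (ni != NULL) {
		ni->refcnt = 1;
		ni->chan = chan;
	}
	return (ni);
}

/* Drop one reference; the caller must hold the vap lock. */
static void
node_unref_locked(struct node *ni)
{
	assert(ni->refcnt > 0);
	if (--ni->refcnt == 0) {
		nodes_freed++;
		free(ni);
	}
}

/*
 * Take a reference on the current bss node.  The pointer read and the
 * refcount bump happen under one lock hold, so a concurrent replace
 * cannot free the node in between.
 */
static struct node *
vap_bss_ref(struct vap *vap)
{
	struct node *ni;

	pthread_mutex_lock(&vap->lock);
	ni = vap->bss;
	if (ni != NULL)
		ni->refcnt++;
	pthread_mutex_unlock(&vap->lock);
	return (ni);
}

/* Release a reference previously obtained with vap_bss_ref(). */
static void
vap_node_release(struct vap *vap, struct node *ni)
{
	pthread_mutex_lock(&vap->lock);
	node_unref_locked(ni);
	pthread_mutex_unlock(&vap->lock);
}

/* Install a new bss node and drop the vap's reference on the old one. */
static void
vap_bss_replace(struct vap *vap, struct node *newni)
{
	struct node *old;

	pthread_mutex_lock(&vap->lock);
	old = vap->bss;
	vap->bss = newni;
	if (old != NULL)
		node_unref_locked(old);
	pthread_mutex_unlock(&vap->lock);
}
```

With this shape, a newstate-style consumer that called vap_bss_ref()
keeps a valid node even if a join-style path replaces the bss node in the
meantime; the old node is only freed when the last holder releases it.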


