misc/165060: [ath] vap->iv_bss race conditions causing crashes inside ath_beacon_alloc and similar
adrian at FreeBSD.org
Sun Feb 12 23:40:06 UTC 2012
>Synopsis: [ath] vap->iv_bss race conditions causing crashes inside ath_beacon_alloc and similar
>Arrival-Date: Sun Feb 12 23:40:05 UTC 2012
>Originator: Adrian Chadd
>Release: 9.0-RELEASE, running -HEAD ath/net80211
There are a variety of crashes inside the ath driver which can be traced down to races between modifying, reallocating and freeing vap->iv_bss.
From an email I sent to freebsd-wireless:
I've noticed some kernel panics in net80211/ath in -HEAD. In every instance it boils down to a now-invalid ieee80211_node: either it has only been partially allocated/copied, or it has recently been freed.
This became increasingly obvious when doing DFS CAC, as the kernel was now changing the channel quite frequently on me whilst simulating/processing radar events. I've since found I can mostly reproduce it in the lab (when surrounded by ridiculous levels of RX interference traffic, triggering all kinds of events) whilst creating/destroying VAPs.
Now that I have debugging code in place (which, as a side effect, makes it much harder to cause a crash, let alone tickle the race condition), it's glaringly obvious what's going on.
There are five contexts in which stuff can occur, at least in the net80211/ath case:
* the swi (i.e. ath_intr(), ath_beacon_proc);
* the ath taskqueue;
* the net80211 taskqueue;
* the ioctl() context, coming up from a userland process;
* a callout running in the clock thread.
Now, callouts should _hopefully_ be grabbing and releasing locks correctly. We've found a few spots where they weren't (leading to quite silly state races and crashes).
I'm going to ignore the obvious possible problems with multiple concurrent processes doing ioctl()s. I'm simply going to operate on the principle that the multiple-ioctl() path is fine.
It seems that -obtaining- references to vap->iv_bss isn't locked. So in (say) ieee80211_sta_join1() the iv_bss node can be dereferenced and freed. If this is going on concurrently with (say) something in the net80211 taskqueue (e.g. a newstate call), then I _think_ it's possible for the ath_newstate() code to obtain a reference to vap->iv_bss at the same moment it is being freed in ieee80211_sta_join1() (or similar). The ath_newstate() code is then handed an 'ni' that has just been freed.
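To make the race above concrete, here is a minimal userland sketch in C contrasting the two ways of getting at iv_bss: copying the pointer with no lock or reference held (racy), versus loading it and bumping a reference count under the same lock the writer holds when it swaps or frees the node. Everything here is hypothetical: struct node, struct vap, bss_lock(), node_alloc(), node_release() and vap_get_bss() are illustrative stand-ins, not the net80211 API (net80211 has its own node reference counting, which this does not reproduce), and a C11 atomic_flag spinlock stands in for a kernel mutex.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for ieee80211_node and ieee80211vap. */
struct node {
    atomic_int refcnt;
    int chan;                 /* stand-in for per-node state such as the BSS channel */
};

struct vap {
    atomic_flag bss_lock;     /* stand-in for the kernel lock that should guard iv_bss */
    struct node *bss;         /* stand-in for vap->iv_bss; the vap owns one reference */
};

static void bss_lock(struct vap *vap)
{
    while (atomic_flag_test_and_set_explicit(&vap->bss_lock, memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

static void bss_unlock(struct vap *vap)
{
    atomic_flag_clear_explicit(&vap->bss_lock, memory_order_release);
}

/* Allocate a node holding one reference, owned by whoever stores the pointer. */
static struct node *node_alloc(int chan)
{
    struct node *ni = malloc(sizeof(*ni));

    if (ni == NULL)
        return NULL;
    atomic_init(&ni->refcnt, 1);
    ni->chan = chan;
    return ni;
}

/* Drop one reference; whoever drops the last one frees the node. */
static void node_release(struct node *ni)
{
    if (atomic_fetch_sub(&ni->refcnt, 1) == 1)
        free(ni);
}

static void vap_init(struct vap *vap, struct node *bss)
{
    atomic_flag_clear(&vap->bss_lock);
    vap->bss = bss;
}

/*
 * The racy pattern is "ni = vap->bss;" followed by using ni with no
 * reference held: another context can free the node in between.
 * Here the load and the refcount bump happen under the lock, so the
 * caller gets a node that stays valid until it calls node_release().
 */
static struct node *vap_get_bss(struct vap *vap)
{
    struct node *ni;

    bss_lock(vap);
    ni = vap->bss;
    if (ni != NULL)
        atomic_fetch_add(&ni->refcnt, 1);
    bss_unlock(vap);
    return ni;
}
```

The lock only helps if every context that replaces or frees iv_bss takes it too; a reader-side lock alone does not close the window.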
I've seen another crash in the net80211_ht code where it _looks_ like the bss node wasn't entirely set up (bsschan was 0xffff), so the kernel panicked hard there.
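The partially set-up node suggests a second rule: build the replacement node completely, and only then publish it. A minimal sketch using the same kind of hypothetical stand-ins (none of these names come from net80211; the spinlock again stands in for a kernel lock): the writer fully initialises the new node, swaps the pointer under the lock, and drops the vap's reference to the old node afterwards, so a reader that already took its own reference keeps using valid memory instead of a freed or half-built node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the net80211 structures. */
struct node { atomic_int refcnt; int chan; };
struct vap  { atomic_flag bss_lock; struct node *bss; };

static void bss_lock(struct vap *vap)
{
    while (atomic_flag_test_and_set_explicit(&vap->bss_lock, memory_order_acquire))
        ;   /* spin */
}

static void bss_unlock(struct vap *vap)
{
    atomic_flag_clear_explicit(&vap->bss_lock, memory_order_release);
}

static struct node *node_alloc(int chan)
{
    struct node *ni = malloc(sizeof(*ni));

    if (ni == NULL)
        return NULL;
    atomic_init(&ni->refcnt, 1);   /* this reference will belong to the vap */
    ni->chan = chan;               /* all state is filled in before publication */
    return ni;
}

static void node_release(struct node *ni)
{
    if (atomic_fetch_sub(&ni->refcnt, 1) == 1)
        free(ni);
}

static void vap_init(struct vap *vap, struct node *bss)
{
    atomic_flag_clear(&vap->bss_lock);
    vap->bss = bss;
}

/* Reader: take a reference under the lock (see the race described above). */
static struct node *vap_get_bss(struct vap *vap)
{
    struct node *ni;

    bss_lock(vap);
    ni = vap->bss;
    if (ni != NULL)
        atomic_fetch_add(&ni->refcnt, 1);
    bss_unlock(vap);
    return ni;
}

/*
 * Writer: construct the new node fully, publish it with a single pointer
 * swap under the lock, then drop the vap's reference to the old node.
 * Readers never see a half-initialised node, and a reader still holding
 * the old node keeps it alive until its own node_release().
 */
static int vap_replace_bss(struct vap *vap, int chan)
{
    struct node *newbss, *old;

    newbss = node_alloc(chan);
    if (newbss == NULL)
        return -1;

    bss_lock(vap);
    old = vap->bss;
    vap->bss = newbss;
    bss_unlock(vap);

    if (old != NULL)
        node_release(old);
    return 0;
}
```

Releasing the old reference outside the lock keeps the critical section to a pointer swap; whether a real driver may free with a particular lock held depends on the lock type, which this sketch does not model.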
This likely explains a lot of the "weird stuff" people have been reporting. I also think the bgscan race is related to this. I can't help but wonder if the bgscan callout/event is also coinciding with wpa_supplicant doing stuff, and a race condition ends up leaving the vap with the sta power save flag set.
I don't yet have a solution to all of this - I just wanted to brain dump what I've seen thus far.
It's unfortunately not easy to reproduce in a clean environment. It seemed very easy to reproduce in a radio-noisy environment where the RX handler is constantly being scheduled.
Someone pointed this out:
Here is my brain dump.
A while ago, the USB wifi drivers had a similar issue (a race in the 802.11 stack). It's worth checking this rev.