ath 11n update: locking, regressions, testing

Tue Oct 25 05:42:47 UTC 2011

Hi all,

In working out kinks in how the ath driver handles things during an
interface reset or reconfigure, I've discovered a couple of things:

* all the frames in the tx/rx queues are just dropped on the floor,
which breaks 802.11n aggregation state in very bad ways; and
* there's just a lot of cases where things can overlap each other,
causing temporary situations of unpredictability (and likely some of
the "wtf?" situations that people have been seeing with the stock
freebsd ath driver.)

I've been tinkering with some code which does two things:

* it introduces locking of most operational paths, much like how the
iwn driver does things, and
* in cases of a recoverable reset (eg a configuration change such as
enabling/disabling ANI, or a stuck beacon condition or watchdog
timeout), the TX/RX queues are properly handled:
  + the TX queue is kept how it is and DMA is simply restarted once
the reset completes;
  + the RX queue is completely drained before the reset occurs, so all
frames are thrown at net80211 and handled.

For reference, linux doesn't do this kind of locking - the ath9k
driver hides reset, channel change and RX behind a new lock - the "PCU
lock" - whilst the TX path doesn't use the PCU lock, it just uses the
normal per-queue TX locking that freebsd/ath9k currently has.

Now, this has shown a few issues which make me kind of wary about
committing this as-is and I'd like some testing and feedback before I
am even remotely comfortable.

Firstly - holding any lock during _both_ TX _and_ RX causes all kinds
of LOR'ing in the IP stack, leading to very quick deadlocks.
The iwn driver releases the IWN_LOCK before calling ieee80211_input,
so it avoids this - but by doing that, it too could lead to some
inconsistencies (eg another process could cause a state change during
active RX.)
So I fixed that in my code - and yes, this means that things such as
ath_start() (ie, the normal TX path), or ath_raw_xmit() (the raw TX
path, for things like management/probe frames) can occur. So I
introduced some variables which keep track of whether the code is in
tx, tx complete, rx process, rx tasklet and reset. This way the driver
can avoid situations where:

* there's a reset, or rx process occuring, which grabs the lock;
* some other task occurs - say, you change the ANI configuration, or
add/remove a VAP;
* rxproc releases the lock to call net80211_input;
* .. and the configuration change, VAP change, etc occurs;
* .. then rxproc continues trying to process RX frames, after the
hardware has been reset.

Now this kind of thing currently can occur, so I'm leaning to just
ignoring it and adding very loud printf()s if the situation is
detected. It can then be fixed at a later stage.

Next - the net80211 scan code (which ath uses, but iwn doesn't as the
firmware does its own scanning) holds the net80211 interface lock
(comlock) during the duration of sending probe requests to a channel.
This means that when scanning, this occurs:

* grab comlock;
* iterate over scan cache contents, sending probe requests
* each probe request calls ath_raw_xmit(), which grabs ATH_LOCK for
the duration of the TX.

Then for normal TX completion:

* ATH_LOCK is grabbed;
* the completed frame has net80211_free_node() called, which grabs the comlock

.. and the above is a very obvious LOR.

I'm not sure how to resolve/fix this without either:

* abandoning my idea for locking the ath driver this way and instead
having to shoehorn in some more complicated state change / taskqueue
drain code (eg so on an operation change, all RX/TX is suspended in
its entirety before the reset occurs); which will require some
net80211 cooperation as the ic_raw_xmit() API can be called by any
thing, and is typically done without any locks being held (except for
the scan code example), so it's likely this will require some
complicated hackery to avoid this;
* fix the scan cache code to not hold the comlock whilst TX'ing probe
request frames - and this likely requires introducing some kind of
generation count to the scan cache contents and if the generation
count changes, the scan process restarts TX'ing probe frames or
something.

That said, this still doesn't fix all of the ath locking problems (and
likely some net80211 locking problems too), which seem increasingly
like they should be fixed, as they seem to be making my 802.11n ath
work _much_ more complicated than it needs to be..

So, I'd like some testing of the attached patch against  my if_ath_tx
branch, and I'd like some feedback/comments. You can trigger beacon
stuck conditions by:

# sysctl dev.ath.X.forcebstuck=1

So you can spam that and generate errors/resets (along with the
occasional hardware error, which is kind of surprising, and I'm going
to try and track down.)

Thanks!

Adrian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lock-enable-9.diff
Type: application/octet-stream
Size: 21800 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-wireless/attachments/20111025/be33a8a8/lock-enable-9.obj