[RFC] serialising net80211 TX

Thu Feb 14 05:14:56 UTC 2013

Hi,

I'd like to work on the net80211 TX serialisation now. I'll worry
about driver serialisation and if_transmit methods later.

The 30 second version - net80211 TX currently happens in parallel,
which means preemption and multi-core devices can and will hit lots of
subtle and hard-to-debug races in the TX path.

We actually need an end-to-end serialisation method - not only for the
802.11 state (sequence number, correct aggregation handling, etc) but
also to line up 802.11 sequence number allocation with the encryption
IV/PN values. Otherwise you end up with lots of crazy subtle out of
order packets occurring. The other problem is the seqno/CCMP IV race
between the raw transmit path and the normal transmit path. There are
other nagging issues that I'm trying to resolve - but, one thing at a
time.
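
To make the seqno/PN alignment point concrete, here is a minimal
userland sketch (pthreads, not net80211 code). All names in it
(tx_state, tx_assign, etc) are made up for illustration. The only idea
it shows is that the sequence number and the PN are allocated inside
the same critical section, so two frames can never get them in opposite
orders.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define SEQNO_MASK 0x0fffu            /* 802.11 sequence numbers are 12 bits */
#define PN_MASK    0xffffffffffffULL  /* CCMP packet numbers are 48 bits */

/* Hypothetical frame: only the two fields this sketch cares about. */
struct frame {
	uint16_t seqno;
	uint64_t pn;
};

/* Hypothetical TX state; real code tracks sequence numbers per TID. */
struct tx_state {
	pthread_mutex_t lock;
	uint16_t next_seqno;
	uint64_t next_pn;
};

static void
tx_state_init(struct tx_state *s)
{
	int error = pthread_mutex_init(&s->lock, NULL);

	assert(error == 0);
	(void)error;
	s->next_seqno = 0;
	s->next_pn = 1;
}

/*
 * Allocate the sequence number and the PN in one critical section.
 * A real TX path would also hand the frame to the driver before
 * dropping the lock, so the on-air order matches the allocation order.
 */
static void
tx_assign(struct tx_state *s, struct frame *f)
{
	assert(s != NULL && f != NULL);

	pthread_mutex_lock(&s->lock);
	f->seqno = s->next_seqno;
	s->next_seqno = (uint16_t)((s->next_seqno + 1) & SEQNO_MASK);
	f->pn = s->next_pn;
	/* Sketch only: a real implementation must rekey before the PN wraps. */
	s->next_pn = (s->next_pn + 1) & PN_MASK;
	pthread_mutex_unlock(&s->lock);
}
```

If the two counters were protected separately (or one was allocated in
net80211 and the other in the driver with a gap in between), frame A
could get the lower seqno but the higher PN, and the receiver's replay
check would drop one of them.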

So there are three current contenders:

1) wrap everything in net80211 TX in a per-vap TX lock; grab it at the
beginning of ieee80211_output() and ieee80211_start(), and don't
release it until the frame is queued to something (a power save queue,
an age queue, the driver.) That guarantees that the driver is called
in lock-step with each frame being processed.
2) do deferred transmit - ie, the net80211 entry points simply queue
mbufs to a queue, and a taskqueue runs over the ifnet queue and runs
those frames in-order. There's no need for a lock here as there's only
one sending context (either per-VAP or per-IC).
3) A hybrid setup - use a per-vap TX lock; do a try-acquire on it and
direct dispatch from the queue head if we have it; otherwise defer
frames into a queue and have a taskqueue handle those.

1) is what drivers like iwn(4) do internally.
2) is what I've tinkered with - but we become a slave to the
scheduler. Task switching is expensive and unpredictable; doubly so
for a non-preemptive kernel. We'd have to run the TX taskqueue at some
very high priority to get something resembling direct-dispatch
behaviour.
3) is what the gige/10ge drivers do. They hold a big TX lock for each
TX (from xxx_transmit() to hardware dispatch) and if they can't
acquire the TX lock, they defer it to a drbr lockless ring buffer and
service that via a taskqueue.
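
For reference, the try-acquire/defer pattern in 3) looks roughly like
the userland sketch below. It is an illustration under assumptions, not
driver code: the names (vap_tx, tx_transmit, tx_task) are hypothetical,
frames are plain ints, a small mutex-protected FIFO stands in for the
lockless ring, and a counter stands in for scheduling the taskqueue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TXQ_LEN 64

/* Stand-in for a lockless ring: a FIFO guarded by its own small lock. */
struct txq {
	pthread_mutex_t qlock;
	int frames[TXQ_LEN];
	size_t head;	/* next slot to dequeue */
	size_t tail;	/* next free slot */
};

struct vap_tx {
	pthread_mutex_t tx_lock;	/* hypothetical per-vap TX lock */
	struct txq q;
	int sent[TXQ_LEN];		/* dispatch order, kept for illustration */
	size_t nsent;
	atomic_uint deferred_kicks;	/* times we would kick the taskqueue */
};

static void
vap_tx_init(struct vap_tx *v)
{
	int error = pthread_mutex_init(&v->tx_lock, NULL);

	assert(error == 0);
	error = pthread_mutex_init(&v->q.qlock, NULL);
	assert(error == 0);
	(void)error;
	v->q.head = v->q.tail = 0;
	v->nsent = 0;
	atomic_init(&v->deferred_kicks, 0);
}

static bool
txq_enqueue(struct txq *q, int frame)
{
	bool ok = false;

	pthread_mutex_lock(&q->qlock);
	if (q->tail - q->head < TXQ_LEN) {
		q->frames[q->tail++ % TXQ_LEN] = frame;
		ok = true;
	}
	pthread_mutex_unlock(&q->qlock);
	return (ok);
}

static bool
txq_dequeue(struct txq *q, int *frame)
{
	bool ok = false;

	pthread_mutex_lock(&q->qlock);
	if (q->head != q->tail) {
		*frame = q->frames[q->head++ % TXQ_LEN];
		ok = true;
	}
	pthread_mutex_unlock(&q->qlock);
	return (ok);
}

/* Must be called with tx_lock held: dispatch from the queue head, in order. */
static void
tx_drain_locked(struct vap_tx *v)
{
	int frame;

	while (txq_dequeue(&v->q, &frame)) {
		if (v->nsent < TXQ_LEN)		/* "hardware dispatch" */
			v->sent[v->nsent++] = frame;
	}
}

/* TX entry point: always enqueue, then direct-dispatch if the lock is free. */
static bool
tx_transmit(struct vap_tx *v, int frame)
{
	if (!txq_enqueue(&v->q, frame))
		return (false);			/* queue full: drop */
	if (pthread_mutex_trylock(&v->tx_lock) == 0) {
		tx_drain_locked(v);
		pthread_mutex_unlock(&v->tx_lock);
	} else {
		/* Someone else is transmitting; leave it to the deferred task. */
		atomic_fetch_add(&v->deferred_kicks, 1);
	}
	return (true);
}

/* Deferred task body: block on the lock and drain whatever is queued. */
static void
tx_task(struct vap_tx *v)
{
	pthread_mutex_lock(&v->tx_lock);
	tx_drain_locked(v);
	pthread_mutex_unlock(&v->tx_lock);
}
```

Note that every frame goes through the queue, even on the direct
dispatch path; that is what keeps ordering intact when the lock holder
and a deferring thread race. A frame enqueued just as the holder
finishes draining is only picked up by the deferred task, so the
taskqueue kick is not optional.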

I can implement any of the above. Architecturally I'd prefer 2) - it
massively simplifies and streamlines things, but the scheduling
latency is just plain stab-worthy. I'm tempted to just do 1) for now
and turn it into 3) if we need to.

The main reason against doing 1) (and why 2) is nicer) is recursion -
if the TX path wants to call the net80211 TX code for some odd reason,
we'll hit lock recursion. I'd rather have the system crash at this
point (and then fix the misbehaving driver) but that's just me.

So - what do people think?

Once this is done I'd like to make sure that the wifi chipset drivers
do the same - ie, ensure that the frame order is preserved between
the normal and the raw xmit paths.
That should fix all of the odd CCMP out of order crap that I see under
heavy, heavy test conditions.

Thanks,

Adrian