kern/162647: [ath] 11n TX aggregation session / TX hang

Fri Nov 18 02:20:03 UTC 2011

>Number:         162647
>Category:       kern
>Synopsis:       [ath] 11n TX aggregation session / TX hang
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Nov 18 02:20:02 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Adrian Chadd
>Release:        10.0-CURRENT
>Organization:
FreeBSD
>Environment:
FreeBSD unknown 10.0-CURRENT FreeBSD 10.0-CURRENT #37: Thu Jan  1 08:00:00 WST 1970     adrian at dummy:/home/adrian/work/freebsd/git/adrianchadd-freebsd-work/obj/mipseb/mips.mipseb/home/adrian/work/freebsd/git/adrianchadd-freebsd-work/adrianchadd-freebsd-work/sys/RSPRO  mips

(it's a recent -HEAD, ignore the date.)

hostap: AR9227

ath1: <Atheros 9227> irq 1 at device 18.0 on pci0
ath1: [HT] enabling HT modes
ath1: [HT] enabling short-GI in 20MHz mode
ath1: [HT] 2 RX streams; 2 TX streams
ath1: AR9227 mac 384.2 RF5133 phy 15.15

This also includes my git fixes for correctly handling packet queue flushes during reset, but this bug will occur regardless.
>Description:
A node flush leaves the block-ack (BA) window in an inconsistent state, resulting in TX timeouts.

It's currently unknown why ath_tx_tid_drain() was called. It has two callers:

* ath_tx_txq_drain()
* ath_tx_node_flush()

The former is called during ath_tx_draintxq(); the latter is called from ath_node_cleanup(). So the drain was triggered either by ath_reset() (with ATH_RESET_DEFAULT or ATH_RESET_FULL) or via ic_node_cleanup.

A log snippet:

ath1: ath_tx_aggr_comp_aggr: TID 0: send BAR; seq 3678
ath1: ath_tx_aggr_comp_aggr: TID 0: send BAR; seq 3718
ath1: ath_tx_aggr_comp_aggr: TID 0: send BAR; seq 3742
ath1: ath_tx_aggr_comp_aggr: TID 0: send BAR; seq 3784
ath1: stuck beacon; resetting (bmiss count 4)
ath1: ath_tx_tid_drain: node 0xc0927000: tid 0: txq_depth=2, txq_aggr_depth=2, sched=0, paused=0, hwq_depth=2, incomp=0, baw_head=103, baw_tail=38 txa_start=3396, ni_txseqs=3861
ath1: ath_tx_tid_drain: wasn't added: seqno 3459
ath1: ath_tx_tid_drain: wasn't added: seqno 3460
.
.
ath1: ath_tx_tid_drain: wasn't added: seqno 3857
ath1: ath_tx_tid_drain: wasn't added: seqno 3858
ath1: ath_tx_tid_drain: wasn't added: seqno 3859
ath1: ath_tx_tid_drain: wasn't added: seqno 3860
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: ath_tx_default_comp: dobaw should've been cleared!
ath1: device timeout
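
To make the numbers in the ath_tx_tid_drain line easier to read, here is a minimal sketch of the sequence-number arithmetic behind a BA window. It is not the driver's code: ba_offset(), ba_within(), SEQ_RANGE and BAW_SIZE are hypothetical names used only for illustration, and the 64-frame window is an assumption (64 is the 802.11n maximum; the session may have negotiated less). The only fixed fact used is that 802.11 sequence numbers are 12 bits wide and wrap at 4096.

```c
#include <stdio.h>

/* 802.11 sequence numbers are 12 bits wide, so they wrap at 4096. */
#define SEQ_RANGE 4096u

/* ASSUMPTION: a 64-frame BA window; the real session may use fewer. */
#define BAW_SIZE 64u

/*
 * Hypothetical helper: distance of seqno from the window start,
 * taken modulo the sequence space so that wraparound is handled.
 */
static unsigned int
ba_offset(unsigned int start, unsigned int seqno)
{
	return (seqno - start) & (SEQ_RANGE - 1u);
}

/*
 * Hypothetical helper: non-zero when seqno lies in
 * [start, start + BAW_SIZE) modulo SEQ_RANGE.
 */
static int
ba_within(unsigned int start, unsigned int seqno)
{
	return ba_offset(start, seqno) < BAW_SIZE;
}

/* Print where a sequence number sits relative to the window start. */
static void
ba_report(unsigned int start, unsigned int seqno)
{
	printf("seqno %u: offset %u, %s the window\n", seqno,
	    ba_offset(start, seqno),
	    ba_within(start, seqno) ? "inside" : "outside");
}
```

Under that assumed window size and with txa_start=3396 from the log, seqno 3459 sits at offset 63 (the last slot of the window), 3460 sits at offset 64 (just past it), and 3860 sits at offset 464. In other words, nearly all of the frames reported as "wasn't added" are far beyond where a 64-frame window starting at 3396 could reach.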

>How-To-Repeat:
Just general hostap use. The question is how/why the node flush occurred.
>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: