RFT: if_ath HAL refactoring

Thu Sep 23 00:35:39 UTC 2010

----- Original Message ----
> From: Rui Paulo <rpaulo at FreeBSD.org>
> To: PseudoCylon <moonlightakkiy at yahoo.ca>
> Cc: Bernhard Schmidt <bschmidt at techwires.net>; freebsd-current at freebsd.org;
>     Adrian Chadd <adrian at freebsd.org>
> Sent: Wed, September 22, 2010 4:48:14 PM
> Subject: Re: RFT: if_ath HAL refactoring
> 
> On 22 Sep 2010, at 23:42, PseudoCylon wrote:
> 
> > 
> > ----- Original Message ----
> >> From: Bernhard Schmidt  <bschmidt at techwires.net>
> >>  To: freebsd-current at freebsd.org
> >>  Cc: PseudoCylon <moonlightakkiy at yahoo.ca>; Adrian Chadd <adrian at freebsd.org>
> >> Sent:  Wed, September 22, 2010 12:09:36 AM
> >> Subject: Re: RFT: if_ath HAL  refactoring
> >> 
> >> On Wednesday, September 22, 2010 06:04:49  PseudoCylon wrote:
> >>> -----  Original Message  ----
> >>> 
> >>>> From: Adrian Chadd <adrian at freebsd.org>
> >>>>  To:  PseudoCylon <moonlightakkiy at yahoo.ca>
> >>>>  Cc: freebsd-current at freebsd.org
> >>>>  Sent: Tue, September 21, 2010 7:04:37 AM
> >>>> Subject: Re:  RFT:  if_ath HAL refactoring
> >>>> 
> >>>> On 21 September 2010 11:58, PseudoCylon <moonlightakkiy at yahoo.ca> wrote:
> >>>>> Just in case anyone wonders, I've added  11n support to  run(4)  (USB
> >>>>> NIC). http://gitorious.org/run/run/trees/11n_beta2
> >>>>> 
> >>>>> It still has some issues,
> >>>>> 
> >>>>> *  doesn't work well with atheros  chips
> >>>>> 
> >>>>>  * HT + AP + bridge  = Tx may stall (seems OK with nat)
> >>>>> 
> >>>>> So, use it at your  own  discretion.
> >>>> 
> >>>> Want to put together a  patch?
> >>> 
> >>> sure!
> >>> 
> >>>> Does  it introduce  issues in the non-11n  case?
> >>> 
> >>> No, only in 11n   mode.
> >>> 
> >>> What I have found so far is that Ralink's driver checks the MAC address
> >>> of the other end and identifies an Atheros chip by its OUI. Then it sets
> >>> a special prot mode for it. Does this ring a bell?
> >> 
> >> Are you sure that this is based on the actual MAC addresses? Atheros
> >> drivers tend to announce additional capabilities in beacons and probe
> >> responses.
> > 
> > It is based on the actual MAC, but the OUI is Broadcom's (00904c). Sorry.
> > 
> >> 
> >>> Has the node lock in ieee80211_node_timeout() caused a deadlock in HT + AP +
> >>> bridge?
> >> 
> >> I'm not aware of any issues there, though I'm not very familiar with HT
> >> use cases.
> > 
> > I attached witness messages. Those 2 LORs always happen together before
> > the deadlock. I hooked iv_input() and unlock/lock the node lock there to
> > avoid the deadlock. (I don't know if it's safe.)
> > 
> > I wonder if this is a run(4)-specific problem.
> > 
> > 
> > AK
> > 
> > 
> > lock  order reversal:
> > 1st 0xffffff8000a267d0 run0_node_lock (run0_node_lock) @ 
> > /usr/src/sys/net80211/ieee80211_node.c:1360
> > 2nd  0xffffff0001716818 if_bridge (if_bridge) @ 
> >  /usr/src/sys/net/if_bridge.c:2184
> > KDB: stack backtrace:
> >  db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> >  _witness_debugger() at _witness_debugger+0x2e
> > witness_checkorder() at  witness_checkorder+0x81e
> > _mtx_lock_flags() at  _mtx_lock_flags+0x78
> > bridge_input() at bridge_input+0x7e
> >  ether_input() at ether_input+0x143
> > hostap_input() at  hostap_input+0x4ea
> > ampdu_rx_flush() at ampdu_rx_flush+0x5e
> >  ieee80211_ht_node_age() at ieee80211_ht_node_age+0x7b
> >  ieee80211_node_timeout() at ieee80211_node_timeout+0x2dc
> > softclock() at  softclock+0x2a0
> > intr_event_execute_handlers() at  intr_event_execute_handlers+0x66
> > ithread_loop() at  ithread_loop+0xb2
> > fork_exit() at fork_exit+0x12a
> >  fork_trampoline() at fork_trampoline+0xe
> > --- trap 0, rip = 0, rsp =  0xffffff8000052d30, rbp = 0 ---
> > 
> > lock order reversal:
> >  1st 0xffffff8000a267d0 run0_node_lock (run0_node_lock) @ 
> >  /usr/src/sys/net80211/ieee80211_node.c:1360
> > 2nd 0xffffffff80a186c8 tcp  (tcp) @ /usr/src/sys/netinet/tcp_input.c:498
> > KDB: stack  backtrace:
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
> >  _witness_debugger() at _witness_debugger+0x2e
> > witness_checkorder() at  witness_checkorder+0x81e
> > _rw_rlock() at _rw_rlock+0x5f
> >  tcp_input() at tcp_input+0xa58
> > ip_input() at ip_input+0xbc
> >  netisr_dispatch_src() at netisr_dispatch_src+0xb8
> > ether_demux() at  ether_demux+0x17d
> > ether_input() at ether_input+0x175
> >  hostap_input() at hostap_input+0x4ea
> > ampdu_rx_flush() at  ampdu_rx_flush+0x5e
> > ieee80211_ht_node_age() at  ieee80211_ht_node_age+0x7b
> > ieee80211_node_timeout() at  ieee80211_node_timeout+0x2dc
> > softclock() at softclock+0x2a0
> >  intr_event_execute_handlers() at intr_event_execute_handlers+0x66
> >  ithread_loop() at ithread_loop+0xb2
> > fork_exit() at  fork_exit+0x12a
> > fork_trampoline() at fork_trampoline+0xe
> > ---  trap 0, rip = 0, rsp = 0xffffff8000052d30, rbp = 0 --- 
> 
> Can you explain why the run0_node_lock is locked? I don't have the code at
> hand.
> 
> Regards,
> --
> Rui Paulo
> 
> 

I don't know why, but I know where.

run0_node_lock is locked at ieee80211_node.c:1917
ieee80211_node_timeout() -> ieee80211_timeout_stations()
http://fxr.watson.org/fxr/source/net80211/ieee80211_node.c?im=bigexcerpts#L1917

ieee80211_node.c:1360 (the one witness reports)
hostap_input() -> hostap_deliver_data() -> ieee80211_find_vap_node() -> lock
@ ieee80211_node.c:1360 (I think it's recursed.)

Also, run(4) calls ieee80211_iterate_nodes() once per second for ratectl
(locks @ ieee80211_node.c:2138).

Each one has its own reason to lock, I guess.

My workaround:
http://gitorious.org/run/run/blobs/11n_beta2/dev/usb/wlan/if_run.c :1865
It unlocks the lock taken in ieee80211_timeout_stations(), which is held for a
long time.
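
To make the shape of that workaround concrete, here is a toy userland sketch in plain C. It is not the code in if_run.c and it does not use the net80211 or mutex(9) APIs: toy_lock, deliver_frame() and both age_nodes_*() functions are made-up names for illustration only. It only shows the ordering: in the first path the "bridge" lock is taken while the "node" lock is still held (the order witness complains about), and in the second path the node lock is dropped before the frame is delivered and retaken afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: only tracks whether it is held. Hypothetical, not mutex(9). */
struct toy_lock { bool held; };

static struct toy_lock node_lock;   /* stands in for run0_node_lock */
static struct toy_lock bridge_lock; /* stands in for the if_bridge lock */

/* Counts how often bridge_lock was taken while node_lock was held. */
static int nested_acquires;

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);   /* no recursion in this toy model */
    l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Stand-in for the input path that ends up taking the bridge lock. */
static void deliver_frame(void)
{
    if (node_lock.held)
        nested_acquires++;  /* node lock first, bridge lock second */
    toy_lock_acquire(&bridge_lock);
    /* ... frame would be handed to the upper layer here ... */
    toy_lock_release(&bridge_lock);
}

/* Timeout path as in the trace: frames are delivered under the node lock. */
static void age_nodes_locked(void)
{
    toy_lock_acquire(&node_lock);
    deliver_frame();
    toy_lock_release(&node_lock);
}

/* Workaround shape: drop the node lock around delivery, then retake it. */
static void age_nodes_unlocked_delivery(void)
{
    toy_lock_acquire(&node_lock);
    toy_lock_release(&node_lock);   /* drop before calling up the stack */
    deliver_frame();
    toy_lock_acquire(&node_lock);   /* retake; node state may have changed */
    toy_lock_release(&node_lock);
}
```

Calling age_nodes_locked() bumps nested_acquires, while age_nodes_unlocked_delivery() does not. The open question from above stays open in the sketch too: whatever the node lock was protecting can change while it is dropped, so I still don't know if this is safe in the real driver.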

Hope this is what you wanted to know.

AK