Hard(?) lock when reassociating ath with wpa_supplicant on RELENG_7

Wed Aug 27 01:11:13 UTC 2008

On Sat, 2008-07-12 at 21:26 -0700, Sam Leffler wrote:
> Alexandre "Sunny" Kovalenko wrote:
> > On Sat, 2008-07-12 at 09:57 -0700, Sam Leffler wrote:
> >   
> >> Alexandre "Sunny" Kovalenko wrote:
> >>     
> >>> On Fri, 2008-07-11 at 20:29 -0700, Sam Leffler wrote:
> >>>   
> >>>       
> >>>> Alexandre "Sunny" Kovalenko wrote:
> >>>>     
> >>>>         
> >>>>> On Fri, 2008-05-16 at 12:23 -0400, Sam Leffler wrote:
> >>>>>       
> >>>>>           
> >>>>>> Alexandre "Sunny" Kovalenko wrote:
> >>>>>>         
> >>>>>>             
> >>>>>>> On Mon, 2008-05-12 at 19:33 -0700, Sam Leffler wrote: 
> >>>>>>>           
> >>>>>>>               
> >>>>>>>> Alexandre "Sunny" Kovalenko wrote:
> >>>>>>>>             
> >>>>>>>>                 
> >>>>>>>>> I seem to be able to lock my machine by going into wpa_cli and asking it
> >>>>>>>>> to 'reassoc'. The reason for question mark after "hard" is that debug
> >>>>>>>>> information (caused by wlandebug and athdebug) is being printed on the
> >>>>>>>>> console. The only way to get machine's attention is to hold power button
> >>>>>>>>> for 8 seconds.
> >>>>>>>>>               
> >>>>>>>>>                   
> >>>>>>>> So this is just livelock due to console debug msgs.
> >>>>>>>>             
> >>>>>>>>                 
> >>>>>>> I am not sure, I have parsed this well enough, so I will try to clarify:
> >>>>>>> machine becomes unresponsive *without* any debugging turned on, to an
> >>>>>>> extent that pressing the power button twice *does not* generate ACPI
> >>>>>>> console message (something to the tune of "going into S5 already --
> >>>>>>> gimme a break"). If I turn ath debugging on, I do see those messages,
> >>>>>>> and only them, scrolling on the console.
> >>>>>>>           
> >>>>>>>               
> >>>>>> Guess I misunderstood you.
> >>>>>>         
> >>>>>>             
> >>>>> I have finally got enough time and equipment to investigate this
> >>>>> further. Here are some conclusions:
> >>>>>
> >>>>> -- at this point (RELENG_7 as of July 9th around 15:30 EST) it is indeed
> >>>>> a livelock.
> >>>>>
> >>>>> -- all system does, is executing ath_intr (if_ath.c) in the tight loop
> >>>>> with the same status -- 0x1000 AKA HAL_INT_MIB. In order to eliminate
> >>>>> possibility that I have caused livelock with the debug messages, I have
> >>>>> put conditional panic() into ath_intr, as soon as sc->sc_stats.ast_mib
> >>>>> reaches 10,000. Without any kind of the debug messages, it will be
> >>>>> triggered within 40-60 seconds after starting of wpa_supplicant.
> >>>>>
> >>>>> -- I suspect that comment below, might not hold true on my equipment
> >>>>> if (status & HAL_INT_MIB) {
> >>>>>    sc->sc_stats.ast_mib++;
> >>>>>    /*
> >>>>>     * Disable interrupts until we service the MIB
> >>>>>     * interrupt; otherwise it will continue to fire.
> >>>>>     */
> >>>>>    ath_hal_intrset(ah, 0);
> >>>>>    /*
> >>>>>     * Let the hal handle the event.  We assume it will <============
> >>>>>     * clear whatever condition caused the interrupt.   <============
> >>>>>     */
> >>>>>     ath_hal_mibevent(ah, &sc->sc_halstats);
> >>>>>     ath_hal_intrset(ah, sc->sc_imask);
> >>>>> }
> >>>>>
> >>>>> My hardware is:
> >>>>> ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413,
> >>>>> RF5413)
> >>>>> ath0: <Atheros 5212> mem 0xedf00000-0xedf0ffff irq 17 at device 0.0 on
> >>>>> pci3
> >>>>> ath0: [ITHREAD]
> >>>>> ath0: using obsoleted if_watchdog interface
> >>>>> ath0: Ethernet address: 00:16:cf:26:2f:3f
> >>>>> ath0: mac 10.3 phy 6.1 radio 10.2
> >>>>>
> >>>>> My wpa_supplicant.conf is:
> >>>>> ctrl_interface=/var/run/wpa_supplicant
> >>>>> ctrl_interface_group=wheel
> >>>>> eapol_version=2
> >>>>>
> >>>>> network={
> >>>>>   ssid="XXXXXXXXXXX"
> >>>>>   scan_ssid=1
> >>>>>   priority=1
> >>>>>   proto=WPA
> >>>>>   pairwise=TKIP
> >>>>>   group=TKIP
> >>>>>   key_mgmt=WPA-PSK
> >>>>>   psk="xxxxxxxxxxxxxxxxxxxxxx"
> >>>>> }
> >>>>>
> >>>>> Access point is Netgear WNDR3300-1B with 11N and 11G SSID set up to
> >>>>> different values. Only 11G SSID is configured in wpa_supplicant.conf. In
> >>>>> the test setup, AP is with 10' (3m) from the laptop.
> >>>>>
> >>>>> AP is successfully used by handful of Windows clients (including this
> >>>>> same laptop) and iBook G4.
> >>>>>
> >>>>> Neither wpa_supplicant with '-d -d' nor wlandebug 0xFFFFFFFF show
> >>>>> anything but normal scan. 
> >>>>>
> >>>>> athdebug 0xFFFFFFFF loops with ath_intr: status 0x1000 until I power
> >>>>> down my laptop.
> >>>>>
> >>>>> I would appreciate any suggestion on what I can investigate further --
> >>>>> at this point I have comfortable console setup and should be able to
> >>>>> field requests for further information much better.
> >>>>>
> >>>>>       
> >>>>>           
> >>>> Are you running powerd?
> >>>>     
> >>>>         
> >>> I do. And I just tried disabling it, and I could not reproduce the
> >>> problem any more. Is there any way to reconcile if_ath with powerd?
> >>>
> >>>   
> >>>       
> >> Don't know.  There appear to be two issues.  When the MIB interrupts 
> >> arrive the kernel may service them w/ the cpu at a reduced clock 
> >> frequency.  Since powered is currently the only mechanism for increasing 
> >> the frequency and it runs in user space it can take a while to bump the 
> >> clock rate leading to livelock (because the logic to reduce the _cause_ 
> >> of the MIB interrupt takes a long time to run).  John Baldwin suggested 
> >> raising the clock frequency when handling interrupts in the kernel but 
> >> nothing has been done to make that happen.
> >>
> >> Separately there is a question as to why the MIB interrupts are 
> >> happening at all.  This is possibly due to misprogramming of the 
> >> baseband h/w in the ath card.  Unfortunately I've been trying to get 
> >> Atheros to help understand/resolve this question for a very long time 
> >> (as their code also exhibits this behaviour) but they've been 
> >> unresponsive.  I have some experimental code to address this in new hal 
> >> versions (such as 0.10.5.6 available in http://www.freebsd.org/~sam) but 
> >> apparently it does not entirely fix the problem.
> >>     
> > Would it be of any value to you, if I build the new hal and see what
> > happen? I can live without powerd as the workaround, but I'd rather help
> > if I can.
> >
> >   
> Testing the latest hal is always useful but I've not MFC'd many ath 
> changes to RELENG_7 and some people have reported problems on RELENG_7 
> w/ the new hal that do not occur on HEAD.

I have finally carved enough quality time to see whether new HAL makes
any difference WRT the issue, and I am happy to report that, with the
HAL version 0.10.5.6, I could no longer reproduce it. I am seeing whole
lot of messages

update_stats: bogus ndx0 -1, max 12, mode 3

and 

bogus rix 255, max 12, mode 3

scrolling by, but I don't know whether these are related to the problem,
I have seen with the previous version of the HAL, or the downside of
using new HAL with RELENG_7.

Reverting the HAL brings about lockups pretty reliably.

At this point I am very happy with my machine, however, if there is any
additional information that I can provide, or any patches to test, I
will be more than happy to oblige.

Thank you very much for doing this work.

-- 
Alexandre "Sunny" Kovalenko (Олександр Коваленко)