Hard(?) lock when reassociating ath with wpa_supplicant on RELENG_7

Sat Jul 12 16:57:03 UTC 2008

Alexandre "Sunny" Kovalenko wrote:
> On Fri, 2008-07-11 at 20:29 -0700, Sam Leffler wrote:
>   
>> Alexandre "Sunny" Kovalenko wrote:
>>     
>>> On Fri, 2008-05-16 at 12:23 -0400, Sam Leffler wrote:
>>>       
>>>> Alexandre "Sunny" Kovalenko wrote:
>>>>         
>>>>> On Mon, 2008-05-12 at 19:33 -0700, Sam Leffler wrote: 
>>>>>           
>>>>>> Alexandre "Sunny" Kovalenko wrote:
>>>>>>             
>>>>>>> I seem to be able to lock my machine by going into wpa_cli and asking it
>>>>>>> to 'reassoc'. The reason for question mark after "hard" is that debug
>>>>>>> information (caused by wlandebug and athdebug) is being printed on the
>>>>>>> console. The only way to get machine's attention is to hold power button
>>>>>>> for 8 seconds.
>>>>>>>               
>>>>>> So this is just livelock due to console debug msgs.
>>>>>>             
>>>>> I am not sure, I have parsed this well enough, so I will try to clarify:
>>>>> machine becomes unresponsive *without* any debugging turned on, to an
>>>>> extent that pressing the power button twice *does not* generate ACPI
>>>>> console message (something to the tune of "going into S5 already --
>>>>> gimme a break"). If I turn ath debugging on, I do see those messages,
>>>>> and only them, scrolling on the console.
>>>>>           
>>>> Guess I misunderstood you.
>>>>         
>>> I have finally got enough time and equipment to investigate this
>>> further. Here are some conclusions:
>>>
>>> -- at this point (RELENG_7 as of July 9th around 15:30 EST) it is indeed
>>> a livelock.
>>>
>>> -- all system does, is executing ath_intr (if_ath.c) in the tight loop
>>> with the same status -- 0x1000 AKA HAL_INT_MIB. In order to eliminate
>>> possibility that I have caused livelock with the debug messages, I have
>>> put conditional panic() into ath_intr, as soon as sc->sc_stats.ast_mib
>>> reaches 10,000. Without any kind of the debug messages, it will be
>>> triggered within 40-60 seconds after starting of wpa_supplicant.
>>>
>>> -- I suspect that comment below, might not hold true on my equipment
>>> if (status & HAL_INT_MIB) {
>>>    sc->sc_stats.ast_mib++;
>>>    /*
>>>     * Disable interrupts until we service the MIB
>>>     * interrupt; otherwise it will continue to fire.
>>>     */
>>>    ath_hal_intrset(ah, 0);
>>>    /*
>>>     * Let the hal handle the event.  We assume it will <============
>>>     * clear whatever condition caused the interrupt.   <============
>>>     */
>>>     ath_hal_mibevent(ah, &sc->sc_halstats);
>>>     ath_hal_intrset(ah, sc->sc_imask);
>>> }
>>>
>>> My hardware is:
>>> ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413,
>>> RF5413)
>>> ath0: <Atheros 5212> mem 0xedf00000-0xedf0ffff irq 17 at device 0.0 on
>>> pci3
>>> ath0: [ITHREAD]
>>> ath0: using obsoleted if_watchdog interface
>>> ath0: Ethernet address: 00:16:cf:26:2f:3f
>>> ath0: mac 10.3 phy 6.1 radio 10.2
>>>
>>> My wpa_supplicant.conf is:
>>> ctrl_interface=/var/run/wpa_supplicant
>>> ctrl_interface_group=wheel
>>> eapol_version=2
>>>
>>> network={
>>>   ssid="XXXXXXXXXXX"
>>>   scan_ssid=1
>>>   priority=1
>>>   proto=WPA
>>>   pairwise=TKIP
>>>   group=TKIP
>>>   key_mgmt=WPA-PSK
>>>   psk="xxxxxxxxxxxxxxxxxxxxxx"
>>> }
>>>
>>> Access point is Netgear WNDR3300-1B with 11N and 11G SSID set up to
>>> different values. Only 11G SSID is configured in wpa_supplicant.conf. In
>>> the test setup, AP is with 10' (3m) from the laptop.
>>>
>>> AP is successfully used by handful of Windows clients (including this
>>> same laptop) and iBook G4.
>>>
>>> Neither wpa_supplicant with '-d -d' nor wlandebug 0xFFFFFFFF show
>>> anything but normal scan. 
>>>
>>> athdebug 0xFFFFFFFF loops with ath_intr: status 0x1000 until I power
>>> down my laptop.
>>>
>>> I would appreciate any suggestion on what I can investigate further --
>>> at this point I have comfortable console setup and should be able to
>>> field requests for further information much better.
>>>
>>>       
>> Are you running powerd?
>>     
> I do. And I just tried disabling it, and I could not reproduce the
> problem any more. Is there any way to reconcile if_ath with powerd?
>
>   

Don't know.  There appear to be two issues.  When the MIB interrupts 
arrive the kernel may service them w/ the cpu at a reduced clock 
frequency.  Since powered is currently the only mechanism for increasing 
the frequency and it runs in user space it can take a while to bump the 
clock rate leading to livelock (because the logic to reduce the _cause_ 
of the MIB interrupt takes a long time to run).  John Baldwin suggested 
raising the clock frequency when handling interrupts in the kernel but 
nothing has been done to make that happen.

Separately there is a question as to why the MIB interrupts are 
happening at all.  This is possibly due to misprogramming of the 
baseband h/w in the ath card.  Unfortunately I've been trying to get 
Atheros to help understand/resolve this question for a very long time 
(as their code also exhibits this behaviour) but they've been 
unresponsive.  I have some experimental code to address this in new hal 
versions (such as 0.10.5.6 available in http://www.freebsd.org/~sam) but 
apparently it does not entirely fix the problem.

    Sam