[Bug 258850] lagg failover crashes and burns out with em and ath

From: <bugzilla-noreply_at_freebsd.org>
Date: Sat, 02 Oct 2021 00:10:26 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=258850

            Bug ID: 258850
           Summary: lagg failover crashes and burns out with em and ath
           Product: Base System
           Version: 13.0-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: john.westbrook@gmail.com

I am having significant problems on FreeBSD 13.0 using lagg-failover with em0
and wlan0/ath0 on both my ThinkPad X220 and X230. Both laptops are running
Coreboot, with a Dell 7WCGT Bigfoot Killer Wireless (AR5BHB112; AR9380
chipset). Both em0 and wlan0/ath0 work fine when not used with lagg.

This problem has some similarities to bug #226549, but it can't be recovered
from in the same way.

The basic symptom is that the lagg0 interface often vanishes when both laggport
interfaces are inactive/unassociated--for example, (1) when not connected to
wired ethernet and the WiFi interface loses its association with the WiFi
access point, or (2) when unplugging from the wired network. This also often
happens at boot, when the lagg0 interface comes up but WiFi hasn't established
an association with the WiFi access point. Looking in dmesg after boot doesn't
shed much light:

lagg0: link state changed to DOWN
lagg0: link state changed to UP
lagg0: link state changed to DOWN

However, the problem isn't limited to WiFi; it also occurs when failing over
from wired. Once em0 goes down (e.g. cable unplugged, or ifconfig down), it
can't be brought back up, even independently of lagg0:

# ifconfig em0
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=800000<>
        ether XX:XX:XX:XX:XX:XX
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
# ifconfig em0 up
# ifconfig em0
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=800000<>
        ether XX:XX:XX:XX:XX:XX
        media: Ethernet autoselect
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
# ifconfig em0
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=800000<>
        ether XX:XX:XX:XX:XX:XX
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
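
Note that the flags line never shows UP, even after "ifconfig em0 up". A
minimal sketch for checking this from a script, assuming ifconfig output in
the format shown above is piped in on stdin (has_flag is a made-up helper
name, not an existing tool):

```sh
#!/bin/sh

# has_flag FLAG
# Reads `ifconfig <interface>` output on stdin and succeeds if FLAG appears
# in the comma-separated list between < and > on the first line.
has_flag() {
    flag=$1
    sed -n '1s/^[^<]*<\([^>]*\)>.*/\1/p' | tr ',' '\n' | grep -qx "$flag"
}
```

For example: ifconfig em0 | has_flag UP || echo "em0 is not UP"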

Here's my lagg configuration--almost identical to the man page:

wlans_ath0="wlan0"
ifconfig_wlan0="WPA"
ifconfig_em0="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="up laggproto failover laggport em0 laggport wlan0 DHCP"

except that I'm setting the MAC address via a hint in /boot/loader.conf:

hint.ath.0.macaddr="XX:XX:XX:XX:XX:XX"

I used the hint based on past threads discussing problems associated with
setting the MAC address on Atheros devices. However, it doesn't seem to make a
difference if I instead override the MAC address on em0 with the one from the
Atheros card. Also, the problem with lagg0 happens both when using DHCP and
when configured to use a static IP address.
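
For anyone scripting checks against a configuration like the one above, the
laggport members can be pulled out of the ifconfig_lagg0 value. This is only a
sketch: laggports is a hypothetical helper name, and it assumes the value
contains no shell glob characters, since it relies on plain word splitting.

```sh
#!/bin/sh

# laggports VALUE
# Prints, one per line, every word that follows a "laggport" keyword in
# VALUE (e.g. the output of `sysrc -n ifconfig_lagg0`).
laggports() {
    # Intentionally unquoted: split VALUE into positional parameters.
    set -- $1
    while [ $# -gt 0 ]; do
        if [ "$1" = "laggport" ] && [ $# -ge 2 ]; then
            printf '%s\n' "$2"
            shift
        fi
        shift
    done
}
```

With the configuration above, laggports "$(sysrc -n ifconfig_lagg0)" should
list em0 and wlan0, which can then be checked individually with ifconfig.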

When not connected to wired ethernet, and when the WiFi interface
stabilizes/associates, reconfiguring lagg0 from the command line is flaky.
Sometimes it works, sometimes not. Sometimes ifconfig shows lagg0 along with a
device-not-configured error, followed by lagg0 vanishing:

# ifconfig wlan0 down
# ifconfig
em0: flags=8c23<UP,BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=481249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER,NOMAP>
        ether XX:XX:XX:XX:XX:XX
        media: Ethernet autoselect
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wlan0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether XX:XX:XX:XX:XX:XX
        groups: wlan
        ssid "" channel 1 (2412 MHz 11g ht/20)
        regdomain 106 indoor ecm authmode WPA2/802.11i privacy ON
        deftxkey UNDEF AES-CCM 2:128-bit txpower 20 bmiss 7 scanvalid 60
        protmode CTS ampdulimit 64k ampdudensity 8 shortgi -uapsd wme burst
        roaming MANUAL
        parent interface: ath0
        media: IEEE 802.11 Wireless Ethernet autoselect (autoselect)
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
        groups: pflog
lagg0: flags=8802<BROADCAST,SIMPLEX,MULTICAST>
        ether XX:XX:XX:XX:XX:XX
ifconfig: SIOCGIFGROUP: Device not configured

# ifconfig lagg0 create
# ifconfig lagg0 up laggproto failover laggport wlan0 laggport em0
# ifconfig
em0: flags=8c22<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=481249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER,NOMAP>
        ether XX:XX:XX:XX:XX:XX
        media: Ethernet autoselect
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
wlan0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether XX:XX:XX:XX:XX:XX
        groups: wlan
        ssid "" channel 1 (2412 MHz 11g ht/20)
        regdomain 106 indoor ecm authmode WPA2/802.11i privacy ON
        deftxkey UNDEF AES-CCM 2:128-bit txpower 20 bmiss 7 scanvalid 60
        protmode CTS ampdulimit 64k ampdudensity 8 shortgi -uapsd wme burst
        roaming MANUAL
        parent interface: ath0
        media: IEEE 802.11 Wireless Ethernet autoselect (autoselect)
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
        groups: pflog
lagg0: flags=8802<BROADCAST,SIMPLEX,MULTICAST>
        ether XX:XX:XX:XX:XX:XX
ifconfig: SIOCGIFGROUP: Device not configured

Repeating the same operations sometimes yields success. I wrote a script that
helps make sense of the sequence of events in /var/log/messages:

#!/bin/sh

tag=$(basename "$0")

logger -t "$tag" "Checking lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 exists."
    exit 0
fi

logger -t "$tag" "Creating lagg0 ..."
if ifconfig lagg0 create; then
    logger -t "$tag" "lagg0 create success."
else
    logger -t "$tag" "lagg0 create failed."
    exit 1
fi

logger -t "$tag" "Configuring lagg0 ..."
params=$(sysrc -n ifconfig_lagg0 | sed 's/DHCP/up/')
if ifconfig lagg0 $params; then
    logger -t "$tag" "lagg0 config success."
else
    logger -t "$tag" "lagg0 config failed: $params"
    exit 2
fi

logger -t "$tag" "Postcheck(0) lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 postcheck success."
else
    logger -t "$tag" "lagg0 postcheck failed."
    exit 3
fi

sleep 10
logger -t "$tag" "Postcheck(1) lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 postcheck success."
else
    logger -t "$tag" "lagg0 postcheck failed."
    exit 4
fi

sleep 20
logger -t "$tag" "Postcheck(2) lagg0 ..."
if ifconfig lagg0; then
    logger -t "$tag" "lagg0 postcheck success."
else
    logger -t "$tag" "lagg0 postcheck failed."
    exit 5
fi
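
The fixed "sleep 10" and "sleep 20" postchecks above only sample lagg0 at two
points in time. A polling loop could narrow down when the interface vanishes.
The following is a sketch of that idea, not what I actually ran; poll_check is
a made-up name.

```sh
#!/bin/sh

# poll_check INTERVAL TRIES COMMAND [ARGS...]
# Runs COMMAND up to TRIES times, sleeping INTERVAL seconds between attempts.
# On the first failing attempt, prints the attempt number and returns 1.
# Returns 0 if COMMAND succeeded on every attempt.
poll_check() {
    interval=$1
    tries=$2
    shift 2
    n=1
    while [ "$n" -le "$tries" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "$n"
            return 1
        fi
        n=$((n + 1))
        # Skip the sleep after the final attempt.
        if [ "$n" -le "$tries" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For example, "poll_check 2 15 ifconfig lagg0" covers the same 30 seconds as
the two sleeps above, but reports the first 2-second step at which lagg0 was
no longer there.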

Here's an example of when the script succeeds:

Oct  1 10:27:08 x220a fix-lagg0[6783]: Checking lagg0 ...
Oct  1 10:27:08 x220a fix-lagg0[6788]: Creating lagg0 ...
Oct  1 10:27:08 x220a fix-lagg0[6793]: lagg0 create success.
Oct  1 10:27:08 x220a fix-lagg0[6797]: Configuring lagg0 ...
Oct  1 10:27:09 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-DISCONNECTED bssid=AA:AA:AA:AA:AA:AA reason=3 locally_generated=1
Oct  1 10:27:09 x220a kernel: lagg0: link state changed to DOWN
Oct  1 10:27:09 x220a kernel: wlan0: link state changed to DOWN
Oct  1 10:27:10 x220a fix-lagg0[6822]: lagg0 config success.
Oct  1 10:27:10 x220a fix-lagg0[6826]: Postcheck(0) lagg0 ...
Oct  1 10:27:10 x220a fix-lagg0[6831]: lagg0 postcheck success.
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: Trying to associate with AA:AA:AA:AA:AA:AA (SSID='FiOS-YLLQU-5G' freq=5765 MHz)
Oct  1 10:27:16 x220a kernel: ath0: ath_edma_recv_tasklet: sc_inreset_cnt > 0; skipping
Oct  1 10:27:16 x220a wpa_supplicant[347]: Failed to add supported operating classes IE
Oct  1 10:27:16 x220a wpa_supplicant[347]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Can't assign requested address
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: Associated with AA:AA:AA:AA:AA:AA
Oct  1 10:27:16 x220a kernel: wlan0: ieee80211_new_state_locked: pending AUTH -> ASSOC transition lost
Oct  1 10:27:16 x220a kernel: wlan0: ieee80211_new_state_locked: pending ASSOC -> RUN transition lost
Oct  1 10:27:16 x220a kernel: wlan0: link state changed to UP
Oct  1 10:27:16 x220a kernel: lagg0: link state changed to UP
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: WPA: Key negotiation completed with AA:AA:AA:AA:AA:AA [PTK=CCMP GTK=CCMP]
Oct  1 10:27:16 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-CONNECTED - Connection to AA:AA:AA:AA:AA:AA completed [id=0 id_str=]
Oct  1 10:27:20 x220a fix-lagg0[6852]: Postcheck(1) lagg0 ...
Oct  1 10:27:20 x220a fix-lagg0[6857]: lagg0 postcheck success.
Oct  1 10:27:50 x220a fix-lagg0[6878]: Postcheck(2) lagg0 ...
Oct  1 10:27:50 x220a fix-lagg0[6883]: lagg0 postcheck success.
Oct  1 10:27:51 x220a dhclient[6935]: New IP Address (lagg0): 192.168.1.86
Oct  1 10:27:52 x220a dhclient[6939]: New Subnet Mask (lagg0): 255.255.255.0
Oct  1 10:27:52 x220a dhclient[6943]: New Broadcast Address (lagg0): 192.168.1.255
Oct  1 10:27:52 x220a dhclient[6947]: New Routers (lagg0): 192.168.1.1

Notice that adding wlan0 as a laggport brings wlan0 down and triggers a
reassociation. Destroying lagg0 also takes down wlan0 and triggers a
reassociation:

Oct  1 10:32:30 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-DISCONNECTED bssid=AA:AA:AA:AA:AA:AA reason=3 locally_generated=1
Oct  1 10:32:33 x220a kernel: wlan0: link state changed to DOWN
Oct  1 10:32:33 x220a kernel: lagg0: link state changed to DOWN
Oct  1 10:32:33 x220a dhclient[6925]: Interface lagg0 is down, dhclient exiting
Oct  1 10:32:33 x220a dhclient[6925]: connection closed
Oct  1 10:32:33 x220a dhclient[6925]: exiting.
Oct  1 10:32:33 x220a root[7331]: /etc/rc.d/netif: WARNING: lagg0 does not exist.  Skipped.
Oct  1 10:32:40 x220a wpa_supplicant[347]: wlan0: Trying to associate with AA:AA:AA:AA:AA:AA (SSID='FiOS-YLLQU-5G' freq=5765 MHz)
Oct  1 10:32:40 x220a wpa_supplicant[347]: Failed to add supported operating classes IE
Oct  1 10:32:40 x220a wpa_supplicant[347]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Can't assign requested address
Oct  1 10:32:50 x220a wpa_supplicant[347]: wlan0: Authentication with AA:AA:AA:AA:AA:AA timed out.
Oct  1 10:32:50 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-DISCONNECTED bssid=AA:AA:AA:AA:AA:AA reason=3 locally_generated=1
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: Trying to associate with AA:AA:AA:AA:AA:AA (SSID='FiOS-YLLQU-5G' freq=5765 MHz)
Oct  1 10:32:57 x220a wpa_supplicant[347]: Failed to add supported operating classes IE
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: Associated with AA:AA:AA:AA:AA:AA
Oct  1 10:32:57 x220a kernel: wlan0: link state changed to UP
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: WPA: Key negotiation completed with AA:AA:AA:AA:AA:AA [PTK=CCMP GTK=CCMP]
Oct  1 10:32:57 x220a wpa_supplicant[347]: wlan0: CTRL-EVENT-CONNECTED - Connection to AA:AA:AA:AA:AA:AA completed [id=0 id_str=]
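
To compare the ordering of link events between excerpts like the two above,
the "link state changed" lines can be filtered out of /var/log/messages. A
rough sketch, assuming the syslog line format shown above (month, day, time,
host, "kernel:", interface); link_events is a made-up name.

```sh
#!/bin/sh

# link_events
# Reads syslog lines on stdin and prints "TIME INTERFACE STATE" for every
# kernel "link state changed to ..." message.
link_events() {
    awk '/link state changed to/ {
        iface = $6
        sub(/:$/, "", iface)    # strip the trailing colon from "lagg0:"
        print $3, iface, $NF
    }'
}
```

For example: link_events < /var/log/messages | tail -20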

The transcripts above are from my X220, but I've had the same symptoms on my
X230. Given that the problem happens on two machines and impacts both laggport
interfaces (em0 and WiFi), it seems like a lagg-related issue.
