multi-homed systems stop answering ARP on local addresses w/ifconfig aliases

Sun May 17 20:25:40 UTC 2009

There seems to be a regression between 6.x and 7.0/7.1 related to
ifconfig aliases on multi-homed hosts. I'm not sure about anything newer
than 7.1 (this is pfSense, and we're just starting to test 7.2 builds).
For periods of time, the system stops answering ARP for some of its own
addresses, and hence everything on that network completely stops
functioning. The same setup worked fine on 6.2.

The particular system illustrated here is a router on part of an ISP's
network. The IPs are all public; in the information provided here
they've been replaced with 10.x addresses. The subnets on the inside
interfaces are routed to the outside interface. When this problem
occurs, the IPs assigned locally on the system still respond from the
Internet, but the system itself loses all connectivity with the
affected subnet, and nothing on that subnet can communicate with the
host due to the lack of ARP. That makes some sense: I presume that when
routing to a locally assigned address via another interface, the system
doesn't need ARP on the address to respond. But while the address still
responds from the Internet, even the host itself can't initiate a ping
to that IP. It behaves the same whether pf is enabled or disabled.

I found two similar reports from the past. One has a PR:
http://www.freebsd.org/cgi/query-pr.cgi?pr=121437&cat=
That is exactly the same issue, except that it's not limited to VLANs;
any multi-homed host is affected.

And another:
http://thread.gmane.org/gmane.os.freebsd.stable/57125

fxp0 is the outside interface. It doesn't make any difference whether
the ifconfig aliases are on em0 or fxp1; both interfaces behave the
same way if they have any ifconfig aliases assigned.

# ifconfig
fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8<VLAN_MTU>
        ether 00:90:27:86:8b:9d
        inet6 fe80::290:27ff:fe86:8b9d%fxp0 prefixlen 64 scopeid 0x1
        inet 10.11.185.146 netmask 0xfffffff8 broadcast 10.11.185.151
        media: Ethernet 100baseTX <full-duplex>
        status: active
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
        ether 00:11:43:2c:62:03
        inet 10.10.0.1 netmask 0xffffff00 broadcast 10.10.0.255
        inet6 fe80::211:43ff:fe2c:6203%em0 prefixlen 64 scopeid 0x2
        inet 10.13.40.1 netmask 0xffffff00 broadcast 10.13.40.255
        inet 10.13.41.1 netmask 0xffffff00 broadcast 10.13.41.255
        inet 10.13.42.1 netmask 0xffffff00 broadcast 10.13.42.255
        inet 10.13.43.1 netmask 0xffffff00 broadcast 10.13.43.255
        inet 10.13.44.1 netmask 0xffffff00 broadcast 10.13.44.255
        inet 10.13.45.1 netmask 0xffffff00 broadcast 10.13.45.255
        inet 10.13.46.1 netmask 0xffffff00 broadcast 10.13.46.255
        inet 10.13.47.1 netmask 0xffffff00 broadcast 10.13.47.255
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
fxp1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8<VLAN_MTU>
        ether 00:d0:b7:5d:25:9f
        inet 10.1.242.1 netmask 0xffffff00 broadcast 10.1.242.255
        inet6 fe80::2d0:b7ff:fe5d:259f%fxp1 prefixlen 64 scopeid 0x3
        inet 10.1.243.1 netmask 0xffffff00 broadcast 10.1.243.255
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
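
As a sanity check on the configuration above, here is a minimal Python
sketch (not part of the original setup; the helper names
network_from_ifconfig and on_link are made up for illustration) that
converts ifconfig-style hex netmasks into networks and tests whether an
address falls inside one of them. Only a few of the addresses from the
listing are copied in. By this arithmetic, 10.10.0.1 and 10.10.0.30 both
sit inside em0's 10.10.0.0/24.

```python
import ipaddress

# (address, hex netmask) pairs copied from the ifconfig output above.
EM0_ADDRS = [
    ("10.10.0.1", "0xffffff00"),
    ("10.13.40.1", "0xffffff00"),
]
FXP0_ADDRS = [
    ("10.11.185.146", "0xfffffff8"),
]


def network_from_ifconfig(addr, hexmask):
    """Build an IPv4Network from an address and an ifconfig-style hex mask."""
    mask = ipaddress.IPv4Address(int(hexmask, 16))
    # strict=False lets us pass a host address instead of the network address.
    return ipaddress.ip_network(f"{addr}/{mask}", strict=False)


def on_link(target, configured):
    """Return the configured networks that contain the target address."""
    ip = ipaddress.ip_address(target)
    networks = [network_from_ifconfig(a, m) for a, m in configured]
    return [net for net in networks if ip in net]


if __name__ == "__main__":
    for target in ("10.10.0.1", "10.10.0.30"):
        print(target, "->", on_link(target, EM0_ADDRS))
```

This only checks the address arithmetic; it says nothing about what the
kernel's routing table or ARP table contains at the time of a failure.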

When the problem is occurring, you can't even ping the affected locally 
assigned addresses from the box itself:
# ping 10.10.0.1
PING 10.10.0.1 (10.10.0.1): 56 data bytes
ping: sendto: Network is unreachable
ping: sendto: Network is unreachable
ping: sendto: Network is unreachable

And when trying to ping something on one of the affected attached 
subnets, you get:
# ping 10.10.0.30
PING 10.10.0.30 (10.10.0.30): 56 data bytes
ping: sendto: Invalid argument
ping: sendto: Invalid argument

In the logs, you get a flood of these messages:
May 14 02:55:12    kernel: arpresolve: can't allocate route for 10.10.0.1
May 14 02:55:12    kernel: arplookup 10.10.0.1 failed: host is not on local network
May 14 02:55:12    kernel: arpresolve: can't allocate route for 10.10.0.1
May 14 02:55:12    kernel: arplookup 10.10.0.1 failed: host is not on local network
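
To see which addresses are affected during an episode, the messages can
be tallied per address. The following Python sketch is only an
illustration: the function name count_arp_failures is hypothetical, and
it assumes each kernel message occupies a single line in the real log
file (any mid-message line breaks in this mail come from wrapping).

```python
import re
from collections import Counter

# Matches the two kernel messages quoted above and captures the IPv4 address.
ARP_FAIL = re.compile(
    r"kernel: (?:arpresolve: can't allocate route for|arplookup) "
    r"(\d+\.\d+\.\d+\.\d+)"
)


def count_arp_failures(lines):
    """Count arpresolve/arplookup failure messages per IP address."""
    counts = Counter()
    for line in lines:
        match = ARP_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two of the messages from the log excerpt above, one per line.
SAMPLE = [
    "May 14 02:55:12    kernel: arpresolve: can't allocate route for 10.10.0.1",
    "May 14 02:55:12    kernel: arplookup 10.10.0.1 failed: host is not on local network",
]

if __name__ == "__main__":
    for address, hits in count_arp_failures(SAMPLE).most_common():
        print(address, hits)
```

Run against the full log, this would show whether the primary address
and the aliases fail together or at different times.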

It happens with both the primary IP assigned to the interface and the
aliases, but not with all of them at once. Some of the addresses
continue to work while others are failing. Somehow the system thinks
that locally assigned IPs are not on a local network. After a couple of
minutes, it just starts working again without any changes being made or
the system even being touched.

If I can provide any additional information, please let me know.

thanks,
Chris