amd64/156408: Routing failure when using VLANs vs. Physical ethernet interfaces.

Thu Apr 14 21:00:25 UTC 2011

>Number:         156408
>Category:       amd64
>Synopsis:       Routing failure when using VLANs vs. Physical ethernet interfaces.
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-amd64
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Apr 14 21:00:19 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Thomas Johnson
>Release:        FreeBSD 8.2-RELEASE amd64
>Organization:
ClaimLynx, Inc.
>Environment:
System: FreeBSD jaguar-2.claimlynx.com 8.2-RELEASE FreeBSD 8.2-RELEASE #8: Sat Feb 26 21:23:00 CST 2011 root@jaguar-2.claimlynx.com:/usr/obj/usr/src/sys/GENERIC-CARP amd64

>Description:

I have discovered some odd routing behavior that seems to occur when VLANs are used as members of a bridge. Specifically, a static route whose gateway is on the bridged subnet does not seem to function correctly.

Some background: I am building a new host to replace our aging firewall (running 8.0). The new machine has a single Ethernet interface (re driver, although over the course of troubleshooting I have also used sk and igb adapters), so I am using VLANs to segment traffic. The 'LAN' VLAN uses interface vlan500, and the 'WAN' VLAN uses vlan200. The firewall also has an OpenVPN tunnel to our data center, operating in bridged mode on interface tap0. vlan500 and tap0 are both members of bridge0, which puts the LANs at our office and data center on the same subnet, 172.31.0.0/16.

In this configuration, I am able to connect from the office LAN to hosts on the data center LAN. The OpenVPN server at the data center (a separate host from the firewall) pushes a route for the data center production subnet on connect. The logical configuration looks something like this:

(office lan)<->[vlan500|bridge0|tap0]<-vpn->(dc lan)<->[dc firewall]<->(dc production subnet)
               [      firewall      ]
[      common 172.31.0.0/16 subnet throughout      ]                   [ 100.100.100.128/26 ]

For the sake of reference, here are the relevant IP addresses:

172.31.0.252	- local firewall vlan500
172.31.0.254	- local firewall lan carp
172.31.5.1	- data center firewall
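
For readers trying to reproduce the layout, a minimal rc.conf sketch of the LAN side might look like the following. This is not the actual configuration from the affected host: the parent interface name re0, the netmask spelling, and creating tap0 at boot (rather than letting OpenVPN create it) are all assumptions, and the CARP and WAN settings are omitted.

```shell
# Hypothetical /etc/rc.conf fragment (illustrative only, not taken from the host).
# Assumes re0 is the parent Ethernet interface carrying the tagged VLANs.
ifconfig_re0="up"
cloned_interfaces="vlan500 tap0 bridge0"
# LAN VLAN (tag 500) holding the firewall's own address on the shared /16.
ifconfig_vlan500="inet 172.31.0.252 netmask 255.255.0.0 vlan 500 vlandev re0"
# Bridge the LAN VLAN with the OpenVPN tap device.
ifconfig_bridge0="addm vlan500 addm tap0 up"
```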

The problem is with the route to the production subnet at the data center. When the OpenVPN connection comes up, the route is installed in the routing table as expected. However, attempts to connect to hosts on this network fail immediately, without even a 'host unreachable' error. For example:

~-> ping hostfoo
PING hostfoo.claimlynx.com (100.100.100.149): 56 data bytes
ping: sendto: Invalid argument

Here is the output of 'netstat -rn' on this host:

root@shawshank-1:~-> netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            10.8.20.1          UGS         4   124778 vlan20
172.31.0.0/16      link#12            U           3    56103 vlan50
172.31.0.252       link#12            UHS         0        0    lo0
172.31.0.254       link#13            UH          0        0 carp10
172.31.3.5         link#8             UHS         0        0    lo0
10.8.20.0/24       link#9             U           0       33 vlan20
10.8.20.252        link#9             UHS         0        0    lo0
10.8.20.254        link#14            UH          0        0 carp20
10.8.30.0/24       link#10            U           0        0 vlan30
10.8.30.252        link#10            UHS         0        0    lo0
10.8.30.254        link#15            UH          0        0 carp30
10.8.40.0/24       link#11            U           0        0 vlan40
10.8.40.252        link#11            UHS         0        0    lo0
127.0.0.1          link#7             UH          0        0    lo0
100.100.100.128/26 172.31.5.1         UGS         0    21466   tap0

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               ::1                           UH          lo0
fe80::%lo0/64                     link#7                        U           lo0
fe80::1%lo0                       link#7                        UHS         lo0
ff01:7::/32                       fe80::1%lo0                   U           lo0
ff02::%lo0/32                     fe80::1%lo0                   U           lo0

As you can see, the routing table shows the 172.31.0.0/16 subnet route on the vlan500 interface (shown as vlan50 in the Netif column, which appears to be truncated) and puts the 100.100.100.128/26 production subnet route on the tap0 interface. While troubleshooting this, my hunch was that the system was choking because the next hop for the production route is on a network (172.31.0.0/16) that the routing table does not consider reachable via tap0 (in reality it is reachable that way). To test this, I inserted a host route for the next hop:

route add 172.31.5.1 -interface tap0

Adding this route resolves the condition, but it feels like a hack. For comparison, the firewall that I am replacing uses the same LAN/bridge/tap setup, but that machine has physical Ethernet interfaces for all segments rather than the VLANs that my new setup uses. The existing setup works fine without the host route. Here is the routing table for the existing firewall:

tom@shawshank:~-> netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            74.95.66.26        UGS         7  5043426   fxp2
172.31.0.0/16      link#2             U           4 70728235   fxp1
172.31.0.1         link#2             UHS         0  3870772    lo0
172.31.3.4         link#8             UHS         0        0    lo0
74.95.66.24/30     link#3             U           0     1243   fxp2
74.95.66.25        link#3             UHS         0        9    lo0
127.0.0.1          link#6             UH          0  1140570    lo0
192.168.50.0/24    link#1             U           0        0   fxp0
192.168.50.4       link#1             UHS         0        0    lo0
100.100.100.128/26 172.31.5.1         UGS         0    19877   fxp1

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               ::1                           UH          lo0
fe80::%lo0/64                     link#6                        U           lo0
fe80::1%lo0                       link#6                        UHS         lo0
ff01:6::/32                       fe80::1%lo0                   U           lo0
ff02::%lo0/32                     fe80::1%lo0                   U           lo0

The noteworthy difference between the two routing tables is that the production route on the old firewall is attached to the LAN interface (fxp1), whereas on the new firewall it is attached to tap0.
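
The comparison between the two tables can be made mechanical. The following is a rough sketch, assuming the six-column IPv4 layout printed above (Netif in the sixth field, Expire empty); the function names netif_for and check_route_netif are made up for illustration and are not part of any FreeBSD tool.

```shell
#!/bin/sh
# netif_for DEST: read `netstat -rn` output on stdin and print the Netif
# column (assumed to be the 6th field) of the row whose destination is DEST.
netif_for() {
    awk -v dest="$1" '$1 == dest { print $6; exit }'
}

# check_route_netif LAN_NET GW_NET: hypothetical check that the gateway route
# GW_NET is attached to the same interface as the connected route LAN_NET.
check_route_netif() {
    table=$(cat)
    lan_if=$(printf '%s\n' "$table" | netif_for "$1")
    gw_if=$(printf '%s\n' "$table" | netif_for "$2")
    if [ "$lan_if" = "$gw_if" ]; then
        echo "ok: $2 uses $gw_if"
    else
        echo "mismatch: $1 on $lan_if, $2 on $gw_if"
    fi
}
```

On either firewall this could be run as `netstat -rn -f inet | check_route_netif 172.31.0.0/16 100.100.100.128/26`. It only compares exact destination strings and does no prefix matching, so it is a quick sanity check rather than a real route lookup.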

>How-To-Repeat:

The problem occurs every time this host is booted and the OpenVPN connection comes up.

>Fix:

The workaround I have found is to add a host route for the next hop via the tap0 interface. This works, but I want to make sure that it is not masking a bug in the vlan driver or elsewhere.
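
Until the underlying cause is understood, the workaround could be wrapped in a small helper so it can be re-applied each time the tunnel comes up. This is only a sketch: the function name add_nexthop_route and the DRY_RUN variable are hypothetical, and how it gets invoked (for example from an OpenVPN up script) is an assumption rather than something tested on the affected host.

```shell
#!/bin/sh
# Hypothetical helper (names are illustrative, not from the report).
# add_nexthop_route GATEWAY IFNAME: add a host route for GATEWAY via IFNAME,
# mirroring the manual workaround. With DRY_RUN=1 the command is only printed.
add_nexthop_route() {
    gw="$1"
    ifname="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "route add $gw -interface $ifname"
    else
        # Needs root; fails if the route already exists.
        route add "$gw" -interface "$ifname"
    fi
}
```

For the addresses in this report the call would be `add_nexthop_route 172.31.5.1 tap0`; setting DRY_RUN=1 first shows the command without touching the routing table.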

>Release-Note:
>Audit-Trail:
>Unformatted: