ng_one2many v.s. AFT (NIC Fault Tolerance/Fail Over/Redundancy Revisited)

Jonathan Donaldson donaldson at cisco.com
Mon Oct 17 05:52:54 PDT 2005


Brian,

Thanks for this excellent stream of thoughts; it is a lot like what I
am trying to accomplish. Please see my comments and questions inline:


On Oct 15, 2005, at 7:25 PM, Brian A. Seklecki wrote:

> Re: http://lists.freebsd.org/pipermail/freebsd-questions/2005-October/100623.html
>
> First: This is all very preliminary from some testing over the  
> weekend.
>
> Dell's response was that Intel's AFT/ALB was entirely software based.
>
> That left me with few options:
> 1) Try userland layer 3 failover (ugly)
> 2) Use ng_one2many
>
> However, ng_one2many only permits two algorithms:
> NG_ONE2MANY_XMIT_ROUNDROBIN and NG_ONE2MANY_XMIT_ALL.
>
> However, neither of these meets the need:
> - Round-Robin results in 50% packet loss if a hook/interface is
> lost (not acceptable in any mission critical environment).
> - Xmit-All causes twice as much load to be placed on the
> switch/fabric and switch CPU.
>
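
The trade-off between the two algorithms can be put in numbers with a
small model. This is only a sketch of the behavior described above, not
ng_one2many code; the function names are made up for illustration.

```python
from itertools import cycle


def round_robin_loss(link_up, packets=1000):
    """Fraction of packets lost when frames are dealt out round-robin
    to every link, whether or not the link is actually up."""
    links = cycle(range(len(link_up)))
    lost = 0
    for _ in range(packets):
        if not link_up[next(links)]:
            lost += 1
    return lost / packets


def xmit_all_frames(link_up, packets=1000):
    """Frames handed to the switches when every packet is copied
    to every live link."""
    return packets * sum(link_up)


# Two links, one dead: round-robin loses every second packet.
print(round_robin_loss([True, False]))
# Two live links: xmit-all doubles the number of frames on the wire.
print(xmit_all_frames([True, True]))
```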

<jonathan> As of FreeBSD 6_0 (which is at RC1 now), NG_ONE2MANY
does handle the failure of a link without ending up with 50%
packet loss. There is new code in the One2Many module that xmits a
layer 2 "I'm alive" broadcast out all links. As long as this is
picked up on the other links, all interfaces are considered
alive. If one of the packets is not received, then after 2 x
heartbeat duration that link is considered "down". I have tested this
in the 6.0 code and it works, with one caveat: when the server is
brought up, both interfaces must be connected and live, or for some
reason the failure algorithm never seems to kick in. I saw exactly
what you saw in 5.4 and newer with regards to the 50% packet loss.
</jonathan>
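
For anyone who wants to reason about the "2 x heartbeat" rule, here is
a rough sketch of the timeout logic as I understand it. It is not the
kernel code; the class, the interface names em0/em1 and the explicit
"now" timestamps are assumptions made purely for illustration.

```python
class LinkMonitor:
    """Marks a link down once no heartbeat has been seen on it for
    more than two heartbeat intervals."""

    def __init__(self, links, heartbeat_interval, start=0.0):
        self.interval = heartbeat_interval
        # Every link is treated as freshly seen at start-up.
        self.last_seen = {name: start for name in links}

    def heartbeat_received(self, link, now):
        """Record that an "I'm alive" frame arrived on this link."""
        self.last_seen[link] = now

    def is_up(self, link, now):
        return (now - self.last_seen[link]) <= 2 * self.interval

    def live_links(self, now):
        return [name for name in self.last_seen if self.is_up(name, now)]


monitor = LinkMonitor(["em0", "em1"], heartbeat_interval=1.0)
monitor.heartbeat_received("em0", now=3.0)
monitor.heartbeat_received("em1", now=1.0)
# At t=3.5 em1 has been silent for 2.5 s, which exceeds 2 x 1.0 s.
print(monitor.live_links(now=3.5))
```

Passing the time in explicitly keeps the sketch deterministic; a real
monitor would read a clock instead.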


> What ng_one2many needs is an "Active-Standby" XMIT algorithm (STP
> BOFH's will think BLOCKING/FORWARDING).  It could even be used on  
> top of other NetGraph nodes like ng_fec or possibly (hopefully)  
> ng_802.3ad >:}
>

<jonathan> I agree </jonathan>
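
To make the idea concrete, an active-standby selection could look
roughly like the sketch below. No such algorithm exists in ng_one2many
as far as this thread establishes; the function name and the choice
not to fail back automatically are my assumptions.

```python
def pick_active(links, link_up, current=None):
    """Choose the single link to transmit on.

    links    -- link names in priority order
    link_up  -- dict mapping link name to True/False
    current  -- link in use now, kept while it stays up (no fail-back)
    """
    if current is not None and link_up.get(current, False):
        return current
    for name in links:
        if link_up.get(name, False):
            return name
    return None  # every link is down


links = ["rl0", "rl1"]
active = pick_active(links, {"rl0": True, "rl1": True})
# rl0 fails: traffic moves to rl1 and stays there after rl0 recovers.
active = pick_active(links, {"rl0": False, "rl1": True}, current=active)
active = pick_active(links, {"rl0": True, "rl1": True}, current=active)
print(active)
```

Staying on the standby after the primary returns avoids a second
interruption; a preemptive variant would simply drop the first check.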

> Essentially, a single layer 3 IP address needs to be visible in a  
> "switch fault tolerant" or "adapter fault tolerant" configuration.   
> A userland-level daemon could be scripted, and it has been done  
> before:
>
> http://lists.freebsd.org/pipermail/freebsd-isp/2003-November/001314.html
>
> So when a fail-over occurs, the layer 3 IP address moves from one
> layer 2 MAC address to another layer 2 MAC address on the same  
> machine (and same subnet, same ethernet segment, just a different  
> interface).  TCP sockets should not be affected due to layer  
> abstraction.
>
> This got me thinking about HSRP/VRRP.  That protocol is designed  
> strictly to move a layer 3 address between two different hosts.    
> Excellent applications are Router/Firewall and VPN concentrator, as  
> OpenBSD's carp(4) has implemented with the help of pfsync.  I was  
> experimenting with the OpenBSD variant and I realized that client  
> hosts weren't seeing the usual warnings about MAC address changes.
>
> As of 3.7, OpenBSD's CARP shares a virtual MAC address between the  
> hosts, Cisco's HSRP does not.
>

<jonathan> Will CARP work between two interfaces on the same server?
I always thought it was positioned for two separate devices. If it
did work on the one server, the only downside I can really see is
having to have 3 IP addresses for each 1 real address you want. Also,
there might be some issues with getting this to work with Jails.
</jonathan>

> Then I was thinking about the OpenBSD/NetBSD bridge(4) interface.
> If the host acting as the bridge wishes to, it can participate in
> the bridged networks by assigning a layer 3 address.  The address
> isn't ifconfig(8)'d to the "bridge0" interface.  Instead, it's
> assigned to the first interface included in the "bridge[0-9]", say
> fxp0.
>
> Furthermore, regardless of which network segment/port a host
> participating in a bridge(4)'d network resides on, the MAC address
> ARP'd for the IP of the OpenBSD/NetBSD host is persistently that of
> the first physical interface ifconfig(8)'d with the IP.
>
> Plus OpenBSD/NetBSD bridge(4) supports 802.1d spanning tree >:}
>
> This is important.  Spanning Tree as an algorithm could provide
> Intel AFT "Fault Tolerance" intelligence if the persistent layer 2
> address of a host was unchanged with the NIC interface change.  The
> function of STP is to provide a loop free path to every layer 2 MAC
> in a segment.  But an STP enabled bridge(4) with an IP address
> assigned has a persistent MAC address associated with a layer 3
> address!
>

<jonathan> We tried the NG_bridge option with 5.4, and while it did
seem to work, its failover was a bit slower than we would have
liked, and there were some other quirks we saw in NG_bridge, but I
should go back to look at the built-in bridge function. The other
issue I see here is that we will have 6 servers connected this way,
and if we have a link failure, what will happen to the rest during
STP convergence... </jonathan>
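
As a rough guide to how long that convergence takes, classic 802.1D
(no RSTP, no PortFast/UplinkFast style shortcuts) walks a blocked port
through listening and learning, each lasting one forward delay, and an
indirect failure first has to wait for max age to expire. The sketch
below just does that arithmetic with the timers from the bridge0 and
switch output quoted in this message (maxage 20, fwddelay 15); the
function name is made up.

```python
def stp_convergence_seconds(max_age, forward_delay, direct_failure):
    """Worst-case time for a blocked port to start forwarding under
    classic 802.1D timers."""
    listening_and_learning = 2 * forward_delay
    if direct_failure:
        # Loss of carrier on the root port is noticed immediately.
        return listening_and_learning
    # An indirect failure is only noticed once stored BPDUs age out.
    return max_age + listening_and_learning


print(stp_convergence_seconds(20, 15, direct_failure=True))
print(stp_convergence_seconds(20, 15, direct_failure=False))
```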

> Therefore, the solution has been there all along.  The attached  
> diagram explains in greater detail.
>
> http://digitalfreaks.org/~lavalamp/OpenBSD_Bridge_AFT.png
>
> In this diagram, switch 0 is configured manually as the spanning
> tree root and switch 1 is the backup spanning tree root.  By
> default, rl0 will be BLOCKING and rl1 will be FORWARDING.
> However, as tcpdump(8) illustrates, regardless of which interface
> is the root port, ARP replies will always return the MAC of the
> bridge(4) member interface ifconfig(8)'d with the IP.
>
> rl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
>     address: 00:50:fc:9d:24:d6
>     media: Ethernet autoselect (100baseTX full-duplex)
>     status: active
>     inet 192.168.100.1 netmask 0xffffff00 broadcast 192.168.100.255
>
> rl1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
>     address: 00:50:fc:9d:08:cd
>     media: Ethernet autoselect (100baseTX full-duplex)
>     status: active
>
> ---
>
> bridge0: flags=41<UP,RUNNING>
>     Configuration:
>         priority 32768 hellotime 2 fwddelay 15 maxage 20
>     Interfaces:
>         rl1 flags=b<LEARNING,DISCOVER,STP>
>             port 2 ifpriority 128 ifcost 55 forwarding
>         rl0 flags=b<LEARNING,DISCOVER,STP>
>             port 1 ifpriority 128 ifcost 55 blocking
>     Addresses (max cache: 100, timeout: 240):
>         00:01:63:bb:f7:c9 rl1 1 flags=0<>
>         00:0f:1f:c1:f2:b7 rl1 1 flags=0<>
> -----
> # tcpdump -i rl1 -n arp
> 12:38:17.806885 arp who-has 192.168.100.1 tell 192.168.100.254
> 12:38:17.806951 arp reply 192.168.100.1 is-at 0:50:fc:9d:24:d6
> 12:38:17.806966 arp reply 192.168.100.1 is-at 0:50:fc:9d:24:d6
>
> bs0#sh spanning-tree vlan 11 interface fa0/9
>
> Spanning tree 11 is executing the IEEE compatible Spanning Tree protocol
>   Bridge Identifier has priority 100, address 0001.63bb.f7c2
>   Configured hello time 2, max age 20, forward delay 15
>   We are the root of the spanning tree
>   Topology change flag not set, detected flag not set, changes 54
>   Times:  hold 1, topology change 35, notification 2
>           hello 2, max age 20, forward delay 15
>   Timers: hello 0, topology change 0, notification 0
>
> Interface Fa0/9 (port 22) in Spanning tree 11 is FORWARDING
>    Port path cost 19, Port priority 128
>    Designated root has priority 100, address 0001.63bb.f7c2
>    Designated bridge has priority 100, address 0001.63bb.f7c2
>    Designated port is 22, path cost 0
>    Timers: message age 0, forward delay 0, hold 0
>    BPDU: sent 10592, received 30
>
>
> bs1#sh spanning-tree vlan 11 interface fa0/9
>
> Spanning tree 11 is executing the IEEE compatible Spanning Tree protocol
>   Bridge Identifier has priority 32768, address 0002.fd0e.f382
>   Configured hello time 2, max age 20, forward delay 15
>   Current root has priority 100, address 0001.63bb.f7c2
>   Root port is 38, cost of root path is 19
>   Topology change flag not set, detected flag not set, changes 54
>   Times:  hold 1, topology change 35, notification 2
>           hello 2, max age 20, forward delay 15
>   Timers: hello 0, topology change 0, notification 0
>
> Interface Fa0/9 (port 22) in Spanning tree 11 is FORWARDING
>    Port path cost 19, Port priority 128
>    Designated root has priority 100, address 0001.63bb.f7c2
>    Designated bridge has priority 32768, address 0002.fd0e.f382
>    Designated port is 22, path cost 19
>    Timers: message age 0, forward delay 0, hold 0
>    BPDU: sent 45454, received 1196
>
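
Why bs0 reports "We are the root" follows from the 802.1D election
rule: the bridge with the numerically lowest bridge ID wins, comparing
priority first and MAC address second. A small sketch using the two
bridge identifiers from the switch output above (the helper names are
hypothetical):

```python
def bridge_id(priority, mac):
    """Comparable bridge ID: priority first, then the MAC as a number."""
    return (priority, int(mac.replace(".", "").replace(":", ""), 16))


def elect_root(bridges):
    """Return the name of the bridge with the lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


bridges = {
    "bs0": (100, "0001.63bb.f7c2"),
    "bs1": (32768, "0002.fd0e.f382"),
}
print(elect_root(bridges))
```

With both switches left at the default priority of 32768, the lower
MAC address alone would decide the election.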
> bs0#sh mac-address-table | include 24d6
> 0050.fc9d.24d6       Dynamic         11  FastEthernet0/9
>
> bs1#sh mac-address-table | include 24d6
> 0050.fc9d.24d6       Dynamic         11  FastEthernet0/24
>
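
The same MAC shows up in three notations in the output above
(ifconfig, tcpdump and the Cisco MAC table), which makes matching by
eye error-prone. A small helper like this one, purely illustrative,
normalizes them to one form for comparison:

```python
import re


def normalize_mac(text):
    """Convert Cisco dotted (0050.fc9d.24d6), colon or dash separated
    MACs, including tcpdump's short octets (0:50:...), to aa:bb:cc:dd:ee:ff."""
    if "." in text:
        digits = text.replace(".", "")
    else:
        # Pad single-digit octets such as "0" to "00".
        digits = "".join(part.zfill(2) for part in re.split("[:-]", text))
    digits = digits.lower()
    if not re.fullmatch("[0-9a-f]{12}", digits):
        raise ValueError("not a MAC address: %r" % text)
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


# ifconfig, tcpdump and Cisco notations of rl0's address
seen = ["00:50:fc:9d:24:d6", "0:50:fc:9d:24:d6", "0050.fc9d.24d6"]
print({normalize_mac(mac) for mac in seen})
```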
> The behavior is similar in FreeBSD using ng_bridge(4) (I haven't
> tried FreeBSD bridge(4)).  However, both of these claim "a
> primitive loop prevention algorithm"; 'debug stp events' shows
> no STP traffic from a FreeBSD host, though.
>
> Also, FreeBSD differs in behavior in that the MAC address ARP'd is
> that of whichever NG node bridge member is assigned the IP.
>
> The disadvantage is that without FreeBSD speaking 802.1d, it can't
> know to fail an interface on any event other than a media state
> change, i.e., the currently active port could be connected to a
> switch that loses its uplink. Of course, neither the FreeBSD nor the
> NetBSD/OpenBSD implementation features a "heartbeat" algorithm to add
> intelligence, as Intel AFT/ALB might, but that wasn't the principal
> design goal.
>
> Also, my initial tests are with managed switches using PVST.   
> Behavior may differ with unmanaged switches where no STP debugging  
> is possible or possibly a uni-stp is used.
>
> More on this on Monday...
>
> http://www.cisco.com/application/pdf/en/us/guest/netsol/ns304/c649/cdccont_0900aecd800ea162.pdf
>
> ~BAS
>

