Possible CARP bug?

Viktor Petersson petersson at gmail.com
Fri Mar 18 16:12:52 UTC 2011


Hey guys,

First, a big thanks to the developers for all the hard work. You guys rock!

Now to the issue. I've been using CARP on a few servers in the past, and it usually works without any hiccups. Now I'm planning to move our company's infrastructure from physical hardware to a virtual environment over at CloudSigma (http://www.cloudsigma.com). Unfortunately, I'm having trouble getting CARP to work there. For the record, they're using QEMU as the virtualization platform.

Let me start by describing my setup in more detail.

I have two nodes: nas0 and nas1. Both these nodes have two interfaces, one public and one private. I'm obviously using the private one for CARP. nas0 is using the IP 192.168.1.11 and nas1 is using the IP 192.168.1.12. The CARP interface is configured to use the IP 192.168.1.10. The internal network is using a dedicated VLAN. Only these two nodes are using this VLAN to eliminate any possible conflicts.

I've also disabled all software firewalls, so we should also be able to exclude that from the equation.

Both nodes are running FreeBSD 8.2, and both the internal and external interfaces are working (i.e., the two nodes can ping each other on the private interfaces).

In rc.conf on nas0, I have the following lines:
	cloned_interfaces="carp0"
	ifconfig_carp0="vhid 1 pass foobar 192.168.1.10/24"

On nas1 (which is the failover), the equivalent lines are:
	cloned_interfaces="carp0"
	ifconfig_carp0="vhid 1 advskew 100 pass foobar 192.168.1.10/24"
(note the advskew value on nas1)
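
For reference, advskew is measured in 1/256 of a second and is added to advbase, so the node with the lower value advertises more often and should win the election. A rough sketch of the arithmetic in sh (the function name carp_interval_ms is made up for illustration, and the integer division truncates the result):

```shell
#!/bin/sh
# Hypothetical helper: advertisement interval in milliseconds,
# using interval = advbase + advskew/256 seconds.
carp_interval_ms() {
	advbase=$1
	advskew=$2
	echo $(( $advbase * 1000 + $advskew * 1000 / 256 ))
}

carp_interval_ms 1 0     # nas0: prints 1000
carp_interval_ms 1 100   # nas1: prints 1390
```

With these settings nas0 advertises roughly every 1.0 s and nas1 roughly every 1.39 s, so nas0 is the one that should end up as MASTER.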

To verify that CARP is enabled and configured etc., here's the sysctl output (same on both nodes):
	net.inet.carp.allow: 1
	net.inet.carp.preempt: 1
	net.inet.carp.log: 1
	net.inet.carp.arpbalance: 0
	net.inet.carp.suppress_preempt: 0

Normally, that should be it. nas0 should automatically become the master, and nas1 the backup/failover. Unfortunately, that doesn't happen. Instead, I get this on the node with the lowest advskew value (nas0, but if I raise the advskew on nas0, the error moves to nas1):
	Mar  7 14:42:57 nas0 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
	Mar  7 14:42:57 nas0 kernel: carp0: 2 link states coalesced
	Mar  7 14:42:57 nas0 kernel: carp0: link state changed to DOWN

When checking the CARP interface status, I get the following on nas0:
	carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
       	inet 192.168.1.10 netmask 0xffffff00
       	carp: BACKUP vhid 1 advbase 1 advskew 0

and the following on nas1:
	carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
       	inet 192.168.1.10 netmask 0xffffff00
       	carp: BACKUP vhid 1 advbase 1 advskew 100
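
So both nodes sit in BACKUP and nobody holds the address. A small sketch for pulling the state out of the ifconfig output on each node (carp_state is a hypothetical helper name; it only looks at the `carp:` line in the format shown above):

```shell
#!/bin/sh
# Hypothetical helper: print the CARP state (e.g. MASTER or BACKUP)
# from ifconfig output supplied on stdin.
carp_state() {
	awk '$1 == "carp:" { print $2 }'
}

# Sample line copied from the nas0 output above.
printf '\tcarp: BACKUP vhid 1 advbase 1 advskew 0\n' | carp_state
```

On a live node this would be run as `ifconfig carp0 | carp_state`.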

I've googled this error (carp0: 2 link states coalesced), and some of the forum posts mention seeing it with faulty NICs or switches. However, I've reached out to CloudSigma, and they've been very helpful and set up a replica of my setup (but on 8.1). Their head network guy was able to reproduce the same errors I got, and he was also able to confirm that the packets were indeed sent and received on both nodes (using tcpdump). His conclusion was that this is likely a bug in CARP (or possibly in a driver).
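
For anyone who wants to repeat the capture: CARP advertisements are sent as IP protocol 112, so something along these lines should show them. This is only a sketch; the function name is made up and the interface name (em1 here) is an assumption, so substitute the private interface on your node.

```shell
#!/bin/sh
# Hypothetical wrapper: show CARP advertisements (IP protocol 112)
# on the given interface. Needs root.
watch_carp() {
	iface=${1:-em1}   # em1 is a placeholder, not the real interface name
	tcpdump -n -i "$iface" ip proto 112
}

# Example (run on each node): watch_carp em1
```

If advertisements from both nodes show up on both sides, the packets are getting through and the problem is in how they are handled.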

It is also worth mentioning that CARP does work under OpenBSD 4.3, and VRRP works under Linux.

Since it's also in their interest to get this working for us (as this is what is holding us back from moving), they've been kind enough to provide access to their CARP test nodes to any developer who wants to take a stab at it. I have the credentials and details; I don't want to post them here, but I will provide them to anyone interested.

Regards,
Viktor Petersson
WireLoad


More information about the freebsd-net mailing list