FreeBSD TCP Behavior with Linux NAT

Christopher Penney cpenney at gmail.com
Thu Nov 11 14:43:13 UTC 2010


Hi,

I have a curious problem I'm hoping someone can help with or at least
educate me on.

I have several large Linux clusters and for each one we hide the compute
nodes behind a head node using NAT.  Historically, this has worked very well
for us and any time a NAT gateway (the head node) reboots everything
recovers within a minute or two of it coming back up.  This includes NFS
mounts from Linux and Solaris NFS servers, license server connections, etc.

Recently, we added a FreeBSD based NFS server to our cluster resources and
have had significant issues with NFS mounts hanging if the head node
reboots.  We don't have this happen much, but it does occasionally happen.
I've explored this and it seems the behavior of FreeBSD differs a bit from
at least Linux and Solaris with respect to TCP recovery.  I'm curious if
someone can explain this or offer any workarounds.

Here are some specifics from a test I ran:

Before the reboot two Linux clients were mounting the FreeBSD server.  They
were both using port 903 locally.  On the head node clientA:903 was remapped
to headnode:903 and clientB:903 was remapped to headnode:601.  There is no
activity when the reboot occurs.  The head node takes a few minutes to come
back up (we kept it down for several minutes).

When it comes back up clientA and clientB try to reconnect to the FreeBSD
NFS server.  They both use the same source port, but since the head node's
conntrack table is cleared it's a race to see who gets what port and this
time clientA:903 appears as headnode:601 and clientB:903 appears as
headnode:903 ( >>> they essentially switch places as far as the FreeBSD
server would see <<< ).
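For anyone wanting to observe this, the port mappings can be checked from the
head node before and after the reboot.  This is a diagnostic sketch, assuming
the conntrack userspace tool (from conntrack-tools) is available and that the
mounts use the standard NFS TCP port 2049:

```shell
# List the NAT'd TCP connections toward the NFS server (port 2049 is
# an assumption -- adjust if your mounts use a different port).
conntrack -L -p tcp --dport 2049

# Before the reboot this shows clientA:903 mapped to headnode:903 and
# clientB:903 mapped to headnode:601.  After the reboot the table is
# empty until the clients retry, and the two mappings can come back
# swapped, since port assignment is first come, first served.
```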

The FreeBSD NFS server had no outstanding ACKs it was waiting on, so it
still considers both connections established.  When it receives a SYN from
each client it responds with a bare ACK rather than a SYN/ACK.  Each of
those ACKs carries a sequence number that is bogus from the client's point
of view, because the server is using the return path the *other* client was
using before the reboot.  The client therefore sends a RST back, but the
RST never reaches the FreeBSD system, since the head node's NAT hasn't yet
seen a full handshake on the new mapping (which would allow the packet
through).  The end result is a "permanent" hang (at least until the server
would otherwise clean up idle TCP connections).
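The stalled exchange is easy to watch from the head node.  A sketch of a
capture command, assuming the external interface bond0 from the SNAT rule
below and NFS on TCP port 2049:

```shell
# Show only the SYNs and RSTs of the reconnect attempts on the
# external interface (bond0 and port 2049 are assumptions -- adjust
# for your setup).  You should see the clients' SYNs go out, the
# server's bare ACKs come back, and the clients' RSTs go unanswered.
tcpdump -nni bond0 'port 2049 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'
```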

This is in stark contrast to the behavior of the other systems we have.
Those systems respond to the SYN used to reconnect with a SYN/ACK.  They
appear to implicitly tear down the old connection state when they receive a
SYN on a seemingly already-established connection.

I'm assuming this is one of the grey areas where no specific behavior is
outlined in an RFC?  Is there any way to make the FreeBSD system more
reliable in this situation (for example, by making it implicitly tear down
the old connection)?  Or is there a way to adjust the NAT setup so the RST
makes it back to the FreeBSD system?  Currently, NAT is set up with simply:

iptables -t nat -A POSTROUTING -s 10.1.0.0/16 -o bond0 -j SNAT --to 1.2.3.4

Where 1.2.3.4 is the intranet address and 10.1.0.0 is the cluster network.
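One NAT-side adjustment I'm considering, offered here as an untested
assumption rather than a known fix: loosening conntrack's TCP window
tracking so that packets falling outside the tracked window (like the
client's RST) are not marked INVALID and dropped.

```shell
# Tell conntrack to be liberal about out-of-window TCP packets
# (untested on this setup -- an assumption on my part).
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

# On older 2.6.x kernels using the ip_conntrack module the equivalent
# sysctl is:
# sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
```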

Thanks!

    Chris (not a list subscriber -- please CC if you can)
