Weird "ignoring syn" problem

David Cornejo dave at dogwood.com
Mon Jun 25 21:13:55 UTC 2007


At 08:27 AM 6/25/2007, Bill Moran wrote:
>In response to Bill Moran <wmoran at collaborativefusion.com>:
>
> > In response to Adam McDougall <mcdouga9 at egr.msu.edu>:
> >
> > > On Tue, Jun 12, 2007 at 10:19:49AM -0400, Bill Moran wrote:
> > >
> > >
> > >   This one has got me pretty befuddled.
> > >
> > >   We're seeing some really odd behaviour with FreeBSD ignoring SYN packets.
> > >   I've been trying to diagnose this for a couple of weeks now, and my current
> > >   guess is that there's something wrong with the em driver.  Here's a narrowed
> > >   down list of what I've ruled out:
> > >   *) I've done my best to eliminate other network components as the problem.
> > >      My theory at this point is that it can't possibly be any other network
> > >      hardware, based on the tcpdump shown below.
> > >   *) The problem occurred on both FreeBSD 6.1 and FreeBSD 6.2-p3.
> > >   *) The problem does not appear to be tied to CPU usage -- the CPU is nearly
> > >      idle when the problem occurs.
> > >   *) I can now reproduce it pretty easily, so I'll know when it's fixed.
> > >   *) The system exhibiting the problem is running 15 jails, but they are
> > >      idle 95% of the time.  The problem initially occurred inside one of
> > >      the jails, but I just recreated it outside the jail (on the host) and
> > >      it's _easier_ to reproduce outside the jail.
> > >   *) The problem occurred with both the GENERIC and the SMP kernel (this is a
> > >      dual-CPU, hyperthreaded system)
> > >   *) I've tested and the behavior occurs with both a dynamically generated
> > >      file (from PHP) and a static file.
> > >
> > >   The nature of the beast is that we've got a SOAP application running under
> > >   Apache and PHP.  This application is subject to many requests in rapid
> > >   succession, such that load can be simulated by the following loop:
> > >
> > >   while true; do fetch http://192.168.121.250/test.php; done
> > >
> > >   The problem is that occasionally, the Apache server machine just ignores
> > >   SYN packets.  Take the following tcpdump output for example:
> > >
> > >   13:34:17.312296 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535 <mss 1380,nop,wscale 1,nop,nop,timestamp 2690201156 0,sackOK,eol>
> > >   13:34:20.312398 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535 <mss 1380,nop,wscale 1,nop,nop,timestamp 2690204156 0,sackOK,eol>
> > >   13:34:23.512626 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535 <mss 1380,nop,wscale 1,nop,nop,timestamp 2690207356 0,sackOK,eol>
> > >
> > >   This is the _only_ traffic on port 80 during the test.  It looks like the
> > >   kernel has ignored the initial syn packet and two duplicates.  I've seen it
> > >   take as long as 45 seconds to establish a connection, and this causes
> > >   ugly performance problems, as well as frequent timeouts on the client end.
> > >   The only clue I've found so far is this output from netstat -s.
> > >
> > >
> > > Does the Apache server have a firewall of any sort?  (Could be making unexpected
> > > decisions there, even if not part of a fw rule)
> > >
> > > Try net.inet.ip.portrange.randomized=0 on the client?  (If this is the problem,
> > > we would probably see a reused port in a few minutes of tcpdump output
> > > captured after waiting for several minutes of "silence")
> > >
> > > Are both systems on the same subnet?  If not, can/have you tried that?
> >
> > No, they aren't.  My ability to test on the same subnet is limited and
> > the results inconclusive.
> >
> > > Can you show tcpdump output using -e on the requests that aren't answered
> > > as well as an example that IS answered?  (I have seen routers mess up the MAC
> > > addresses for the source and destination and if I kept staring at layer 3
> > > data all day I might never have seen the problem)
> > >
> > > Better yet, can you post files containing tcpdump output using -w of an entire
> > > session that ideally contains failed attempts that eventually work?  That way
> > > people could look at a broader picture and perhaps pick up on something subtle.
> > > It's worth comparing a SYN that works directly with a SYN that doesn't work.
> >
> > We've decided to swap the card out on Friday and see if that resolves the
> > problem.  We have similar units that don't exhibit the problem, so I'm
> > getting pretty suspicious that this might be a flaky NIC.  If the new
> > card doesn't solve the problem, I'll post more details on Monday.
>
>Just in case someone was curious as to the result, or finds this on a web
>search.
>
>The behaviour was apparently hardware related.  We swapped the NIC out and
>can no longer reproduce the problem.
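
For anyone who finds this thread in a search: the symptom quoted above
(SYNs retransmitted at roughly 3 second intervals before the handshake
completes) can be spotted from the client side by timing connects.
What follows is only a rough sketch in Python, not something that was
used in this thread.  The 2.5 second cutoff and the names time_connect
and summarize are my own assumptions; 192.168.121.250 is the test
address from the original report.

```python
import socket
import time

# Assumption: a connect that takes about as long as the first SYN
# retransmit (roughly 3 s in the trace above) means at least one SYN
# went unanswered.  The cutoff is a hypothetical value; tune as needed.
SLOW_THRESHOLD = 2.5  # seconds


def time_connect(host, port, timeout=60.0):
    """Return seconds taken to complete the TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None


def summarize(samples, threshold=SLOW_THRESHOLD):
    """Count fast, slow, and failed connects in a list of timings."""
    failed = sum(1 for s in samples if s is None)
    slow = sum(1 for s in samples if s is not None and s >= threshold)
    fast = len(samples) - failed - slow
    return {"fast": fast, "slow": slow, "failed": failed}


# Example use (address from the original report):
#   samples = [time_connect("192.168.121.250", 80) for _ in range(100)]
#   print(summarize(samples))
```

A healthy run should report nearly everything as fast.  A cluster of
slow connects suggests dropped SYNs, though it cannot say whether the
NIC, the driver, or something in between dropped them.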

To follow up on my situation: over the weekend I took the Soekris 
box that demonstrated the bad TCP checksums, wiped it, and 
reinstalled the same vintage CURRENT, and the problem disappeared.  I 
used the same kernel config in both cases.

Thanks to those who replied...

dave c


