Diagnose co-location networking problem
Matthew Hudson
fbsd at synoptic.org
Wed Dec 27 14:45:33 PST 2006
On Tue, Dec 26, 2006 at 06:45:39PM -0800, Stephan Wehner wrote:
> So I am thinking the problem may be with the co-location operation.
>
> How can I make sure? How can I diagnose this? The only idea I had was
> to run tcpdump on my Linux client (tcpdump host stbgo.org), and indeed
> I can see entries like this:
>
I troubleshoot issues just like this for a living so I hope I can be
of some help. Others have already suggested some useful strategies so
I'll try to focus on ones that I haven't seen mentioned yet.
Off the bat, based on what you've described, I'd tend to suspect some
sort of transparent proxy, be it a stateful firewall or an intermediary
loadbalancer of some sort. The fact that your ssh connection from
the same source IP (I'm assuming) isn't showing any symptoms would
tend to de-emphasize layers 1-3 (IP on down to ethernet), ruling out
packet loss due to an ethernet duplex mismatch or bad cabling, and
ruling out bad IP routing, though not rate limiting. However, if
you've been experiencing intermittent pauses in your ssh session,
even if they don't coincide with the interruptions in http traffic,
you may still have a packet loss issue.
If you suspect packet loss, confirm with 'netstat -i' and look at
the Ierrs and Oerrs columns; they should both be 0 if everything
is spiff. Also check the TCP retransmit counters in 'netstat -s'
(you will always have some retransmission, you just don't want a
*lot* of it). I should note that I think this is a low-probability
cause based on the symptoms.
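To avoid eyeballing the counters by hand, a quick awk filter can flag
any interface with non-zero error counts. This is just a sketch: the
field positions assumed here ($6 = Ierrs, $8 = Oerrs) vary between
netstat versions, so check the header line of your own output first.

```shell
# Sketch: flag interfaces with non-zero error counters in 'netstat -i'
# output. Field positions ($6 = Ierrs, $8 = Oerrs) are an assumption --
# they differ between netstat versions, so verify against your header line.
flag_errs() {
    awk 'NR > 1 && ($6 + 0 > 0 || $8 + 0 > 0) { print $1 ": Ierrs=" $6 " Oerrs=" $8 }'
}
# usage: netstat -i | flag_errs
```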
Actually, based on the traffic snip you quoted, I strongly suspect
a firewall/loadbalancer/proxy... note the source IP:
> 21:40:22.162536 192.168.2.54.35932 > 65.110.18.138.80:
> S 1526509984:1526509984(0) win 5840
> <mss 1460,sackOK,timestamp 980 52714 0,nop,wscale 0> (DF)
The source IP is 192.168.2.54, which is an RFC 1918 private address
and isn't routable on the public internet. Unless you're coming
through a VPN or are local to the network, this is clear evidence
that there is a box in the middle that's at least smart enough to
do address translation.
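For reference, the RFC 1918 private blocks are 10.0.0.0/8,
172.16.0.0/12 and 192.168.0.0/16. A small shell sketch to classify
an address you see in a capture:

```shell
# Sketch: classify an IPv4 address as RFC 1918 private or public.
# The three private blocks are 10/8, 172.16/12 and 192.168/16.
is_private() {
    case "$1" in
        10.*|192.168.*)                        echo private ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo private ;;
        *)                                     echo public  ;;
    esac
}
# usage: is_private 192.168.2.54   -> private
```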
To troubleshoot everything else, I would start by recording a full
traffic capture on both the client and the server while trying to
reproduce the problem. It sounds like that shouldn't be a problem.
On the client I'd run:
tcpdump -n -s 1600 -i <outgoing interface> -w clientside.dmp host <serverIP>
On the server I'd run:
tcpdump -n -s 1600 -i <external interface> -w serverside.dmp
Plan on the clientside.dmp and serverside.dmp files getting large fast.
That's OK; you just want to be sure to get everything.
Let these two dumps run and then proceed to reproduce the problem. If
you can, get a good mix of good connections vs. failed ones. Then stop
the dumps; it's time for analysis.
For this, I'd recommend using the program 'tcptrace', which is in the
ports tree.
I'd start by looking in clientside.dmp for failed connection
attempts/short connections. You can do this using the command
tcptrace -n -b clientside.dmp
and you should see something like this:
hudson@Nikto:~/share > tcptrace -n -b dumpexample.dmp
1 arg remaining, starting with 'dumpexample.dmp'
Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003
496 packets seen, 496 TCP packets traced
elapsed wallclock time: 0:00:00.030771, 16119 pkts/sec analyzed
trace file elapsed time: 0:00:25.364361
TCP connection info:
1: 10.192.4.16:59723 - 72.14.253.99:80 (a2b) 7> 7< (complete)
2: 195.64.132.11:29957 - 10.192.4.16:80 (c2d) 1> 3<
3: 10.192.4.16:51717 - 198.238.212.10:80 (e2f) 30> 41<
4: 10.192.4.16:64601 - 198.238.212.10:80 (g2h) 17> 9<
5: 10.192.4.16:54693 - 198.238.212.10:80 (i2j) 26> 15< (complete)
6: 10.192.4.16:65285 - 198.238.212.30:80 (k2l) 33> 52<
7: 10.192.4.16:54362 - 66.35.250.151:80 (m2n) 5> 5< (complete)
8: 10.192.4.16:65391 - 66.35.250.150:80 (o2p) 14> 16< (complete)
This gives you a rough outline of the connections in the dump and tells you
how many packets were sent in each direction. If a connection failed,
you should see a very low packet count for it. Connection #2 in
the above example would be suspect, for instance. Once you have isolated an
interesting connection, you can use tcpdump again to filter on that connection
and get the full story:
tcpdump -n -r dumpexample.dmp port 29957
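If the dump contains many connections, you can automate the
"very low packet count" check instead of eyeballing the listing.
A sketch, with an arbitrary 5-packet threshold:

```shell
# Sketch: flag connections in 'tcptrace -n -b' output where either
# direction saw fewer than 5 packets. The threshold is an arbitrary
# guess -- tune it to your traffic.
low_pkt_conns() {
    awk '/^ *[0-9]+:/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+>$/) sent = $i + 0
            if ($i ~ /^[0-9]+<$/) recv = $i + 0
        }
        if (sent < 5 || recv < 5) print
    }'
}
# usage: tcptrace -n -b clientside.dmp | low_pkt_conns
```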
My first hunch would be that there are intermittent connection
establishment failures thanks to the firewall/loadbalancer. This
would manifest itself as SYN's being seen on the client side that
are not seen on the server side. (find a connection where SYN's
aren't being answered in clientside.dmp and then check serverside.dmp
to see if the SYN's are being received, crossreference by time).
If the SYN's are making it to the server, then you have a server
issue; if they aren't, then we're still looking at a potential
middlebox problem. When looking for the SYN's in the serverside dump,
don't filter by IP address, as it's possible that the failure is due
to bad address translation somewhere... i.e. you may see SYN's being
received at the same time the client is sending them, but with
the wrong source IP address.
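A sketch of that cross-reference: 'tcp[13] & 2 != 0' is the classic
BPF expression for "SYN flag set" (byte 13 of the TCP header holds
the flags), and the helper below narrows tcpdump's text output to a
given second.

```shell
# Sketch: list every SYN in the server-side dump regardless of source IP:
#   tcpdump -n -r serverside.dmp 'tcp[13] & 2 != 0'
# then narrow the text output to the second the client sent its SYN.
syns_at() {
    grep "^$1"    # $1 is an HH:MM:SS prefix taken from the client-side trace
}
# usage: tcpdump -n -r serverside.dmp 'tcp[13] & 2 != 0' | syns_at 21:40:22
```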
If the problem isn't restricted to connection establishment, then I'd
look for connections in clientside.dmp that have long pauses in them and
try to explain those pauses by comparing with serverside.dmp. To isolate
connections with pauses in them, I'd again turn to the trusty 'tcptrace'
program. This time however I'd use the '-l' ("long") switch to get more
details on individual connections and grep for anomalies. Here's an example
of tcptrace long output:
TCP connection 1:
host a: 10.192.4.16:59723
host b: 72.14.253.99:80
complete conn: yes
first packet: Wed Dec 27 13:29:59.651504 2006
last packet: Wed Dec 27 13:30:02.302161 2006
elapsed time: 0:00:02.650656
total packets: 14
filename: dumpexample.dmp
a->b: b->a:
total packets: 7 total packets: 7
ack pkts sent: 6 ack pkts sent: 7
pure acks sent: 3 pure acks sent: 2
sack pkts sent: 0 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 0 max sack blks/ack: 0
unique bytes sent: 1271 unique bytes sent: 2399
actual data pkts: 2 actual data pkts: 3
actual data bytes: 1271 actual data bytes: 2399
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data bytes: 0 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 2 pushed data pkts: 2
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/1
req 1323 ws/ts: Y/Y req 1323 ws/ts: N/N
adv wind scale: 0 adv wind scale: 0
req sack: Y req sack: N
sacks sent: 0 sacks sent: 0
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 762 bytes max segm size: 1430 bytes
min segm size: 509 bytes min segm size: 151 bytes
avg segm size: 635 bytes avg segm size: 799 bytes
max win adv: 65535 bytes max win adv: 8190 bytes
min win adv: 64882 bytes min win adv: 6444 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 65441 bytes avg win adv: 7317 bytes
initial window: 509 bytes initial window: 2248 bytes
initial window: 1 pkts initial window: 2 pkts
ttl stream length: 1271 bytes ttl stream length: 2399 bytes
missed data: 0 bytes missed data: 0 bytes
truncated data: 0 bytes truncated data: 0 bytes
truncated packets: 0 pkts truncated packets: 0 pkts
data xmit time: 2.495 secs data xmit time: 2.566 secs
idletime max: 2449.3 ms idletime max: 2519.2 ms
throughput: 480 Bps throughput: 905 Bps
I'd look at the 'elapsed time' and 'idletime max' fields, in the
server-to-client direction only (in this case the "b->a" column;
the other direction will always have long idle times due to the
nature of HTTP). Some clever grepping should isolate interesting
candidate connections, which you can then pull out with tcpdump.
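As a sketch of that clever grepping, this pairs each connection
number with the b->a 'idletime max' value and flags long ones; the
2000 ms threshold is an arbitrary assumption.

```shell
# Sketch: flag connections in 'tcptrace -n -l' output whose server->client
# "idletime max" (the second column on that line) exceeds 2000 ms.
# The threshold is an arbitrary assumption -- tune it to your symptoms.
long_idle() {
    awk '/^TCP connection / { conn = $3; sub(/:/, "", conn) }
         /idletime max:/ && $NF == "ms" && $(NF-1) + 0 > 2000 {
             print "connection " conn ": idle " $(NF-1) " ms"
         }'
}
# usage: tcptrace -n -l clientside.dmp | long_idle
```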
At this point, if it is indeed a middlebox problem, there's probably
not much you can do about it. But if you can isolate the symptoms
and even provide example tcpdumps illustrating the problem, then
you greatly increase the chances that your ISP's support staff can
resolve the problem. Many times, even if they know that a problem
exists, they may not know how to resolve it... having tcpdumps handy
makes it easier for them to show the problem to someone else (say,
the firewall/loadbalancer vendor) who can tell them how to fix it.
I know this from experience, I work at a company that makes
loadbalancers. ;)
Hope that helps,
--
Matthew Hudson
More information about the freebsd-net
mailing list