PF or "traceroute -e -P TCP" bug?

Mon Aug 21 15:10:14 UTC 2006

On Mon, 21 Aug 2006 11:23:50 +0200
Daniel Hartmeier <daniel at benzedrine.cx> wrote:

> [ I'm CC'ing Crist, maybe he can explain why -e behaves like it does ]
> 
> On Fri, Aug 18, 2006 at 11:57:56PM +0300, Rostislav Krasny wrote:
> 
> > I've tried the new "-e" traceroute option on today's RELENG_6 and
> > found following problem:
> > 
> > > traceroute -nq 1 -e -P TCP -p 80 216.136.204.117
> 
> As I understand the -e option, that should send a sequence of TCP SYNs
> with
> 
>   - constant source port (randomly picked per invokation)
>   - constant destination port 80
>   - increasing TTL per probe
> 
> Assuming you pass the packets with pf, it matters whether you create
> state or not. Filtering statelessly (without 'keep state'), there should
> be no problem at all. I assume you're filtering statefully.

I don't use 'keep state' in any pf rule. But I use a nat rule like this:

nat on $ext_if from $internal_net to any -> ($ext_if)

and according to 'pfctl -s state' any NAT-ed TCP connection creates a
state. For example, during the above traceroute:

self tcp 192.168.1.2:34345 -> xxx.xxx.xxx.xxx:50646 -> 216.136.204.117:80       SYN_SENT:CLOSED

> With constant source and destination ports, the first probe should
> create a state entry and all further probes (of the same traceroute
> invokation) should match that state entry.
> 
> What you changed in your patch is switching to a sequential (instead of
> constant) source port. This forces creation of one state per probe,
> treating each probe as a separate connection.

Correct.

> I don't think that's in
> the spirit of the -e option. There's really no need for that, once the
> underlying problem is fixed.
> 
> So, why doesn't -e without your patch produce probes that all match a
> single state entry?

By the way, I asked a friend from IRC to try "traceroute -e -P TCP"
through his router which does NATing by natd and it worked there.

> Look at how the TCP sequence numbers are generated across the probes:
> 
> 	tcp->th_seq = (tcp->th_sport << 16) | (tcp->th_dport +
> 	    (fixedPort ? outdata->seq : 0));
> 
> This is the problem. traceroute increments the sequence number with each
> probe. I don't know why that is done. Why not use the same th_seq for
> all probes, like an ISN (initial sequence number) would be re-used in
> retransmissions in a real TCP handshake?
> 
> If you create state on the first TCP SYN pf sees, pf will note the ISN
> from the traceroute side. When pf sees further SYNs from that side, it
> will deal with them like with any client retransmitting the SYN of the
> handshake (before the peer replies with a SYN+ACK, giving its side's
> ISN). Subsequent TCP SYNs with different ISN matching the address/port
> pairs will be blocked by pf.
> 
> If this happens on the IP forwarding path (i.e. pf blocks the packet
> outgoing), the stack produces the ICMP host unreachable error that shows
> up as "!H" in traceroute. I assume you have a "pass out on $ext_if keep
> state" rule, and don't filter on the internal interface. If you add
> stateful filtering on the internal interface, I think you'll find that
> subsequent TCP SYNs are blocked without eliciting the ICMP error.
> 
> I suggest traceroute with -e uses fixed th_seq, as in
> 
> -	tcp->th_seq = (tcp->th_sport << 16) | (tcp->th_dport +
> -	   (fixedPort ? outdata->seq : 0));
> +	tcp->th_seq = (tcp->th_sport << 16) tcp->th_dport;

Even if I add accidentally deleted '|' it doesn't fix the problem:

> traceroute -nq 1 -e -P TCP -p 80 www.freebsd.org
traceroute to www.freebsd.org (216.136.204.117), 64 hops max, 52 byte packets
 1  192.168.1.1  0.525 ms
 2  10.0.0.138  2.122 ms
 3  *
 4  *
 5  *
 6  *
 7  *
 8  *
 9  *
10  152.63.3.122  191.562 ms
11  *
12  *
^C

I can decrease number of the "*" hops by -w option:

> traceroute -nq 1 -e -w 10 -P TCP -p 80 www.freebsd.org
traceroute to www.freebsd.org (216.136.204.117), 64 hops max, 52 byte packets
 1  192.168.1.1  0.506 ms
 2  10.0.0.138  1.886 ms
 3  *
 4  *
 5  *
 6  *
 7  212.143.12.45  151.282 ms
 8  *
^C

According to repeatedly ran 'pfctl -s state | grep 216.136.204.117'
it really has some relation to TCP states in the pf. Before the
212.143.12.45 hop the state closed and after that hop a new state
created.

And by the way, I think a tcp_check() function checks tcp->th_seq
incorrectly:

tcp->th_seq == (ident << 16) | (port + seq)

In original version or after my patch it should be changed to this:

tcp->th_seq == (htons(ident) << 16) | (port + (fixedPort ? seq : 0))

and after your patch to this:

tcp->th_seq == (htons(ident) << 16) | port

It looks like the return value of the tcp_check() isn't used anywhere
anyway.