Re: TCP sends 9KB segments via netgraph tunnel despite MTU/MSS — TSO-related?

From: Konstantin Belousov <kib_at_freebsd.org>
Date: Wed, 14 May 2025 23:18:11 UTC

On Wed, May 14, 2025 at 10:45:27PM +0300, Ivan wrote:
> Hello,
> 
> I've been investigating a network issue that took quite some time to trace. I still cannot reproduce it in a test environment, but it consistently occurs on a specific FreeBSD server with a more complex network configuration.
> 
> Summary of the issue:  
> Under certain conditions, the system attempts to send TCP packets larger than 9 KB through a netgraph-based tunnel with MTU 1472, even though MSS was negotiated to 1400.
> 
> This happens when the initial route is via the default uplink, but PF then re-routes the packet via the netgraph tunnel using `route-to`. If the traffic is routed through ng0 directly (without PF), the issue does not occur. The problem also disappears if TSO is disabled on the uplink NIC.
> 
> System:
>   FreeBSD 13.5-RELEASE
>   releng/13.5-n259162-882b9f3f2218 GENERIC amd64
> 
> Interfaces:
> 
> - Primary LAN interface (where disabling TSO fixes the problem):
>     igb0, MTU 1500  
>     options=4e520bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,
>                     VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,
>                     RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
> 
> - Internet uplink:
>     onp, VLAN over igb0, MTU 1500  
>     options=4600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
> 
> - Netgraph tunnel:
>     ng0, MTU 1472  
>     inet 10.10.0.1 → 10.10.0.2
> 
> PF rules used for re-routing:
>     nat log(all) on onp inet from 10.10.0.1 to any tag NG -> (ng0) round-robin
>     pass out quick on onp route-to (ng0 10.10.0.2) inet all flags S/SA keep state tagged NG
> 
> Packet trace (via pflog during a POST request ~10KB to YouTube):
> 
>     15:46:01.784956 IP 10.10.0.1.62031 > 209.85.233.198.443: Flags [P.], seq 597:9703, length 9106
>     15:46:01.785020 IP 127.0.0.1 > 10.10.0.1: ICMP 209.85.233.198 unreachable - need to frag (mtu 1472)
> 
> This shows the kernel trying to send a 9106-byte segment over a link that clearly can't handle it. The MSS was already negotiated at 1400, so this seems unexpected. The ICMP response is generated locally. The result is segment loss, out-of-order retransmissions, and poor TLS performance.
> 
> I also reproduced this behavior with OpenVPN — so the issue is not netgraph-specific.
> 
> Questions:
> - Is this expected behavior due to TSO interacting poorly with PF route-to?
> - Should TSO respect the effective MTU based on the post-PF routing decision?
> - Or is this a bug in the TCP offload path?

The TCP output code decides whether to use TSO based on the capabilities of
the outgoing interface, which is looked up through the routing table.  It has
no knowledge, and probably should not have, of the packet filter mangling the
packet after it has been passed to ip_output() (and later).
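
To make the ordering concrete, here is a toy model in C.  It is not the
kernel code: struct toy_ifnet, toy_tcp_segment_len() and toy_fits() are
hypothetical names, the 40 bytes of headers assume IPv4 and TCP without
options, and the uplink is assumed to advertise TSO.  It only illustrates
that the segment size is chosen from the routed interface, before any
re-routing happens.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a network interface. */
struct toy_ifnet {
	const char *name;
	unsigned    mtu;	/* link MTU in bytes */
	bool        tso;	/* interface advertises TSO */
};

/*
 * Toy model of the TCP output decision: the payload length is chosen from
 * the capabilities of the interface found via the routing table.  With TSO
 * the stack may hand down one large segment and let the NIC split it;
 * without TSO it sends at most one MSS at a time.
 */
static size_t
toy_tcp_segment_len(const struct toy_ifnet *route_ifp, size_t pending,
    unsigned mss, size_t tso_max)
{
	if (route_ifp->tso && pending > mss)
		return (pending < tso_max ? pending : tso_max);
	return (pending < mss ? pending : mss);
}

/*
 * Toy model of the later check on the interface the packet is actually
 * sent through (for example after a route-to rule picked another one).
 * Assumes 20 bytes of IPv4 header plus 20 bytes of TCP header.
 */
static bool
toy_fits(const struct toy_ifnet *out_ifp, size_t seg_len)
{
	return (seg_len + 40 <= out_ifp->mtu);
}
```

With the numbers from the report (9106 bytes pending, MSS 1400), a routed
interface with TSO yields a single 9106-byte segment, which no longer fits
once the packet is redirected to an interface with MTU 1472 and no TSO.
Running the same decision against the tunnel interface yields 1400-byte
segments, which fit.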

BTW, I think that route-to would similarly break TLS and IPsec inline offloads.