Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

From: <tuexen_at_freebsd.org>
Date: Wed, 09 Jun 2021 09:22:55 UTC
> On 9. Jun 2021, at 08:57, Don Lewis <truckman@freebsd.org> wrote:
> 
> On  8 Jun, Michael Gmelin wrote:
>> 
>> 
>> On Thu, 3 Jun 2021 15:09:06 +0200
>> Michael Gmelin <freebsd@grem.de> wrote:
>> 
>>> On Tue, 1 Jun 2021 13:47:47 +0200
>>> Michael Gmelin <freebsd@grem.de> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Since upgrading servers from 12.2 to 13.0, I get
>>>> 
>>>> Fssh_packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
>>>> 
>>>> consistently, usually after about 11 idle minutes, that's with and
>>>> without pf enabled. Client (11.4 in a VM) wasn't altered.
>>>> 
>>>> Verbose logging (client and server side) doesn't show anything
>>>> special when the connection breaks. In the past, QoS problems
>>>> caused these disconnects, but I didn't see anything apparent
>>>> changing between 12.2 and 13 in this respect.
>>>> 
>>>> I did a test on a newly commissioned server to rule out other
>>>> factors (so, same client connections, same routes, same
>>>> everything). On 12.2 before the update: Connection stays open for
>>>> hours. After the update (same server): connection breaks
>>>> consistently after < 15 minutes (this is with unaltered
>>>> configurations, no *AliveInterval configured on either side of the
>>>> connection). 
>>> 
>>> I did a little bit more testing and realized that the problem goes
>>> away when I disable "Proportional Rate Reduction per RFC 6937" on the
>>> server side:
>>> 
>>> sysctl net.inet.tcp.do_prr=0
>>> 
>>> Keeping it on and enabling net.inet.tcp.do_prr_conservative doesn't
>>> fix the problem.
>>> 
>>> This seems to be specific to Parallels. After some more digging, I
>>> realized that Parallels Desktop's NAT daemon (prl_naptd) handles
>>> keep-alive between the VM and the external server on its own. There is
>>> no direct communication between the client and the server. This means:
>>> 
>>> - The NAT daemon starts sending keep-alive packets right away (not
>>> after the VM's net.inet.tcp.keepidle), every 75 seconds.
>>> - Keep-alive packets originating in the VM never reach the server.
>>> - Keep-alive packets originating on the server never reach the VM.
>>> - Client and server basically do keep-alive with the NAT daemon, not
>>> with each other.
>>> 
>>> It also seems like Parallels is filtering the tos field (so it's
>>> always 0x00), but that's unrelated.
>>> 
>>> I configured a bhyve VM running FreeBSD 11.4 on a separate laptop on
>>> the same network for comparison and it has no such issues.
>>> 
>>> Looking at tcpdump output on the server, this is what a keep-alive
>>> packet sent by Parallels looks like:
>>> 
>>> 10:14:42.449681 IP (tos 0x0, ttl 64, id 15689, offset 0, flags
>>> [none], proto TCP (6), length 40)
>>>   192.168.1.1.58222 > 192.168.1.2.22: Flags [.], cksum x (correct),
>>>   seq 2534, ack 3851, win 4096, length 0
>>> 
>>> While those originating from the bhyve VM (after lowering
>>> net.inet.tcp.keepidle) look like this:
>>> 
>>> 12:18:43.105460 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF],
>>>   proto TCP (6), length 52)
>>>   192.168.1.3.57555 > 192.168.1.2.22: Flags [.], cksum x
>>>   (correct), seq 1780337696, ack 45831723, win 1026, options
>>>   [nop,nop,TS val 3003646737 ecr 3331923346], length 0
>>> 
>>> As written above, once net.inet.tcp.do_prr is disabled, keep-alive
>>> seems to be working just fine. Otherwise, Parallels' NAT daemon kills
>>> the connection, as its keep-alive requests are not answered (well,
>>> that's what I think is happening):
>>> 
>>> 10:19:43.614803 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
>>>   proto TCP (6), length 40)
>>>   192.168.1.1.58222 > 192.168.1.2.22: Flags [R.], cksum x (correct),
>>>   seq 2535, ack 3851, win 4096, length 0
>>> 
>>> The easiest way to work around the problem on the client side is to
>>> configure ServerAliveInterval in ~/.ssh/config in the client VM.
>>> 
>>> I'm curious though if this is basically a Parallels problem that has
>>> only been exposed by PRR being more correct (which is what I suspect),
>>> or if this is actually a FreeBSD problem.
>>> 
>> 
>> So, PRR probably was a red herring and the real reason that's happening
>> is that FreeBSD (since version 13[0]) by default discards packets
>> without timestamps for connections that formally had negotiated to have
>> them. This new behavior seems to be in line with RFC 7323, section
>> 3.2[1]:
>> 
>>   "Once TSopt has been successfully negotiated, that is both <SYN> and
>>   <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
>>   segment for the duration of the connection, and SHOULD be sent in an
>>   <RST> segment (see Section 5.2 for details)."
>> 
>> As it turns out, macOS does exactly this - send keep-alive packets
>> without a timestamp for connections that were negotiated to have them.
> 
> I wonder if I'm running into this with ssh connections to freefall.  My
> outgoing IPv6 connections pass through an ipfw firewall that uses
> dynamic rules.  When the dynamic rule gets close to expiration, it
> generates keep alive packets that just seem to be ignored by freefall.
> Eventually the dynamic rule expires, then sometime later sshd on
> freefall sends a keepalive which gets dropped at my end.

ipfw sends non-compliant keep-alives:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253476
Basically, a node after an ipfw instance seems to be broken from the
perspective of the peer.
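
To make the failure mode concrete, here is a minimal Python sketch of the
RFC 7323 rule quoted above: once timestamps have been negotiated, a
non-RST segment without a TSopt is discarded. This is only an illustration
of the check, not the FreeBSD kernel code. The function names
(build_header, has_timestamp_option, would_be_dropped) are made up for
this example, and the header fields are taken from the tcpdump output
quoted earlier in the thread.

```python
import struct

TCPOPT_EOL = 0        # end of option list
TCPOPT_NOP = 1        # one-byte padding option
TCPOPT_TIMESTAMP = 8  # Timestamps option (TSopt), always 10 bytes long
TH_ACK = 0x10
TH_RST = 0x04


def build_header(sport, dport, seq, ack, flags, win, options=b""):
    """Build a bare TCP header (checksum and urgent pointer left at 0)."""
    if len(options) % 4 != 0:
        raise ValueError("options must be padded to a multiple of 4 bytes")
    data_offset = (20 + len(options)) // 4  # header length in 32-bit words
    fixed = struct.pack("!HHIIBBHHH", sport, dport, seq, ack,
                        data_offset << 4, flags, win, 0, 0)
    return fixed + options


def has_timestamp_option(tcp_header):
    """Return True if the header carries a well-formed Timestamps option."""
    if len(tcp_header) < 20:
        raise ValueError("TCP header shorter than 20 bytes")
    header_len = (tcp_header[12] >> 4) * 4
    if header_len < 20 or header_len > len(tcp_header):
        raise ValueError("invalid data offset")
    options = tcp_header[20:header_len]
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCPOPT_EOL:
            break
        if kind == TCPOPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option
        if kind == TCPOPT_TIMESTAMP and length == 10:
            return True
        i += length
    return False


def would_be_dropped(tcp_header, ts_negotiated):
    """Apply the RFC 7323 section 3.2 rule to a single incoming segment."""
    flags = tcp_header[13]
    if not ts_negotiated or flags & TH_RST:
        return False  # rule only covers non-RST segments with TS negotiated
    return not has_timestamp_option(tcp_header)


# Keep-alive as sent by the NAT daemon: 20-byte TCP header, no options
# (IP length 40 = 20 bytes IP + 20 bytes TCP).
parallels_keepalive = build_header(58222, 22, 2534, 3851, TH_ACK, 4096)

# Keep-alive from the bhyve VM: nop,nop,TS (IP length 52 = 20 + 20 + 12).
ts_option = b"\x01\x01\x08\x0a" + struct.pack("!II", 3003646737, 3331923346)
bhyve_keepalive = build_header(57555, 22, 1780337696, 45831723,
                               TH_ACK, 1026, ts_option)

# The final RST from the NAT daemon is exempt from the MUST.
parallels_rst = build_header(58222, 22, 2535, 3851, TH_RST | TH_ACK, 4096)

if __name__ == "__main__":
    print(would_be_dropped(parallels_keepalive, ts_negotiated=True))  # True
    print(would_be_dropped(bhyve_keepalive, ts_negotiated=True))      # False
    print(would_be_dropped(parallels_rst, ts_negotiated=True))        # False
```

The first segment has no room for options at all (data offset 5), so it
fails the check the same way the keep-alives generated by ipfw would.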

Best regards
Michael