Re: ssh connections break with "Fssh_packet_write_wait" on 13 [SOLVED]

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Wed, 09 Jun 2021 06:57:24 UTC
On  8 Jun, Michael Gmelin wrote:
> 
> 
> On Thu, 3 Jun 2021 15:09:06 +0200
> Michael Gmelin <freebsd@grem.de> wrote:
> 
>> On Tue, 1 Jun 2021 13:47:47 +0200
>> Michael Gmelin <freebsd@grem.de> wrote:
>> 
>> > Hi,
>> > 
>> > Since upgrading servers from 12.2 to 13.0, I get
>> > 
>> >   Fssh_packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
>> > 
>> > consistently, usually after about 11 idle minutes, that's with and
>> > without pf enabled. Client (11.4 in a VM) wasn't altered.
>> > 
>> > Verbose logging (client and server side) doesn't show anything
>> > special when the connection breaks. In the past, QoS problems
>> > caused these disconnects, but I didn't see anything apparent
>> > changing between 12.2 and 13 in this respect.
>> > 
>> > I did a test on a newly commissioned server to rule out other
>> > factors (so, same client connections, same routes, same
>> > everything). On 12.2 before the update: Connection stays open for
>> > hours. After the update (same server): connection breaks
>> > consistently after < 15 minutes (this is with unaltered
>> > configurations, no *AliveInterval configured on either side of the
>> > connection). 
>> 
>> I did a little bit more testing and realized that the problem goes
>> away when I disable "Proportional Rate Reduction per RFC 6937" on the
>> server side:
>> 
>>   sysctl net.inet.tcp.do_prr=0
>> 
>> Keeping it on and enabling net.inet.tcp.do_prr_conservative doesn't
>> fix the problem.
>> 
>> This seems to be specific to Parallels. After some more digging, I
>> realized that Parallels Desktop's NAT daemon (prl_naptd) handles
>> keep-alive between the VM and the external server on its own. There is
>> no direct communication between the client and the server. This means:
>> 
>> - The NAT daemon starts sending keep-alive packets right away (not
>>   after the VM's net.inet.tcp.keepidle), every 75 seconds.
>> - Keep-alive packets originating in the VM never reach the server.
>> - Keep-alive packets originating on the server never reach the VM.
>> - Client and server basically do keep-alive with the NAT daemon, not
>>   with each other.
>> 
>> It also seems like Parallels is filtering the tos field (so it's
>> always 0x00), but that's unrelated.
>> 
>> I configured a bhyve VM running FreeBSD 11.4 on a separate laptop on
>> the same network for comparison and it has no such issues.
>> 
>> Looking at tcpdump output on the server, this is what a keep-alive
>> packet sent by Parallels looks like:
>> 
>>   10:14:42.449681 IP (tos 0x0, ttl 64, id 15689, offset 0, flags [none],
>>     proto TCP (6), length 40)
>>     192.168.1.1.58222 > 192.168.1.2.22: Flags [.], cksum x (correct),
>>     seq 2534, ack 3851, win 4096, length 0
>> 
>> While those originating from the bhyve VM (after lowering
>> net.inet.tcp.keepidle) look like this:
>> 
>>   12:18:43.105460 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF],
>>     proto TCP (6), length 52)
>>     192.168.1.3.57555 > 192.168.1.2.22: Flags [.], cksum x
>>     (correct), seq 1780337696, ack 45831723, win 1026, options
>>     [nop,nop,TS val 3003646737 ecr 3331923346], length 0
>> 
>> As noted above, once net.inet.tcp.do_prr is disabled, keep-alive
>> seems to be working just fine. Otherwise, Parallels' NAT daemon kills
>> the connection, as its keep-alive requests are not answered (well,
>> that's what I think is happening):
>> 
>>   10:19:43.614803 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
>>     proto TCP (6), length 40)
>>     192.168.1.1.58222 > 192.168.1.2.22: Flags [R.], cksum x (correct),
>>     seq 2535, ack 3851, win 4096, length 0
>> 
>> The easiest way to work around the problem on the client side is to
>> configure ServerAliveInterval in ~/.ssh/config in the client VM.
>> 
>> I'm curious though if this is basically a Parallels problem that has
>> only been exposed by PRR being more correct (which is what I suspect),
>> or if this is actually a FreeBSD problem.
>> 
> 
> So, PRR probably was a red herring. The real reason this is happening
> is that FreeBSD (since version 13[0]) by default discards packets
> without timestamps on connections that had negotiated their use.
> This new behavior seems to be in line with RFC 7323, section
> 3.2[1]:
> 
>     "Once TSopt has been successfully negotiated, that is both <SYN> and
>     <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
>     segment for the duration of the connection, and SHOULD be sent in an
>     <RST> segment (see Section 5.2 for details)."
> 
> As it turns out, macOS does exactly this: it sends keep-alive packets
> without a timestamp on connections that negotiated timestamps.

I wonder if I'm running into this with ssh connections to freefall.  My
outgoing IPv6 connections pass through an ipfw firewall that uses
dynamic rules.  When the dynamic rule gets close to expiration, it
generates keepalive packets that just seem to be ignored by freefall.
Eventually the dynamic rule expires, then sometime later sshd on
freefall sends a keepalive which gets dropped at my end.
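
One way to check a capture for this would be to look for non-RST segments
that carry no timestamp option. Below is a minimal Python sketch, assuming
the capture was printed with "tcpdump -v" in the same layout as the dumps
quoted above. The function names are made up for illustration, and the
sample data is just those quoted dumps with their wrapped lines joined.
Note that it cannot tell whether timestamps were negotiated in the first
place; that would need the SYN and SYN,ACK of the connection.

```python
import re

# TCP timestamp option as tcpdump prints it, e.g. "TS val 3003646737 ecr 3331923346".
TS_OPTION = re.compile(r"\bTS val \d+ ecr \d+")
# TCP flags field, e.g. "Flags [.]" or "Flags [R.]" (the IP-level field is lowercase "flags").
TCP_FLAGS = re.compile(r"Flags \[([^\]]*)\]")
# A new record starts with a time of day such as "10:14:42.449681".
RECORD_START = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d+ ")


def split_records(text):
    """Join the continuation lines of each record into a single string."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def lacks_timestamp(record):
    """Return True for a non-RST TCP segment without a timestamp option."""
    flags = TCP_FLAGS.search(record)
    if flags is None:
        return False  # no TCP flags field found, nothing to classify
    if "R" in flags.group(1):
        return False  # RFC 7323 only says SHOULD for RST segments
    return TS_OPTION.search(record) is None


# Sample data: the three dumps quoted earlier in this thread.
SAMPLE = """
10:14:42.449681 IP (tos 0x0, ttl 64, id 15689, offset 0, flags [none], proto TCP (6), length 40)
    192.168.1.1.58222 > 192.168.1.2.22: Flags [.], cksum x (correct), seq 2534, ack 3851, win 4096, length 0
12:18:43.105460 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.3.57555 > 192.168.1.2.22: Flags [.], cksum x (correct), seq 1780337696, ack 45831723, win 1026, options [nop,nop,TS val 3003646737 ecr 3331923346], length 0
10:19:43.614803 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.1.58222 > 192.168.1.2.22: Flags [R.], cksum x (correct), seq 2535, ack 3851, win 4096, length 0
"""

if __name__ == "__main__":
    for record in split_records(SAMPLE):
        if lacks_timestamp(record):
            print("no timestamp option:", record[:15])
```

Run against the sample, only the Parallels keep-alive (10:14:42) is
flagged; the bhyve keep-alive carries a timestamp and the RST is skipped.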