Re: ssh connections break with "Fssh_packet_write_wait" on 13

From: Michael Gmelin <freebsd_at_grem.de>
Date: Thu, 03 Jun 2021 13:09:06 UTC

On Tue, 1 Jun 2021 13:47:47 +0200
Michael Gmelin <freebsd@grem.de> wrote:

> Hi,
> 
> Since upgrading servers from 12.2 to 13.0, I get
> 
>   Fssh_packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
> 
> consistently, usually after about 11 idle minutes, that's with and
> without pf enabled. Client (11.4 in a VM) wasn't altered.
> 
> Verbose logging (client and server side) doesn't show anything special
> when the connection breaks. In the past, QoS problems caused these
> disconnects, but I didn't see anything apparent changing between 12.2
> and 13 in this respect.
> 
> I did a test on a newly commissioned server to rule out other factors
> (so, same client connections, same routes, same everything). On 12.2
> before the update: Connection stays open for hours. After the update
> (same server): connection breaks consistently after < 15 minutes
> (this is with unaltered configurations, no *AliveInterval configured
> on either side of the connection).
> 

I did a little bit more testing and realized that the problem goes away
when I disable "Proportional Rate Reduction per RFC 6937" on the server
side:

  sysctl net.inet.tcp.do_prr=0

Keeping it on and enabling net.inet.tcp.do_prr_conservative doesn't fix
the problem.
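
To keep the workaround across reboots, the setting can go into
/etc/sysctl.conf on the server. A minimal sketch (the comment line is
only illustrative):

  # work around ssh disconnects seen from behind Parallels NAT
  net.inet.tcp.do_prr=0

The values currently in effect can be checked with:

  sysctl net.inet.tcp.do_prr net.inet.tcp.do_prr_conservative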

This seems to be specific to Parallels. After some more digging, I
realized that Parallels Desktop's NAT daemon (prl_naptd) handles
keep-alive between the VM and the external server on its own. There is
no direct communication between the client and the server. This means:

- The NAT daemon starts sending keep-alive packets right away (not
  after the VM's net.inet.tcp.keepidle), every 75 seconds.
- Keep-alive packets originating in the VM never reach the server.
- Keep-alive packets originating on the server never reach the VM.
- Client and server basically do keep-alive with the NAT daemon, not
  with each other.

It also seems like Parallels is filtering the tos field (so it's always
0x00), but that's unrelated.

I configured a bhyve VM running FreeBSD 11.4 on a separate laptop on
the same network for comparison and it has no such issues.

Looking at tcpdump output on the server, this is what a keep-alive
packet sent by Parallels looks like:

  10:14:42.449681 IP (tos 0x0, ttl 64, id 15689, offset 0, flags [none],
    proto TCP (6), length 40)
    192.168.1.1.58222 > 192.168.1.2.22: Flags [.], cksum x (correct),
    seq 2534, ack 3851, win 4096, length 0

While those originating from the bhyve VM (after lowering
net.inet.tcp.keepidle) look like this:

  12:18:43.105460 IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF],
    proto TCP (6), length 52)
    192.168.1.3.57555 > 192.168.1.2.22: Flags [.], cksum x
    (correct), seq 1780337696, ack 45831723, win 1026, options
    [nop,nop,TS val 3003646737 ecr 3331923346], length 0

As noted above, once net.inet.tcp.do_prr is disabled, keep-alive
seems to work just fine. Otherwise, Parallels' NAT daemon kills
the connection, as its keep-alive requests are not answered (well,
that's what I think is happening):

  10:19:43.614803 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
    proto TCP (6), length 40)
    192.168.1.1.58222 > 192.168.1.2.22: Flags [R.], cksum x (correct),
    seq 2535, ack 3851, win 4096, length 0
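
Packets like the ones shown above (zero-length segments and resets on
the ssh port) can be isolated on the server with a capture along these
lines. This is only a sketch: em0 is a placeholder interface name, and
the filter is one possible way to select such packets over IPv4, not
necessarily how the captures above were taken:

  tcpdump -n -v -i em0 \
    'tcp port 22 and (tcp[tcpflags] & tcp-rst != 0 or
     ((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) == 0)'

The second half of the expression subtracts the IP and TCP header
lengths from the total IP length, so it matches segments that carry no
payload, which includes keep-alive probes and plain ACKs.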

The easiest way to work around the problem on the client side is to
configure ServerAliveInterval in ~/.ssh/config in the client VM.
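
A minimal sketch of such an entry (the "Host *" pattern and the
60-second interval are placeholder values chosen for illustration; any
interval comfortably below the idle time at which the connection breaks
should do):

  Host *
      ServerAliveInterval 60
      ServerAliveCountMax 3

ServerAliveInterval makes the ssh client send a message through the
encrypted channel after that many seconds without data from the server,
so it should show up as regular traffic on the connection rather than
as a TCP keep-alive that the NAT daemon answers on its own. To try it
for a single session without touching the config file:

  ssh -o ServerAliveInterval=60 user@1.2.3.4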

I'm curious though if this is basically a Parallels problem that has
only been exposed by PRR being more correct (which is what I suspect),
or if this is actually a FreeBSD problem.

Michael

-- 
Michael Gmelin