sshd should not use TCP_NODELAY

Tue Jun 8 11:57:23 GMT 2004

Hi!

For some time now, ssh has been annoying me with little pauses.  These
pauses last about 1 second and until today I couldn't work out what could
be causing them.  It's particularly annoying when it happens in vi.

The hardware involved is a FreeBSD 4.9 machine (the server) behind a PIX,
over what I think is Frame Relay, thence to my ADSL, through a FreeBSD 4.8
PPPoE box, and finally to my FreeBSD 4.8 workstation.

Today, I found a directory where "ls" would give me the pause 100% of the
time.  Small output always works, as does a large amount of output.  Some
magical middle ground would fail.

The result of this particular "ls" is a blast of 34 small packets (nearly
all with an 80-byte payload), taking a shade over 4 ms to generate in total.
Of this flood of tinygrams, only 24 make it to my workstation.  I don't
know why, but something fills up at 24 packets and the rest get eaten at
about 70% probability per packet.

The pause that I see is a retransmit timeout, which takes about 1.3 seconds.

The reason that fast retransmit didn't trigger is that not many packets
get through after the cutoff, so the required number of duplicate acks
(three) never arrives.  An "ls" of a slightly larger directory works
smoothly because enough later packets get through to trigger it.
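
To make the duplicate-ack argument concrete, here is a minimal sketch of a
toy model.  It assumes the receiver sends one duplicate ack for every packet
that arrives after the first hole, that no acks are lost, and that the
threshold is the usual three.  The function names and the delivered[] array
are hypothetical, made up for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Usual fast retransmit threshold: three duplicate acks. */
#define DUPACK_THRESHOLD 3

/*
 * Toy model: delivered[i] says whether packet i of a burst reached the
 * receiver.  Every packet that arrives after the first lost one is out
 * of order, so it is assumed to provoke exactly one duplicate ack.
 */
size_t count_dup_acks(const bool *delivered, size_t n)
{
    size_t i = 0;
    size_t dups = 0;

    /* Skip the in-order prefix that is acked normally. */
    while (i < n && delivered[i])
        i++;

    /* i is now the first lost packet (or n if nothing was lost). */
    for (i = i + 1; i < n; i++)
        if (delivered[i])
            dups++;

    return dups;
}

/* True if the sender would see enough duplicate acks to fast retransmit. */
bool fast_retransmit_triggers(const bool *delivered, size_t n)
{
    return count_dup_acks(delivered, n) >= DUPACK_THRESHOLD;
}
```

With the first 24 of 34 packets delivered and only two of the nine packets
after the first loss getting through, this model yields two duplicate acks,
one short of the threshold, so the sender has to wait for the retransmit
timeout.  One more surviving packet would have been enough.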

The data are sent in tinygrams because sshd sets TCP_NODELAY on the network
socket and because the pty hands ssh little bits of data in each read.
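
For reference, this is roughly what setting and clearing the option looks
like at the sockets level.  It is a sketch only: set_tcp_nodelay() and
get_tcp_nodelay() are hypothetical helper names and are not the OpenSSH
functions.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Enable (on = 1) or disable (on = 0) TCP_NODELAY on a TCP socket.
 * With the option set, the stack sends small writes at once instead
 * of holding them back to coalesce with later data.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */
int set_tcp_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Report whether TCP_NODELAY is set: 1 if set, 0 if not, -1 on error. */
int get_tcp_nodelay(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, &len) == -1)
        return -1;
    return on != 0;
}
```

A freshly created TCP socket has the option off by default, so a program
only sends tinygrams this way if it asks to.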

If you deliberately select protocol version 1 then extra code is activated
which waits up to 10ms for a good amount of data from the pty.  Packets are
sent with approximately 300 byte payloads using protocol version 1 and this
seems good enough to avoid my particular problem.  But that's only for the
version 1 protocol, and not for version 2!
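
The idea of waiting a short while for more pty output can be sketched as
below.  This is not the OpenSSH protocol 1 code, just an illustration of the
technique under my own assumptions: read_coalesced(), its "want" target and
the use of poll() are all hypothetical choices.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds elapsed since *start on the monotonic clock. */
static long elapsed_ms(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) * 1000L +
           (now.tv_nsec - start->tv_nsec) / 1000000L;
}

/*
 * Block for the first chunk of data, then keep collecting for at most
 * max_wait_ms more, stopping early once "want" bytes are in hand or
 * the buffer is full.  Returns the number of bytes gathered, 0 on EOF
 * before any data, or -1 if the first read fails.
 */
ssize_t read_coalesced(int fd, char *buf, size_t cap, size_t want,
                       int max_wait_ms)
{
    struct timespec start;
    size_t total;
    ssize_t n;

    n = read(fd, buf, cap);             /* wait for the first bytes */
    if (n <= 0)
        return n;
    total = (size_t)n;

    clock_gettime(CLOCK_MONOTONIC, &start);
    while (total < want && total < cap) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        long left = max_wait_ms - elapsed_ms(&start);

        if (left <= 0)
            break;                      /* waited long enough */
        if (poll(&pfd, 1, (int)left) <= 0)
            break;                      /* timed out, or poll failed */
        n = read(fd, buf + total, cap - total);
        if (n <= 0)
            break;                      /* EOF or error: send what we have */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Called with something like want = 256 and max_wait_ms = 10, a burst of small
pty reads would go out as a few larger writes at the cost of at most 10 ms
of added latency.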

My rough and ready answer to my problem is to hack openssh/packet.c and take
out the call to set_nodelay() from packet_set_interactive().  This means that
TCP_NODELAY is no longer set on the socket.  The same change also turns it
off in the client, though I may yet change my mind on that.

But after this I had annoying 100ms pauses! :-)

To save you all from the suspense, this pause is due to delayed acks at
the client end.  The server is waiting for an ack before sending another
tinygram (though it's doing good by aggregating them into one good sized
lump while waiting).

To solve this, I set net.inet.tcp.delayed_ack=0 on the client, though I got
reasonable results from using net.inet.tcp.delacktime=10 (the minimum) as an
alternative.

So, what should I blame for all this?

The pty code?  It hands ssh tiny bits of data at a time.  I can't immediately
think of any fix I could make at this level though.

Ssh?  It no longer coalesces little bits of pty output into 256 byte chunks.
It also sets TCP_NODELAY, which guarantees that small writes equate to lots
of tinygrams.  More on this later...

The TCP stack?  Sending 34 packets in 4.4ms is silly for this application
(ssh) but may be just the right thing for another.  An old bug used to
limit TCP connections to 4 outstanding packets and so this problem never
used to appear.  Any sort of slow ramp-up of packets as opposed to the
usual ramping up of the window in bytes would also work, but that seems
to be expressly prohibited when TCP_NODELAY is set.

The NIC driver?  If it didn't accept those packets in one burst, the TCP
stack would have had a chance to coalesce some mbufs. :-)  OK, I'm only
kidding.  Really!  Stop looking at me like that!

The mysterious packet eater between my server and workstation?  Easy to
blame, but impossible to fix.  This is just how life is for me.  Maybe
it's like this all over the world.

My client machine?  The only thing it can change is delayed acks.  That's
handy if TCP_NODELAY is off, but not otherwise.

All in all I think the best short term fix is to not use TCP_NODELAY in
sshd.  Long term, there should be a way to stop TCP spewing tinygrams even
with TCP_NODELAY.  A delay of just 1ms between tinygrams would be enough
to stop this effect.

A marginally useful alternative to all this is to add the pty output
coalescing code to the version 2 protocol portion of ssh.  I'm not sure
this is as clearly a win as disabling TCP_NODELAY though.

By the way, an extensive discussion of the sshd vs TCP_NODELAY issue can
be found in the archives.  This is a message from part way through that
discussion, posted over two years ago, in which Matt Dillon advocates
removing TCP_NODELAY from ssh:

    http://www.mail-archive.com/freebsd-hackers@freebsd.org/msg30608.html

As far as I can tell, there was no result from this discussion, though
some people did note how badly ssh with TCP_NODELAY works over a modem,
and by extension any high latency low bandwidth channel.

I understand that ssh is 3rd party, but it's supplied as part of the
FreeBSD base system.  With this in mind, I am happy to edit it to remove
the setting of TCP_NODELAY.  Is anybody else of the same opinion?

Similarly, does anyone have any idea of the long term effect of setting
delacktime to 10ms?  I doubt that it is any worse than using 100ms on
modern machines, and is less disruptive.  Has anyone done a study of
this?

Stephen.

PS Why is libssh installed in /usr/lib?  Why is a dynamic version created
at all?  Isn't it just part of ssh and sshd and with no other purpose?
I can assure you it really got in the way of debugging this problem to have
to make a new version of sshd that used a different library so that I didn't
destroy my ability to use the real sshd to log in!