Slow disk write speeds over network

Wed Jun 11 00:47:54 PDT 2003

Jason Stone wrote:
> > You haven't said if you were using UDP or TCP for the mounts; you
> > should definitely use TCP with FreeBSD NFS servers; it's also just
> > generally a good idea, since UDP frags act as a fixed non-sliding
> > window: NFS over UDP sucks.
> 
> Huh.  I thought that the conventional wisdom was that on a local network
> with no packet loss (and therefore no re-transmission penalties), udp was
> way faster because the overhead was so much less.
> 
> Sorry if this seems like a pretty basic question, but can you explain
> this?

Sure:

1)	There is no such thing as no packet loss.

2)	The UDP packets are reassembled in a reassembly queue
	on the receiver.  While this is happening, you can only
	have one datagram outstanding at a time.  With TCP, you
	get a sliding window; with UDP, you stall waiting for
	the reassembly, effectively giving you a non-sliding
	window (request/response, paying a round-trip latency
	per datagram, instead of a couple of round trips
	amortized across a 100 MB file transfer).

3)	When a packet is lost, the UDP retransmit code is rather
	crufty.  Losing a single fragment means the whole
	datagram, every fragment of it, gets resent, and you
	eat the overhead for that.  TCP, on the other hand, can
	do selective acknowledgement, or, if that's not supported
	by both ends, it can at least acknowledge the segments
	that did get through, saving you a retransmit.

4)	FreeBSD's UDP fragment reassembly buffer code is well
	known to pretty much suck.  This is true of most UDP
	fragment reassembly code in the universe, however, and
	is not that specific to FreeBSD.  So sending UDP packets
	that get fragged because they're larger than your MTU is
	not a very clever way of achieving a fixed window size
	larger than the MTU (see also #2, above, for why you do
	not want to use an effectively fixed window protocol
	anyway).
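To put rough numbers on points 3 and 4, here is a minimal Python sketch (not taken from any NFS code) that counts how many IPv4 fragments one UDP datagram needs at a given MTU, then compares how much data gets resent when any fragment loss forces the whole datagram out again versus resending only the lost pieces.  It assumes a 20-byte IPv4 header with no options, independent per-fragment loss with probability p, and, for the TCP-like case, segments the same size as the fragments; the function names are made up for illustration.

```python
import math
from typing import Tuple

IP_HEADER = 20   # IPv4 header without options (assumed)
UDP_HEADER = 8   # UDP header, carried only in the first fragment


def fragment_count(payload: int, mtu: int) -> int:
    """IPv4 fragments needed for one UDP datagram carrying `payload` bytes."""
    # Every fragment but the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((payload + UDP_HEADER) / per_fragment)


def p_any_lost(n_fragments: int, p: float) -> float:
    """Chance that at least one of n fragments is lost (independent loss p)."""
    return 1 - (1 - p) ** n_fragments


def expected_resent(n_fragments: int, p: float) -> Tuple[float, float]:
    """Expected fragments' worth of data resent per transmission attempt.

    Returns (whole_datagram, lost_pieces_only): the first resends all n
    fragments whenever any one is lost; the second resends only lost ones.
    """
    whole = n_fragments * p_any_lost(n_fragments, p)
    selective = n_fragments * p
    return whole, selective


if __name__ == "__main__":
    n = fragment_count(8192, 1500)  # 8 KB wsize over a 1500-byte MTU
    print(n, expected_resent(n, 0.01))
```

For example, an 8 KB wsize over a 1500-byte MTU comes out to 6 fragments; at an assumed 1% independent loss rate, the whole-datagram scheme resends roughly 0.35 fragments' worth per datagram on average, against 0.06 if only the lost pieces are resent.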

Even if there were no packet loss at all with UDP, unless all
your data is around the size of one rsize/wsize/packet, the
combined RTT overhead for even a moderately large number of
packets in a single run is enough that the amortized cost of
TCP's additional overhead ends up lower than the latency
overhead of UDP.  Depending on your hardware (switch latency,
half duplex, etc.), you could also be talking about a significant
combined bandwidth-delay product.
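The RTT arithmetic above can be sketched in Python with a deliberately crude model: the stop-and-wait case pays one full round trip per rsize/wsize chunk, while the windowed case pays roughly one round trip up front and then runs at the lesser of link bandwidth and window/RTT.  These formulas, the function names, and the example figures are illustrative assumptions, not measurements; the model ignores TCP header and ACK overhead, slow start, and any concurrency from multiple outstanding requests.

```python
import math


def stop_and_wait_time(total_bytes: int, chunk: int, rtt: float, bandwidth: float) -> float:
    """Seconds to move total_bytes when each chunk is a separate request/response.

    Every chunk costs one full round trip, plus the time to serialize the data.
    """
    chunks = math.ceil(total_bytes / chunk)
    return chunks * rtt + total_bytes / bandwidth


def windowed_time(total_bytes: int, window: int, rtt: float, bandwidth: float) -> float:
    """Seconds to move total_bytes with a sliding window of `window` bytes.

    Once the window covers the bandwidth-delay product (bandwidth * rtt) the
    link stays full; otherwise throughput is capped at window / rtt.
    """
    rate = min(bandwidth, window / rtt)  # bytes per second actually achieved
    return rtt + total_bytes / rate


if __name__ == "__main__":
    size = 100 * 2**20  # 100 MB file (assumed)
    rtt = 0.0005        # 0.5 ms LAN round trip (assumed)
    link = 12.5e6       # 100 Mbit/s expressed in bytes per second (assumed)
    print("stop-and-wait: ", stop_and_wait_time(size, 8192, rtt, link))
    print("sliding window:", windowed_time(size, 65536, rtt, link))
```

With those assumed numbers (104,857,600 bytes, 8 KB chunks, 0.5 ms RTT, 12.5e6 bytes/s), the stop-and-wait model comes out around 14.8 s against about 8.4 s for a 64 KB window; the 6.4 s gap is entirely per-chunk round trips (12,800 chunks at 0.5 ms each).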

Now add to all this that you have to send explicit ACKs with UDP,
while with TCP the ACKs can be piggybacked on the return payloads.

I think the idea that UDP was OK for nearly lossless short-haul
networks came about from people who couldn't code a working TCP
NFS client.
8-).

-- Terry