scp more perfectly fills the pipe than NFS/TCP

Mon Dec 21 21:39:31 UTC 2009

    I'm just covering all the bases.  To be frank, half the time when
    someone posts they are doing something a certain way it turns out that
    they actually aren't.  I've learned that covering the bases tends to
    lead to solutions more quickly than assuming a perfect rendition.

    For example, is that 10ms latency with a ping?  What about a
    ping -s 4000?  If you are talking about 16KB RPC transactions over
    TCP then the real question is what is the latency for 16KB of data
    coming back along the wire?

    In your case we can calculate the read-ahead needed to keep the pipe
    full.  500 KBytes/sec divided by 16KB is about 31 transactions per
    second, or an effective latency of 32ms + probably 5-10ms for the RPC
    to be sent... so probably more around 40ms.  Not 10ms.  And if you are
    using 32KB transactions the latency is going to be more around 70ms.

    500 KBytes/sec x 40ms is about 20KB, so theoretically a read-ahead
    of 2 RPCs (at 16KB each) should do the trick.
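
    The arithmetic above can be sketched in a few lines of Python.  This
    is only an illustration: the function names are made up, KB is taken
    as 1024 bytes, and the 8ms send overhead is just a value picked from
    the 5-10ms guess above.

```python
import math

KB = 1024  # assumption: KB means 1024 bytes here


def effective_latency_ms(throughput_bytes_per_sec, rpc_bytes, send_overhead_ms=0.0):
    """Latency implied by the observed throughput if only one RPC is in flight."""
    rpcs_per_sec = throughput_bytes_per_sec / rpc_bytes
    return 1000.0 / rpcs_per_sec + send_overhead_ms


def readahead_rpcs(throughput_bytes_per_sec, latency_ms, rpc_bytes):
    """Number of RPCs that must be in flight to cover one round trip."""
    bytes_in_flight = throughput_bytes_per_sec * latency_ms / 1000.0
    return math.ceil(bytes_in_flight / rpc_bytes)


if __name__ == "__main__":
    rate = 500 * KB
    lat = effective_latency_ms(rate, 16 * KB, send_overhead_ms=8.0)
    print("effective latency (ms):", lat)
    print("read-ahead RPCs needed:", readahead_rpcs(rate, lat, 16 * KB))
```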

    There's a catch, however.  Depending on the client-side implementation
    the read-ahead requests may be transmitted out of order.  That is
    if the cp or dd program wants to read blocks 0, 1, 2, 3, 4, the
    actual RPCs sent over the wire might be sent like this:  0, 2, 1, 4, 3,
    or even 0, 4, 1, 2, 3.  Someone who knows what work was done on the
    FreeBSD NFS stack can tell you whether that is still the case.  If
    the nfsiod's (whether kernel threads or not) are separate synchronous
    RPCs then the read-ahead can transmit the RPC requests out of order.
    The server may also respond to them out of order... (typically there
    being 4 server-side threads handling RPCs).  The combination is deadly.

    If the read-aheads transmit out of order what happens is that
    cp/dd/whatever on the client winds up stalling waiting for the
    next linear block to come back, which might be BEHIND a later
    read-ahead block coming back down the wire.  That is, the stall
    (the RPC latency) winds up being multiplied by N.  A 40ms turn can
    turn into an 80 or 120ms turn before the cp/dd/whatever unstalls.
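
    A toy model of that stall, in Python.  It assumes the responses come
    back one per fixed time slot in the same order the requests went out
    on the wire, which is a simplification and not a description of any
    real NFS client.  The function names and the 40ms slot are only for
    illustration.

```python
def delivery_times_ms(wire_order, slot_ms):
    """Time at which a sequential reader can consume each block, in block order.

    Assumes one response arrives per slot, in the order given by wire_order.
    """
    arrival = {blk: (pos + 1) * slot_ms for pos, blk in enumerate(wire_order)}
    times = []
    ready = 0.0
    for blk in sorted(wire_order):
        # The reader cannot pass a block until it and all earlier blocks arrived.
        ready = max(ready, arrival[blk])
        times.append(ready)
    return times


def worst_gap_ms(wire_order, slot_ms):
    """Longest wait between two consecutive blocks handed to the reader."""
    worst = 0.0
    prev = 0.0
    for t in delivery_times_ms(wire_order, slot_ms):
        worst = max(worst, t - prev)
        prev = t
    return worst


if __name__ == "__main__":
    for order in ([0, 1, 2, 3, 4], [0, 2, 1, 4, 3], [0, 4, 1, 2, 3], [0, 3, 4, 1, 2]):
        print(order, "worst wait (ms):", worst_gap_ms(order, 40))
```

    In this model the in-order case never waits more than one 40ms
    slot, while the reordered cases wait 80 or 120ms for the next
    linear block.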

    To deal with this you want to set the read-ahead higher... probably at
    least 3 or 4 RPCs.

    As I said, there are other issues as the amount of read-ahead
    increases.  The only way to really figure out what is going on is
    to tcpdump the link and determine why the pipeline is not being
    maintained.  Look for out of order requests, out of order responses,
    and stalls (actual lost packets).
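
    Once the read offsets have been pulled out of the tcpdump output by
    hand or by script, spotting the out-of-order ones is simple.  A
    sketch in Python, assuming a plain list of block numbers (or byte
    offsets) in the order they appeared on the wire; the function name
    and the sample list are hypothetical, not real capture data.

```python
def out_of_order(offsets):
    """Return (index, offset) pairs seen after a higher offset already went by."""
    late = []
    highest = None
    for i, off in enumerate(offsets):
        if highest is not None and off < highest:
            late.append((i, off))
        else:
            highest = off
    return late


if __name__ == "__main__":
    # Made-up wire order, using the 0, 2, 1, 4, 3 pattern as an example.
    print(out_of_order([0, 2, 1, 4, 3]))
```

    The same check works for the responses; run it once on the request
    order and once on the reply order.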

    Actual lost packets are not likely in your case, assuming you are
    using something like fair-share scheduling and not RED (RED should
    only be used by routers in the middle of a large network; it should
    never be used at the end-points).

						-Matt