DFLTPHYS vs MAXPHYS
Matthew Dillon
dillon at apollo.backplane.com
Mon Jul 6 01:14:21 UTC 2009
I think MAXPHYS, or the equivalent, is still used somewhat in the
clustering code. The number of buffers the clustering code decides to
chain together dictates the impact on the actual device. The relevancy
here has very little to do with cache smashing and more to do with
optimizing disk seeks (or network latency). There is no best value for
this. It is only marginally more interesting for a network interface
due to the fact that most links still run with absurdly small MTUs
(even 9000+ is absurdly small). It is entirely uninteresting for
a SATA or other modern disk link.
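(As a rough illustration only, not the actual clustering code: the
decision amounts to clamping the number of chained buffers so the
combined transfer stays under a MAXPHYS-style limit.  The names and
constants below are made up.)

/*
 * Sketch: how a cluster-read path might decide how many contiguous
 * buffers to chain, capped by a MAXPHYS-like transfer limit.
 */
#include <stdio.h>

#define BLKSIZE   16384          /* hypothetical filesystem block size */
#define MAX_XFER  (128 * 1024)   /* hypothetical max physical transfer */

static int
cluster_chain(int nblocks_wanted)
{
    int maxblocks = MAX_XFER / BLKSIZE;    /* cap on chained buffers */
    return (nblocks_wanted < maxblocks) ? nblocks_wanted : maxblocks;
}

int
main(void)
{
    int n = cluster_chain(20);
    printf("chain %d of 20 requested blocks (%d byte transfer)\n",
        n, n * BLKSIZE);
    return 0;
}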
For linear transfers you only need a value sufficiently large to reduce
the impact of command overhead on the cpu and achieve the device's
maximum linear transfer rate.  For example, doing a dd with bs=512
versus bs=32k.  It runs on a curve and there will generally be very
little additional bang for the buck beyond 64K for a linear transfer
(assuming read ahead and NCQ to reduce inter-command latency).
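To make the shape of that curve concrete, here's a toy model with
purely invented numbers (100 MB/s streaming rate, 100 microseconds of
fixed per-command overhead), nothing measured:

/*
 * Toy model, not a benchmark: effective linear throughput as a
 * function of block size when each command carries a fixed
 * per-command overhead.  The numbers are made up just to show the
 * shape of the curve.
 */
#include <stdio.h>

int
main(void)
{
    double bandwidth = 100e6;    /* assumed device streaming rate, B/s */
    double overhead  = 100e-6;   /* assumed fixed cost per command, s  */

    for (int bs = 512; bs <= 1024 * 1024; bs *= 2) {
        double t = overhead + (double)bs / bandwidth;
        printf("bs=%7d  effective %6.1f MB/s\n", bs, bs / t / 1e6);
    }
    return 0;
}

With those made-up numbers you're already at roughly 87% of the
streaming rate by 64K and the curve flattens out from there, which is
the "little additional bang for the buck" point.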
For random and semi-random transfers larger buffer sizes have two
impacts.  The first is a negative impact on seek times.  A random
seek-read of 16K is faster than a random seek-read of 64K, which is
faster than a random seek-read of 512K.  I did a ton of testing with
HAMMER and it just
didn't make much sense to go beyond 128K, frankly, but neither does it
make sense to use something really tiny like 8K. 32K-128K seems to be
the sweet spot. The second is a positive impact on reducing the total
number of seeks *IF* you have reasonable cache locality of reference.
There is no correct value, it depends heavily on the access pattern.
A random access pattern with very little locality of reference will
benefit from a smaller block size while a random access pattern with
high locality of reference will benefit from a larger block size. That's
all there is to it.
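If you want to play with the trade-off, here's a crude model (all
numbers invented; in reality locality also falls off as the block size
grows, which is where the 32K-128K sweet spot comes from):

/*
 * Toy model of the random-I/O trade-off.  "locality" is the fraction
 * of the other records in a fetched block that turn out to be useful:
 * 0 means only the record we seeked for gets used, 1 means the whole
 * block gets used.  All figures are made up.
 */
#include <stdio.h>

int
main(void)
{
    double seek_time = 8e-3;     /* assumed average seek+rotate, s */
    double bandwidth = 100e6;    /* assumed transfer rate, B/s     */
    int    recsize   = 4096;     /* assumed logical record size    */
    int    nrecords  = 100000;   /* records the workload touches   */
    double localities[] = { 0.0, 0.5, 0.9 };

    for (int i = 0; i < 3; i++) {
        double loc = localities[i];
        printf("locality %.1f:\n", loc);
        for (int bs = 8192; bs <= 512 * 1024; bs *= 2) {
            double useful = 1.0 + loc * ((double)bs / recsize - 1.0);
            double seeks  = nrecords / useful;
            double total  = seeks * (seek_time + (double)bs / bandwidth);
            printf("  bs=%7d  %8.1f seeks  %7.1f s total\n",
                bs, seeks, total);
        }
    }
    return 0;
}

With zero locality the smallest block size wins outright; with high
locality the larger block sizes win because they collapse the number
of seeks.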
I have a fairly negative opinion of trying to tune block size to cpu
caches. I don't think it matters nearly as much as tuning it to the
seek/locality-of-reference performance curve, and I don't feel that
contrived linear tests are all that interesting since they don't really
reflect real-life workloads.
On-drive caching has an impact too, but that's another conversation.
Vendors have been known to intentionally degrade drive cache performance
on consumer drives versus commercial drives.  In testing HAMMER I've
often hit limitations which seem to be contrived by vendors: a sane
drive cache would have allowed me to use a smaller block size and still
get the locality of reference, but I wind up having to use a larger one
because the drive cache doesn't behave sanely.
--
The DMA ability of modern devices and device drivers is pretty much moot
as no self respecting disk controller chipset will be limited to a
measly 64K max transfer any more.  AHCI certainly has no issue doing
in excess of a megabyte. The limit is something like 65535 chained
entries for AHCI. I forget what the spec says exactly but it's
basically more than we'd ever really need.  Nobody should really care
about the performance of a chipset that is limited to a 64K max
transfer.
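Back-of-the-envelope, assuming 65535 PRD entries per command and up to
4MB described by each entry (my recollection of the spec, so treat the
exact figures with suspicion):

/*
 * Rough upper bound on a single AHCI command's transfer size under
 * the assumptions above.  The point is the order of magnitude, not
 * the exact number.
 */
#include <stdio.h>

int
main(void)
{
    unsigned long long entries   = 65535;
    unsigned long long per_entry = 4ULL * 1024 * 1024;  /* 4MB */
    unsigned long long total     = entries * per_entry;

    printf("max single-command transfer ~ %llu bytes (~%llu GB)\n",
        total, total >> 30);
    return 0;
}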
As long as the cluster code knows what the device can do and the
filesystem doesn't try to use a larger block size than the device is
capable of in a single BIO, the cluster code will make up the
difference for any device-based limitations.
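Conceptually something like this (hypothetical helper, not the real
cluster code):

/*
 * Sketch: split one oversized request into chunks the device can
 * accept in a single BIO.
 */
#include <stdio.h>
#include <stddef.h>

#define DEV_MAXIO   (128 * 1024)     /* assumed per-BIO device limit */

static void
issue_bio(size_t offset, size_t len)
{
    /* stand-in for handing a BIO to the driver */
    printf("BIO: offset=%zu len=%zu\n", offset, len);
}

static void
split_io(size_t offset, size_t len)
{
    while (len > 0) {
        size_t chunk = (len < DEV_MAXIO) ? len : DEV_MAXIO;
        issue_bio(offset, chunk);
        offset += chunk;
        len    -= chunk;
    }
}

int
main(void)
{
    split_io(0, 512 * 1024);     /* a 512K request becomes 4 BIOs */
    return 0;
}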
-Matt