DFLTPHYS vs MAXPHYS
Matthew Dillon
dillon at apollo.backplane.com
Mon Jul 6 01:14:21 UTC 2009
I think MAXPHYS, or the equivalent, is still used somewhat in the
clustering code. The number of buffers the clustering code decides to
chain together dictates the impact on the actual device. The relevancy
here has very little to do with cache smashing and more to do with
optimizing disk seeks (or network latency). There is no best value for
this. It is only marginally more interesting for a network interface
due to the fact that most links still run with absurdly small MTUs
(even 9000+ is absurdly small). It is entirely uninteresting for
a SATA or other modern disk link.
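(As a rough illustration only, not the actual clustering code: the
decision amounts to clamping the number of chained buffers so the
combined transfer stays under a MAXPHYS-style limit.  The names and
constants below are made up.)

/*
 * Sketch: how a cluster-read path might decide how many contiguous
 * buffers to chain, capped by a MAXPHYS-like transfer limit.
 */
#include <stdio.h>

#define BLKSIZE   16384          /* hypothetical filesystem block size */
#define MAX_XFER  (128 * 1024)   /* hypothetical max physical transfer */

static int
cluster_chain(int nblocks_wanted)
{
    int maxblocks = MAX_XFER / BLKSIZE;    /* cap on chained buffers */
    return (nblocks_wanted < maxblocks) ? nblocks_wanted : maxblocks;
}

int
main(void)
{
    int n = cluster_chain(20);
    printf("chain %d of 20 requested blocks (%d byte transfer)\n",
        n, n * BLKSIZE);
    return 0;
}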
For linear transfers you only need a value sufficiently large to reduce
the impact of command overhead on the cpu and achieve the device's
maximum linear transfer rate.  For example, doing a dd with bs=512
versus bs=32k.  It runs on a curve and there will generally be very
little additional bang for the buck beyond 64K for a linear transfer
(assuming read ahead and NCQ to reduce inter-command latency).
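To make the shape of that curve concrete, here's a toy model with
purely invented numbers (100 MB/s streaming rate, 100 microseconds of
fixed per-command overhead), nothing measured:

/*
 * Toy model, not a benchmark: effective linear throughput as a
 * function of block size when each command carries a fixed
 * per-command overhead.  The numbers are made up just to show the
 * shape of the curve.
 */
#include <stdio.h>

int
main(void)
{
    double bandwidth = 100e6;    /* assumed device streaming rate, B/s */
    double overhead  = 100e-6;   /* assumed fixed cost per command, s  */

    for (int bs = 512; bs <= 1024 * 1024; bs *= 2) {
        double t = overhead + (double)bs / bandwidth;
        printf("bs=%7d  effective %6.1f MB/s\n", bs, bs / t / 1e6);
    }
    return 0;
}

With those made-up numbers you're already at roughly 87% of the
streaming rate by 64K and the curve flattens out from there, which is
the "little additional bang for the buck" point.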
For random and semi-random transfers larger buffer sizes have two
impacts.  The first is a negative impact on seek times.  A random
seek-read of 16K is faster than a random seek-read of 64K, which is
faster than a random seek-read of 512K.  I did a ton of testing with
HAMMER and it just
didn't make much sense to go beyond 128K, frankly, but neither does it
make sense to use something really tiny like 8K. 32K-128K seems to be
the sweet spot. The second is a positive impact on reducing the total
number of seeks *IF* you have reasonable cache locality of reference.
There is no correct value, it depends heavily on the access pattern.
A random access pattern with very little locality of reference will
benefit from a smaller block size while a random access pattern with
high locality of reference will benefit from a larger block size. That's
all there is to it.
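If you want to play with the trade-off, here's a crude model (all
numbers invented; in reality locality also falls off as the block size
grows, which is where the 32K-128K sweet spot comes from):

/*
 * Toy model of the random-I/O trade-off.  "locality" is the fraction
 * of the other records in a fetched block that turn out to be useful:
 * 0 means only the record we seeked for gets used, 1 means the whole
 * block gets used.  All figures are made up.
 */
#include <stdio.h>

int
main(void)
{
    double seek_time = 8e-3;     /* assumed average seek+rotate, s */
    double bandwidth = 100e6;    /* assumed transfer rate, B/s     */
    int    recsize   = 4096;     /* assumed logical record size    */
    int    nrecords  = 100000;   /* records the workload touches   */
    double localities[] = { 0.0, 0.5, 0.9 };

    for (int i = 0; i < 3; i++) {
        double loc = localities[i];
        printf("locality %.1f:\n", loc);
        for (int bs = 8192; bs <= 512 * 1024; bs *= 2) {
            double useful = 1.0 + loc * ((double)bs / recsize - 1.0);
            double seeks  = nrecords / useful;
            double total  = seeks * (seek_time + (double)bs / bandwidth);
            printf("  bs=%7d  %8.1f seeks  %7.1f s total\n",
                bs, seeks, total);
        }
    }
    return 0;
}

With zero locality the smallest block size wins outright; with high
locality the larger block sizes win because they collapse the number
of seeks.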
I have a fairly negative opinion of trying to tune block size to cpu
caches. I don't think it matters nearly as much as tuning it to the
seek/locality-of-reference performance curve, and I don't feel that
contrived linear tests are all that interesting since they don't really
reflect real-life workloads.
On-drive caching has an impact too, but that's another conversation.
Vendors have been known to intentionally degrade drive cache performance
on consumer drives versus commercial drives.  In testing HAMMER I've
often hit limitations which seem to be contrived by vendors: a sane
drive cache would have allowed me to use a smaller block size and still
get the locality of reference, but I wind up having to use a larger one
because the drive cache doesn't behave sanely.
--
The DMA ability of modern devices and device drivers is pretty much moot
as no self respecting disk controller chipset will be limited to a
measly 64K max transfer any more.  AHCI certainly has no issue doing
in excess of a megabyte. The limit is something like 65535 chained
entries for AHCI. I forget what the spec says exactly but it's
basically more than we'd ever really need.  Nobody should really care
about the performance of a chipset that is limited to a 64K max
transfer.
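Back-of-the-envelope, assuming 65535 PRD entries per command and up to
4MB described by each entry (my recollection of the spec, so treat the
exact figures with suspicion):

/*
 * Rough upper bound on a single AHCI command's transfer size under
 * the assumptions above.  The point is the order of magnitude, not
 * the exact number.
 */
#include <stdio.h>

int
main(void)
{
    unsigned long long entries   = 65535;
    unsigned long long per_entry = 4ULL * 1024 * 1024;  /* 4MB */
    unsigned long long total     = entries * per_entry;

    printf("max single-command transfer ~ %llu bytes (~%llu GB)\n",
        total, total >> 30);
    return 0;
}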
As long as the cluster code knows what the device can do and the
filesystem doesn't try to use a larger block size than the device is
capable of in a single BIO, the cluster code will make up the
difference for any device-based limitations.
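Conceptually something like this (hypothetical helper, not the real
cluster code):

/*
 * Sketch: split one oversized request into chunks the device can
 * accept in a single BIO.
 */
#include <stdio.h>
#include <stddef.h>

#define DEV_MAXIO   (128 * 1024)     /* assumed per-BIO device limit */

static void
issue_bio(size_t offset, size_t len)
{
    /* stand-in for handing a BIO to the driver */
    printf("BIO: offset=%zu len=%zu\n", offset, len);
}

static void
split_io(size_t offset, size_t len)
{
    while (len > 0) {
        size_t chunk = (len < DEV_MAXIO) ? len : DEV_MAXIO;
        issue_bio(offset, chunk);
        offset += chunk;
        len    -= chunk;
    }
}

int
main(void)
{
    split_io(0, 512 * 1024);     /* a 512K request becomes 4 BIOs */
    return 0;
}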
-Matt