read vs. mmap (or io vs. page faults)

Tue Jun 22 00:15:27 GMT 2004

:The mmap interface is supposed to be more efficient -- theoretically --
:because it requires one less buffer-copying, and because it (together
:with the possible madvise()) provides the kernel with more information
:thus enabling it to make better (at least -- no worse) decisions.

    Well, I think you forgot my earlier explanation regarding buffer copying.
    Buffer copying is a very cheap operation if it occurs within the L1 or
    L2 cache, and that is precisely what is happening when you read() into
    a fixed buffer in a loop in a C program... your buffer is fixed in
    memory and is almost guaranteed to be in the L1/L2 cache, which means
    that the extra copy operation is very fast on a modern processor.  It's
    something like 12-16 GBytes/sec to the L1 cache on an Athlon 64, for
    example, and 3 GBytes/sec uncached to main memory.

    Consider the cpu time cost, then, of the local copy on a 2GB file...
    at 12 GBytes/sec the cpu time cost on an AMD64 is about 2/12 of one
    second (roughly 0.17 seconds).  This is the number mmap would have
    to beat.

    As you can see by your timing results, even on your fastest box,
    processing a file around that size is only going to incur 1-2 seconds
    of real time overhead to do the extra buffer copy.  2 seconds is a hard
    number to beat.

    This is something you can calculate yourself.  Time a dd from /dev/zero
    to /dev/null.

	crater# dd if=/dev/zero of=/dev/null bs=32k count=8192
	268435456 bytes transferred in 0.244561 secs (1097620804 bytes/sec)

	amd64# dd if=/dev/zero of=/dev/null bs=32k count=8192
	268435456 bytes transferred in 0.066994 secs (4006846790 bytes/sec)

	amd64# dd if=/dev/zero of=/dev/null bs=16m count=32
	536870912 bytes transferred in 0.431774 secs (1243407512 bytes/sec)

    Try it for different buffer sizes (16K through 16MB) and you will get
    a feel for how the L1 and L2 caches affect copying bandwidth.  These
    numbers are reasonably close to the raw memory bandwidth available to
    the cpu (and will be different depending on whether the buffer fits in
    the L1 or L2 caches, or doesn't fit at all).
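
    A rough user-space analogue of the dd test can be sketched in C.  This
    is only an illustration under stated assumptions: it times memcpy()
    between two buffers rather than the kernel zero-fill and copyout that
    dd exercises, and the names copy_bandwidth() and print_copy_table()
    are made up for this sketch.  Absolute numbers will differ from dd,
    but the drop-off as the buffer outgrows each cache level should still
    be visible.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Seconds from a monotonic clock, so wall-clock adjustments do not matter. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Hypothetical helper: copy about `total` bytes through a pair of
 * `bufsize`-byte buffers and return the observed rate in bytes/sec,
 * or 0.0 if allocation fails.
 */
double copy_bandwidth(size_t bufsize, size_t total)
{
    assert(bufsize > 0);
    unsigned char *src = malloc(bufsize);
    unsigned char *dst = malloc(bufsize);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 0xA5, bufsize);

    size_t passes = total / bufsize;
    if (passes == 0)
        passes = 1;

    volatile unsigned char sink = 0;
    double start = now_sec();
    for (size_t i = 0; i < passes; i++) {
        src[0] = (unsigned char)i;      /* vary the input a little each pass */
        memcpy(dst, src, bufsize);
        sink ^= dst[bufsize - 1];       /* discourage dropping the copy */
    }
    double elapsed = now_sec() - start;
    (void)sink;

    free(src);
    free(dst);
    if (elapsed <= 0.0)
        elapsed = 1e-9;                 /* guard against a zero-length interval */
    return (double)passes * (double)bufsize / elapsed;
}

/* Print one line per buffer size from 16K through 16MB, doubling each time. */
void print_copy_table(size_t total)
{
    for (size_t sz = 16 * 1024; sz <= (size_t)16 * 1024 * 1024; sz *= 2)
        printf("%8zu KB  %.0f bytes/sec\n", sz / 1024,
               copy_bandwidth(sz, total));
}
```

    Calling print_copy_table() with a total of a few hundred megabytes
    gives each buffer size enough passes for the timing to settle.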

    The mmap interface is not supposed to be more efficient, per se.  Why
    would it be?  There are overheads involved with mapping the page table
    entries and taking faults to map more.  Even if you pre-mapped everything,
    there are still overheads involved in populating the page table and
    performing invlpg operations on the TLB to reload the entry, and for
    large data sets there is overhead involved with removing page table
    entries and invalidating the pte.  On a modern cpu, where an L1 cache 
    copy is a two cycle streaming operation, the several hundred (or more)
    cycles it takes to process a page fault or even just populate the
    page table is equivalent to a lot of copied bytes.

    This immediately puts mmap() at a disadvantage on a modern cpu, but of
    course it also depends on what the data processing loop itself is
    doing.  If the data processing loop is sensitive to the L1 cache then
    processing larger chunks of data is going to make it more efficient,
    and mmap() can certainly provide that where read() might require buffers
    too large to fit comfortably in the L1/L2 cache.  On the other hand, if
    the processing loop is relatively insensitive to the L1 cache (i.e. it's
    small), then you can afford to process the data in smaller chunks, like
    16K, without any significant penalty.

    mmap() is not designed to streamline large demand-page reads of data
    sets much larger than main memory.  mmap() works best for data that
    is already cached in the kernel, and even then it still has a fairly
    large hurdle to overcome vs a streaming read().  This is a HARDWARE
    limitation.  Drastic action would have to be taken in software to get
    rid of this overhead (we'd have to use 4MB page table entries, which
    come with their own problems).

    The overhead required to manage a large mmap'd data set can skyrocket.
    FreeBSD (and DragonFly) have heuristics that attempt to detect
    sequential operations like this with mmap'd data and to depress the
    page priority behind the read (so: read-ahead and depress-behind), and
    this works, but it only mitigates the additional overhead somewhat; it
    doesn't get rid of it.
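
    For reference, the mmap-based pattern under discussion looks roughly
    like the sketch below.  It is an illustration, not code from this
    thread: sum_file_mmap() is a made-up name, the byte sum stands in for
    real processing such as md5, and posix_madvise() is used as the POSIX
    spelling of the madvise() hint mentioned in the quoted text.  The hint
    is advisory only; every page touched in the loop still has to be
    faulted in and entered into the page table.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a whole file read-only and add up its bytes.
 * Returns the number of bytes processed, or -1 on error.
 */
long long sum_file_mmap(const char *path, uint64_t *sum)
{
    assert(sum != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* mmap() rejects zero-length maps */
        close(fd);
        *sum = 0;
        return 0;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    /* Advisory hint that access will be sequential; failure is harmless. */
    (void)posix_madvise(p, len, POSIX_MADV_SEQUENTIAL);

    uint64_t acc = 0;
    for (size_t i = 0; i < len; i++)    /* pages fault in as they are touched */
        acc += p[i];

    munmap(p, len);
    *sum = acc;
    return (long long)len;
}
```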

    For linear processing of large data sets you almost universally want
    to use a read() loop.  There's no good reason to use mmap().
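
    A minimal sketch of such a read() loop in C follows.  The function
    name sum_file_read() is invented for this example, and the byte sum
    is only a placeholder for whatever processing the program really
    does.  The point is that one small fixed buffer (16K here, per the
    discussion above) is reused for every read(), so it tends to stay
    resident in the L1/L2 cache.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: stream a file through one fixed buffer and add up
 * its bytes.  Returns the number of bytes read, or -1 on error.
 */
long long sum_file_read(const char *path, size_t bufsize, uint64_t *sum)
{
    assert(bufsize > 0 && sum != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char *buf = malloc(bufsize);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long total = 0;
    uint64_t acc = 0;
    for (;;) {
        ssize_t n = read(fd, buf, bufsize);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, just retry */
            total = -1;
            break;
        }
        if (n == 0)
            break;                      /* end of file */
        for (ssize_t i = 0; i < n; i++) /* process the chunk while it is hot */
            acc += buf[i];
        total += n;
    }

    free(buf);
    close(fd);
    if (total >= 0)
        *sum = acc;
    return total;
}
```

    A call such as sum_file_read(path, 16 * 1024, &sum) processes the
    file in 16K chunks; trying other buffer sizes is a one-argument change.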

:=:	read: 10.619u 23.814s 1:17.67 44.3%   62+274k 11255+0io 0pf+0w
:=
:Well, now we are venturing into the domain of humans' subjective
:perception... I'd say, 12% is plenty, actually. This is what some people
:achieve by rewriting stuff in assembler -- and are proud, when it works
::-)

    Nobody is going to stare at their screen for one minute and 17 seconds
    and really care that something might take one minute and 27 seconds instead
    of one minute and 17 seconds.  That's subjective truth.

    The type of test you want to do is this:

    [start timing]
    [read all data into memory]
    [stop timing]	-> print timing results
    [start timing]
    [process all data]
    [stop timing]	-> print timing results

    Now you have something practical you can look at... you can look at the
    I/O bandwidth required to bring the data into memory without the
    complications of whatever processing you are doing on the data being
    mixed in.  *THEN* you can say something more definitive about the
    kernel overhead required to get the data into memory first, because
    you can state exactly what 'bandwidth', or data rate, was achieved
    in getting the data from the disk or kernel caches into
    your program's memory space (faulted in and everything, ready to access).
    You could then compare that to the times required to do it in a mixed
    environment (read-processing loop).  If *THOSE* numbers are hugely 
    different then you can say something definitive about the relative
    efficiency of the mixed mode processing versus just doing pure I/O,
    for both read() and mmap() independently.
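
    The two-phase test outlined above might look something like this in
    C for the read() case.  Treat it as a sketch under assumptions: the
    names load_file(), process_data() and time_two_phases() are made up,
    the byte sum stands in for the real processing, and the whole file
    must fit in memory for the split to make sense.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Seconds from a monotonic clock. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Phase 1: read the whole file into one malloc'd buffer; NULL on error. */
unsigned char *load_file(const char *path, size_t *len_out)
{
    assert(len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }
    size_t len = (size_t)st.st_size;
    unsigned char *data = malloc(len ? len : 1);
    if (data == NULL) {
        close(fd);
        return NULL;
    }

    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, data + got, len - got);
        if (n <= 0)
            break;                      /* error or unexpected end of file */
        got += (size_t)n;
    }
    close(fd);
    if (got != len) {
        free(data);
        return NULL;
    }
    *len_out = len;
    return data;
}

/* Phase 2: placeholder processing, a plain byte sum. */
uint64_t process_data(const unsigned char *data, size_t len)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc += data[i];
    return acc;
}

/* Time each phase separately and print both results; -1 if loading fails. */
int time_two_phases(const char *path)
{
    size_t len = 0;
    double t0 = now_sec();
    unsigned char *data = load_file(path, &len);
    double t1 = now_sec();
    if (data == NULL)
        return -1;
    printf("load:    %zu bytes in %.6f secs\n", len, t1 - t0);

    double t2 = now_sec();
    uint64_t result = process_data(data, len);
    double t3 = now_sec();
    printf("process: %.6f secs (result %llu)\n", t3 - t2,
           (unsigned long long)result);

    free(data);
    return 0;
}
```

    Comparing the two phase times against a combined read-and-process
    loop over the same file gives the mixed-mode comparison described
    above.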

:...

:Put it into perspective -- 10-15% is usually the difference between
:the latest processor and the previous one. People are willing to pay
:hundreds of dollars premium...

    15% is nothing anyone cares about except perhaps gamers.  I certainly
    couldn't care less about 15%.  50%, on the other hand, is something
    that I would care about.  But upgrading isn't just a function of raw
    cpu speed, it's also a function of general improvements in hardware
    and hardware interfaces... usb, usb2, firewire, sata, and so forth.

:...

:Besides, the differences can be higher. Here is from md5-ing a
:2097272832-bytes file over NFS (on a Gigabit network, no jumbo frames).
:The machine runs a FreeBSD-current on a single P4 2GHz:
:
:	mmap1: 17.115u 16.106s 2:20.84 23.5%   5+166k 0+0io 253421pf+0w
:	read1: 19.468u 12.179s 1:27.80 36.0%   4+163k 0+0io 0pf+0w
:	mmap2: 17.214u 13.265s 2:13.75 22.7%   5+165k 1+0io 204842pf+0w
:	read2: 19.142u 11.576s 1:20.22 38.2%   4+162k 0+0io 4pf+0w
:
:mmap is 87% slower (or read is 38% faster)! According to `systat -if',
:mmap was reading at about 13Mb/s, while read was consistently above
:20Mb/s.
:
:If this mmap-associated penalty is removed, the applications can save
:some memory by not using the BUFSIZ (or bigger) buffers, and the
:systems can save the time and effort of shuffling the memory from
:kernel buffers into user space (and flushing the instruction and data
:caches). The difference can be big -- on a CPU bound machine the sum
:of user time and system time is much smaller with mmap. For example,
:on this Solaris box running on Sparc-900MHz md5-ing a 16061698048-byte
:file (FreeBSD behaves similarly on the P2 400MHz reported earlier):
:
:	mmap: 215.290u 48.990s 7:18.81 60.2%  0+0k 0+0io 0pf+0w
:	read: 184.240u 142.350s 5:46.31 94.3% 0+0k 0+0io 0pf+0w
:		(264.28 vs. 326.59 CPU seconds)
:
:but read manages to saturate the CPU better -- 94% vs. 60% -- and win
:the "wall clock" race repeatedly...
:
:Yours,
:
:	-mi

    I think this points to inefficiencies in NFS's getpages() interface over
    its read() interface.  The read() interface (for NFS) definitely has better
    read-ahead characteristics.  The NFS getpages() interface in FreeBSD
    is about as primitive as it is possible to make it and still work, and
    it's only marginally better in DragonFly (we get rid of some KVM allocations
    and deallocations).  In fact, I don't even think the NFS getpages interface
    uses the IODs like the read interface does.  I think it might actually be
    a synchronous interface.

    It would be nice if someone were to improve the NFS getpages interface.
    I might do it myself, if I can find the time down the road.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>