read vs. mmap (or io vs. page faults)

Tue Jun 22 02:21:47 GMT 2004

Matthew Dillon wrote:
> Mikhail Teterin wrote:
>>=    Both read and mmap have a read-ahead heuristic. The heuristic
>>=    works. In fact, the mmap heuristic is so smart it can read-behind
>>=    as well as read-ahead if it detects a backwards scan.
>> 
>> Evidently, read's heuristics are better. At least, for this task. I'm,
>> actually, surprised, they are _different_ at all.

It might be interesting to retry your tests under a Mach kernel.  BSD has 
multiple codepaths for IPC functionality that are unified under Mach.

>> The mmap interface is supposed to be more efficient -- theoretically --
>> because it requires one less buffer-copying, and because it (together
>> with the possible madvise()) provides the kernel with more information
>> thus enabling it to make better (at least -- no worse) decisions.

I've heard people repeat the same notion, namely that "mmap()ing a file is 
supposed to be faster than read()ing it" [1], but the two operations are not 
quite the same thing. More work is done to mmap a file (and thus gain random 
access to any byte of the file by dereferencing memory) than to read and 
process small blocks of data at a time.
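
To make the comparison concrete, here is a minimal sketch of the two access 
patterns, assuming a simple byte-summing workload.  The function names, the 
16 KB buffer size, and the choice of MADV_SEQUENTIAL are my own illustrative 
assumptions, not anything taken from the original tests:

```c
#define _DEFAULT_SOURCE         /* assumption: needed for madvise() on strict glibc builds */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum every byte of a file using read() into a small, fixed buffer.
 * Returns 0 on success, -1 on error. */
int sum_with_read(const char *path, uint64_t *sum)
{
    unsigned char buf[16 * 1024];   /* small enough to stay in L1/L2 cache */
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    *sum = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            *sum += buf[i];
    close(fd);
    return n < 0 ? -1 : 0;
}

/* Sum every byte of a file by mapping the whole file and walking it.
 * Returns 0 on success, -1 on error. */
int sum_with_mmap(const char *path, uint64_t *sum)
{
    struct stat st;
    unsigned char *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    *sum = 0;
    if (st.st_size == 0) {          /* a zero-length mapping is an error */
        close(fd);
        return 0;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;
    /* Advisory only: the kernel is free to ignore this hint. */
    madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);
    for (off_t i = 0; i < st.st_size; i++)
        *sum += p[i];               /* each new page may take a page fault */
    munmap(p, (size_t)st.st_size);
    return 0;
}
```

Both functions produce the same sum for the same file; the difference is that 
the read() version reuses one small buffer, while the mmap() version sets up 
and tears down a mapping and faults pages in as it walks them.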

Matt's right that processing a small block that fits into L1/L2 cache (and 
probably already is resident) is very fast.  The extra copy doesn't matter as 
much as it once did on slower machines, and he's provided some good analysis 
of L1/L2 caching issues and buffer copying speeds.

However, I tend to think that buffer copying speed is likely to be moot when 
you are reading from disk and are thus I/O bound [2]; the manner in which the 
file's contents are presented to the program is not that significant by 
comparison.

---------
[1]: Actually, while it is intuitive to think that by using mmap() and 
madvise() you are telling the system, "hey, I want all of that file read into 
RAM now, as quickly as you can", what happens on systems which use 
demand-paging VM (like FreeBSD, Linux, and most others) is far lazier:

In reality, your process gets nothing but a promise from mmap() that if you 
access the right chunk of memory, your program will unblock once that data has 
been read and faulted into the local address space.  That level of urgency 
doesn't seem to correspond to what you asked for :-), although it still works 
pretty well in practice.
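
One way to watch that laziness is to count the page faults a process takes 
while it touches a fresh mapping.  The following is only a sketch under my own 
assumptions: the function names are made up, it touches one byte per page, and 
the count it reports depends on the kernel (fault clustering, read-ahead, and 
whether the file is already cached all change it):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Total page faults (minor + major) taken by this process so far, or -1. */
static long page_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/* Map a file, touch one byte in every page, and return how many page
 * faults were taken while touching.  Returns -1 on error or empty file. */
long faults_touching_mapping(const char *path)
{
    struct stat st;
    unsigned char *p;
    volatile unsigned char sink = 0;    /* keeps the loads from being optimized out */
    long pagesz, before, after;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    /* mmap() only sets up the mapping; no file data has to be read yet. */
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    pagesz = sysconf(_SC_PAGESIZE);
    before = page_faults();
    for (off_t off = 0; off < st.st_size; off += pagesz)
        sink ^= p[off];                 /* first touch of a page may fault it in */
    after = page_faults();

    munmap(p, (size_t)st.st_size);
    if (pagesz <= 0 || before < 0 || after < 0)
        return -1;
    return after - before;
}
```

A nonzero result means the data was brought into the address space at access 
time, not when mmap() returned.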

[2]: We're talking about maybe 20 to 60 or so MB/s for disk, versus 10x to 
100x that for RAM-to-RAM copying, to say nothing of the L2 copying speeds Matt 
mentions below:

>     Well, I think you forgot my earlier explanation regarding buffer copying.
>     Buffer copying is a very cheap operation if it occurs within the L1 or
>     L2 cache, and that is precisely what is happening when you read() into
>     a fixed buffer in a loop in a C program... your buffer is fixed in
>     memory and is almost guaranteed to be in the L1/L2 cache, which means
>     that the extra copy operation is very fast on a modern processor.  It's
>     something like 12-16 GBytes/sec to the L1 cache on an Athlon 64, for
>     example, and 3 GBytes/sec uncached to main memory.
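
To put rough numbers on footnote [2], here is a back-of-the-envelope sketch. 
The helper names are hypothetical, and it assumes the disk transfer and the 
extra copy do not overlap at all, which overstates the cost of the copy:

```c
/* Seconds needed to move size_mb megabytes at rate_mb_s megabytes/second. */
double transfer_seconds(double size_mb, double rate_mb_s)
{
    return size_mb / rate_mb_s;
}

/* Time spent on the extra copy as a fraction of the time spent waiting
 * on the disk, for the same amount of data.  The size cancels out:
 * (size / copy_rate) / (size / disk_rate) == disk_rate / copy_rate. */
double copy_overhead_fraction(double disk_mb_s, double copy_mb_s)
{
    return disk_mb_s / copy_mb_s;
}
```

With a 60 MB/s disk and Matt's 3 GBytes/sec (3000 MB/s) uncached figure, 
copy_overhead_fraction(60.0, 3000.0) is 0.02, so the extra copy adds about 2% 
to a disk-bound read; at 20 MB/s it is under 1%.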

This has been an interesting discussion, BTW, thanks.

-- 
-Chuck