read vs. mmap (or io vs. page faults)

Mon Jun 21 22:08:46 PDT 2004

On Monday 21 June 2004 08:15 pm, Matthew Dillon wrote:

= :The mmap interface is supposed to be more efficient -- theoretically
= :-- because it requires one less buffer-copying, and because it
= :(together with the possible madvise()) provides the kernel with more
= :information thus enabling it to make better (at least -- no worse)
= :decisions.

=     Well, I think you forgot my earlier explanation regarding buffer
=     copying.  Buffer copying is a very cheap operation if it occurs
=     within the L1 or L2 cache, and that is precisely what is happening
=     when you read() int

This could explain why using mmap is not faster than read, but it does
not explain why it is slower.

I'm afraid your vast knowledge of the internals of the kernel obscures
your vision. I, on the other hand, "enjoy" an almost total ignorance of
them, and can see that the mmap interface _allows_ for a more (certainly,
no _less_) efficient handling of the IO than read. That the kernel is not
using all the information passed to it, I can only explain by
deficiencies/simplicity of the implementation.

This is, sort of, self-perpetuating -- as long as mmap is slower/less
reliable, applications will be hesitant to use it, thus there will be
little incentive to improve it. :-(

=     As you can see by your timing results, even on your fastest box,
=     processing a file around that size is only going to incur 1-2
=     seconds of real time overhead to do the extra buffer copy. 2
=     seconds is a hard number to beat.

I'd rather call attention to my slower -- CPU-bound boxes. On them, the
total CPU time spent computing md5 of a file is less with mmap -- by a
noticeable margin. But because the CPU is underutilized, the elapsed "wall
clock" time is higher.

As for cache usage, having to do a cache-to-cache copy doubles the
amount of cache used, stealing it from other processes/kernel tasks.

Here, again, are the numbers from my first comparison on the P2 400MHz:

	stdio: 56.837u 34.115s 2:06.61 71.8%   66+193k 11253+0io 3pf+0w
	mmap:  72.463u  7.534s 2:34.62 51.7%   5+186k  105+0io   22328pf+0w

91 vs. 80 seconds CPU time (12% win for mmap), but 126 vs. 154 elapsed
(22% loss)? Why is the CPU so underutilized in the mmap case? There was
nothing else running at the time. The CPU was, indeed, at about 88%
utilization, according to top. This alone seems to invalidate some of
what you are saying below about the immediate disadvantages of mmap on a
modern CPU.

Or is P2 400MHz not modern? Maybe, but the very modern Sparcs, on which
FreeBSD intends to run, are not much faster.

=     The mmap interface is not supposed to be more efficient, per se.
=     Why would it be?

Puzzling question. Because the kernel is supplied with more information
-- it knows that I only plan to _read_ from the memory (PROT_READ),
the total size of what I plan to read (mmap's len and, optionally,
madvise's len), and (optionally) that I plan to read sequentially
(MADV_SEQUENTIAL).

With that information, the kernel should be able to decide how many
pages to pre-fault in, and what to drop and when.
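
To make the hints concrete, here is a minimal sketch of passing all three
to the kernel. The helper name sum_file_mmap and the byte-summing loop
(standing in for the md5 work) are made up for illustration, and whether
a given kernel acts on the advice is exactly what is in question:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes madvise() on glibc in strict -std modes */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sums every byte of a file through a read-only
 * mapping. Returns 0 and stores the result in *sum, or -1 on error. */
int sum_file_mmap(const char *path, unsigned long *sum)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    *sum = 0;
    if (st.st_size == 0) {          /* a zero-length mmap is an error */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;

    /* PROT_READ says "read only"; len says how much will be read. */
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    /* Purely advisory: the kernel is free to ignore it. */
    (void)madvise(p, len, MADV_SEQUENTIAL);

    for (size_t i = 0; i < len; i++)
        *sum += p[i];

    munmap(p, len);
    return 0;
}
```

Calling sum_file_mmap("dump.bin", &s) on some file (the name is only an
example) makes one pass over the mapping with no application buffer.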

Mmap also needs no CPU data-cache to read. If the device is capable of
writing to memory directly (DMA?), the CPU does not need to be involved
at all, while with read the data still has to go from the DMA-filled
kernel buffer to the application buffer -- there being two copies of it
in cache instead of none for just storing or one copy for processing.

Also, in case of RAM shortage, mmap-ed pages can be just dropped, while
the too large buffer needs to be written into swap.

And mmap requires no application buffers -- win, win, and win. Is there
an inherent "lose" somewhere that I don't see? Like:

=   On a modern cpu, where an L1 cache copy is a two cycle streaming
=   operation, the several hundred (or more) cycles it takes to process
=   a page fault or even just populate the page table is equivalent to a
=   lot of copied bytes.

But each call to read also takes cycles -- in the user space (read()
function) and in the kernel (the syscall). And there are a lot of them
too...
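
For a rough idea of how many calls that is, here is a sketch that reads a
file in fixed-size chunks and counts the read() calls. The helper name
count_read_calls and the choice of buffer size are assumptions for
illustration; stdio picks its own buffer size:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: reads the whole file in bufsize chunks.
 * Returns the number of read() calls issued (including the final one
 * that reports end-of-file) and stores the byte count in *total.
 * Returns -1 on error. */
long count_read_calls(const char *path, size_t bufsize, unsigned long *total)
{
    if (bufsize == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char *buf = malloc(bufsize);   /* the application buffer */
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long calls = 0;
    *total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, bufsize); /* one syscall per iteration */
        calls++;
        if (n < 0) {                        /* error: give up */
            calls = -1;
            break;
        }
        if (n == 0)                         /* end of file */
            break;
        *total += (unsigned long)n;
    }

    free(buf);
    close(fd);
    return calls;
}
```

With a 4096-byte buffer, a 1GB file costs over 260,000 trips into the
kernel. A larger buffer reduces the count, but costs more memory and
more cache.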

=     mmap() is not designed to streamline large demand-page reads of
=     data sets much larger than main memory.

Then it was not designed to take advantage of all the possibilities of
the interface, I say.

=     mmap() works best for data that is already cached in the kernel,
=     and even then it still has a fairly large hurdle to overcome vs a
=     streaming read(). This is a HARDWARE limitation.

Wait, HARDWARE? Which hardware issues are we talking about? You
suggested that I pre-fault in the pages, and Julian explained how best
to do it. If that is, indeed, the solution, why is the kernel not doing
it for me, pre-faulting in the same number of bytes that read pre-reads?
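
For reference, one common way to pre-fault from user space is to touch a
single byte in every page of the region before processing it. This is
only a sketch of that general idea, not necessarily the method Julian
described; the helper name prefault_pages and the 4096-byte fallback are
assumptions:

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: reads one byte from each page of [base, base+len)
 * so that the pages are faulted in before the real processing starts.
 * base is expected to be page-aligned, as returned by mmap().
 * Returns the number of pages touched. */
unsigned long prefault_pages(const volatile unsigned char *base, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;  /* fallback is an assumption */

    unsigned long touched = 0;
    unsigned char sink = 0;

    for (size_t off = 0; off < len; off += page) {
        sink ^= base[off];   /* volatile read: the access cannot be elided */
        touched++;
    }

    (void)sink;              /* the value itself is never needed */
    return touched;
}
```

An application would call this on a window of the mapping just ahead of
the data it is about to process. That is precisely the bookkeeping I
would expect the kernel to do on its own.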

=     15% is nothing anyone cares about except perhaps gamers. I
=     certainly couldn't care less about 15%. 50%, on the other hand,
=     is something that I would care about.

Well, here we have a server dedicated to storing compressed database
dumps. If it could compress these dumps 15% faster, we would certainly
be happier. We don't care how quickly it can read/write them -- they
are backups -- but we dump/compress them nightly and it has to finish by
the next night.

Also, consider the assembler implementations of the parts of OpenSSL --
for Alphas, 586, 686, and some Sparcs. People wrote them, maintain them,
use them -- and gain just about as much. 15% is good.

Some business operations require CPU-intensive work done nightly -- and
15% is the difference between 9 and 10 hours, which can mean being or
not being able to finish overnight.

A database that returns results 15%, nay, even 5% faster is also a
better database.

What are we arguing about? Who wouldn't take a 2.2GHz processor over a
2GHz one -- other things being equal -- and they are?

=     It would be nice if someone were to improve the NFS getpages interface.
=     I might do it myself, if I can find the time down the road.

Something good may still come out of this thread...

	-mi