read vs. mmap (or io vs. page faults)

Sun Jun 20 18:35:19 GMT 2004

:Hello!
:
:I'm writing a message-digest utility, which operates on a file and
:can use either stdio:
:
:	while (not eof) {
:		char buffer[BUFSIZE];
:		size = read(.... buffer ...);
:		process(buffer, size);
:	}
:
:or mmap:
:
:	buffer = mmap(... file_size, PROT_READ ...);
:	process(buffer, file_size);
:
:I expected the second way to be faster, as it is supposed to avoid
:one memory copying (no user-space buffer). But in reality, on a
:CPU-bound (rather than IO-bound) machine, using mmap() is considerably
:slower. Here are the tcsh's time results:

    read() is likely going to be faster because it does not involve any
    page fault overhead.  The VM system only faults in 16 or so pages ahead,
    which is only 64KB, so the fault overhead is very high for the data rate.
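
    To put a rough number on that: assuming 4KB pages (64KB / 16) and
    assuming every fault brings in one full 64KB window, a sequential pass
    over a mapping costs about one fault per 64KB.  The helper name and its
    parameters below are made up for illustration; this is a
    back-of-the-envelope sketch, not how the VM system accounts for faults.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimate the number of page faults taken while touching a mapping
 * sequentially, assuming each fault brings in pages_per_fault pages
 * (the faulting page plus fault-ahead).  Illustrative only.
 */
static size_t
faults_for_sequential_scan(size_t file_size, size_t page_size,
    size_t pages_per_fault)
{
	size_t window;

	assert(page_size > 0 && pages_per_fault > 0);
	window = page_size * pages_per_fault;	/* e.g. 16 * 4096 = 64KB */
	return (file_size + window - 1) / window;	/* round up */
}
```

    Under those assumptions a 1GB file works out to 16384 faults
    (faults_for_sequential_scan((size_t)1 << 30, 4096, 16)), each one a
    trip through the fault path that a read() loop never takes.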

    Why does the extra copy not matter?  Well, it's fairly simple, actually.
    It's because your buffer is smaller than the L1 cache, and/or also simply
    because the VM fault overhead is higher than the cost of copying
    an extra 64KB.

    read() loops typically use buffer sizes in the 8K-64K range.  L1 caches
    are typically 16K (for celeron class cpus) through 64K, or more for
    higher end cpus.  L2 caches are typically 256K-1MB, or more.  The copy
    bandwidth from or to the L1 cache is usually around 10x faster than main
    memory and the copy bandwidth from or to the L2 cache is usually
    around 4x faster.  (Note that I'm talking copy bandwidth here, not random
    access.  The L1 cache is ~50x faster or more for random access).

    So the cost of the extra copy in a read() loop using a reasonable buffer
    size (~8K-64K) (L1 or L2 access) is virtually nil compared to the cost
    of accessing the kernel's buffer cache (which involves main memory
    accesses for files > L2 cache).
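
    A minimal sketch of such a read() loop, assuming a 64KB buffer.  The
    names BUFSIZE, process() and digest_read() are placeholders, and
    process() is just a running byte sum standing in for the real digest:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BUFSIZE	(64 * 1024)	/* small enough to stay cache resident */

/* Hypothetical stand-in for the digest: a running sum of the bytes. */
static uint32_t
process(uint32_t sum, const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Returns 0 and stores the sum in *out on success, -1 on error. */
static int
digest_read(const char *path, uint32_t *out)
{
	static unsigned char buffer[BUFSIZE];	/* reused for every read() */
	uint32_t sum = 0;
	ssize_t n;
	int fd;

	assert(out != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* The kernel copies into the same small buffer on every pass. */
	while ((n = read(fd, buffer, sizeof(buffer))) > 0)
		sum = process(sum, buffer, (size_t)n);
	close(fd);
	if (n < 0)
		return -1;
	*out = sum;
	return 0;
}
```

    The same 64KB of user memory is overwritten on each iteration, so the
    destination of the copy and the data process() reads stay in cache.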

:On the IO-bound machine, using mmap is only marginally faster:
:
:	Single Pentium4M (Centrino 1GHz) running recent -current:
:	--------------------------------------------------------
:stdio:	27.195u 8.280s 1:33.02 38.1%    10+169k 11221+0io 1pf+0w
:mmap:	26.619u 3.004s 1:23.59 35.4%    10+169k 47+0io 19463pf+0w

    Yes, because it's I/O bound.  As long as the kernel queues some readahead
    to the device it can burn those cpu cycles on whatever it wants without
    really affecting the transfer rate.

:Is this how things are supposed to be, or will mmap() become more
:efficient eventually? Thanks!
:
:	-mi

    It's hard to say.  mmap() could certainly be made more efficient, e.g.
    by faulting in more pages at a time to reduce the actual fault rate.
    But it's fairly difficult to beat a read copy into a small buffer.
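
    For comparison, a sketch of the mmap() variant.  The posix_madvise()
    call is an addition that was not part of the test above: it is only a
    hint that the mapping will be read sequentially, and whether it makes
    the kernel fault in more pages at a time is system dependent.
    digest_mmap() is a placeholder name and the byte sum again stands in
    for the real digest:

```c
#define _POSIX_C_SOURCE 200809L	/* for posix_madvise() under strict -std */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 and stores the byte sum in *out on success, -1 on error. */
static int
digest_mmap(const char *path, uint32_t *out)
{
	struct stat st;
	uint32_t sum = 0;
	int fd;

	assert(out != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size > 0) {	/* mmap() rejects a zero length */
		size_t i, len = (size_t)st.st_size;
		unsigned char *p;

		p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			close(fd);
			return -1;
		}
		/* Advisory only; the result is deliberately ignored. */
		(void)posix_madvise(p, len, POSIX_MADV_SEQUENTIAL);
		for (i = 0; i < len; i++)	/* first touch of a page may fault */
			sum += p[i];
		munmap(p, len);
	}
	close(fd);
	*out = sum;
	return 0;
}
```

    Timing this against a read() loop on the same file, with and without
    the hint, is the only way to know whether it helps on a given system.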

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>