read vs. mmap (or io vs. page faults)
Matthew Dillon
dillon at apollo.backplane.com
Sun Jun 20 11:35:19 PDT 2004
:Hello!
:
:I'm writing a message-digest utility, which operates on files and
:can use either stdio:
:
: while (not eof) {
: char buffer[BUFSIZE];
: size = read(.... buffer ...);
: process(buffer, size);
: }
:
:or mmap:
:
: buffer = mmap(... file_size, PROT_READ ...);
: process(buffer, file_size);
:
:I expected the second way to be faster, as it is supposed to avoid
:one memory copying (no user-space buffer). But in reality, on a
:CPU-bound (rather than IO-bound) machine, using mmap() is considerably
:slower. Here are the tcsh's time results:
read() is likely going to be faster because it does not involve any
page fault overhead. The VM system only faults 16 or so pages ahead
which is only 64KB, so the fault overhead is very high for the data rate.
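
To put a rough number on that, here is a back-of-the-envelope sketch in C. It assumes 4KB pages (which is what 16 pages = 64KB works out to), that none of the file is mapped in yet, and that every fault maps exactly one 16-page cluster. The name estimated_faults() is made up for illustration; real fault counts depend on what is already resident and on how the kernel clusters.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated number of page faults needed to touch a whole mapping once,
 * assuming nothing is mapped in yet and each fault maps pages_per_fault
 * pages (both are assumptions, not measured values). */
uint64_t estimated_faults(uint64_t file_size, uint64_t page_size,
                          uint64_t pages_per_fault)
{
    uint64_t cluster;

    assert(page_size > 0 && pages_per_fault > 0);
    cluster = page_size * pages_per_fault;      /* bytes mapped per fault */
    return (file_size + cluster - 1) / cluster; /* round up */
}
```

Under those assumptions, estimated_faults(1ULL << 30, 4096, 16) comes to 16384 faults for a 1GB file.
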
Why does the extra copy not matter? Well, it's fairly simple, actually.
It's because your buffer is smaller than the L1 cache, and/or also simply
because the VM fault overhead is higher than the time it would take to copy
an extra 64KB.
read() loops typically use buffer sizes in the 8K-64K range. L1 caches
are typically 16K (for celeron class cpus) through 64K, or more for
higher end cpus. L2 caches are typically 256K-1MB, or more. The copy
bandwidth from or to the L1 cache is usually around 10x faster than main
memory and the copy bandwidth from or to the L2 cache is usually
around 4x faster. (Note that I'm talking copy bandwidth here, not random
access. The L1 cache is ~50x faster or more for random access).
So the cost of the extra copy in a read() loop using a reasonable buffer
size (~8K-64K) (L1 or L2 access) is virtually nil compared to the cost
of accessing the kernel's buffer cache (which involves main memory
accesses for files > L2 cache).
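
For reference, a read() loop of the kind being discussed might look like the sketch below. The 64KB BUFSIZE is an assumed value taken from the upper end of that range, and process() here is a hypothetical stand-in that just sums bytes, not a real digest. The only point is that the same small buffer gets reused on every pass, so it tends to stay cache-resident.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define BUFSIZE (64 * 1024)  /* assumed: upper end of the 8K-64K range */

/* Hypothetical stand-in for process(): adds every byte to a running sum. */
static void process(const unsigned char *buf, size_t len, uint64_t *sum)
{
    for (size_t i = 0; i < len; i++)
        *sum += buf[i];
}

/* Reads the file at path through one small, reused buffer.
 * Returns 0 on success and -1 on error. */
int sum_file_read(const char *path, uint64_t *sum)
{
    static unsigned char buffer[BUFSIZE];
    ssize_t n;
    int fd;

    assert(path != NULL && sum != NULL);
    *sum = 0;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    for (;;) {
        n = read(fd, buffer, sizeof(buffer));
        if (n == 0)
            break;                  /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, just retry */
            close(fd);
            return -1;
        }
        process(buffer, (size_t)n, sum);
    }
    close(fd);
    return 0;
}
```
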
:On the IO-bound machine, using mmap is only marginally faster:
:
: Single Pentium4M (Centrino 1GHz) running recent -current:
: --------------------------------------------------------
:stdio: 27.195u 8.280s 1:33.02 38.1% 10+169k 11221+0io 1pf+0w
:mmap: 26.619u 3.004s 1:23.59 35.4% 10+169k 47+0io 19463pf+0w
Yes, because it's I/O bound. As long as the kernel queues some readahead
to the device it can burn those cpu cycles on whatever it wants without
really affecting the transfer rate.
:Is this how things are supposed to be, or will mmap() become more
:efficient eventually? Thanks!
:
: -mi
It's hard to say. mmap() could certainly be made more efficient, e.g.
by faulting in more pages at a time to reduce the actual fault rate.
But it's fairly difficult to beat a read copy into a small buffer.
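
One thing an application can try on its side is to tell the kernel that it will walk the mapping sequentially. The sketch below does that with posix_madvise(). Whether the hint actually changes how many pages get mapped per fault is up to the kernel, so this is an assumption to be measured, not a guaranteed win. process() is a hypothetical byte-summing stand-in for a real digest and sum_file_mmap() is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_madvise() on strict compilers */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for process(): adds every byte to a running sum. */
static void process(const unsigned char *buf, size_t len, uint64_t *sum)
{
    for (size_t i = 0; i < len; i++)
        *sum += buf[i];
}

/* Maps the whole file read-only and processes it in place.
 * Returns 0 on success and -1 on error. */
int sum_file_mmap(const char *path, uint64_t *sum)
{
    struct stat st;
    void *map;
    size_t len;
    int fd;

    assert(path != NULL && sum != NULL);
    *sum = 0;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }
    len = (size_t)st.st_size;
    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return -1;
    /* Advisory only: the kernel is free to ignore this hint. */
    (void)posix_madvise(map, len, POSIX_MADV_SEQUENTIAL);
    process(map, len, sum);
    munmap(map, len);
    return 0;
}
```
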
-Matt
Matthew Dillon
<dillon at backplane.com>