read vs. mmap (or io vs. page faults)

Mon Jun 21 22:10:23 GMT 2004

=    Both read and mmap have a read-ahead heuristic. The heuristic
=    works. In fact, the mmap heuristic is so smart it can read-behind
=    as well as read-ahead if it detects a backwards scan.

Evidently, read's heuristics are better, at least for this task. I'm
actually surprised they are _different_ at all.

The mmap interface is supposed to be more efficient -- theoretically --
because it requires one less buffer copy, and because it (together
with the possible madvise()) provides the kernel with more information,
thus enabling it to make better (or at least no worse) decisions.
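
To make the comparison concrete, here is a minimal sketch in C of the two
ways a whole-file tool can scan its input. It is not the code of fmd5 or
fgrep: the names scan_with_read, scan_with_mmap and consume are made up
for illustration, and a plain byte sum stands in for the real work. The
read version copies every chunk into a BUFSIZ buffer; the mmap version
has no buffer of its own and only hints at its access pattern with
madvise().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose madvise() under strict -std= modes */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for md5/fgrep work: fold every byte into a sum. */
static uint64_t consume(const unsigned char *p, size_t n, uint64_t acc)
{
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}

/* read(2) path: the kernel copies each chunk into our BUFSIZ buffer. */
int scan_with_read(const char *path, uint64_t *out)
{
    unsigned char buf[BUFSIZ];
    uint64_t acc = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        acc = consume(buf, (size_t)n, acc);
    close(fd);
    if (n < 0)
        return -1;
    *out = acc;
    return 0;
}

/* mmap(2) path: no user buffer; pages are brought in by page faults. */
int scan_with_mmap(const char *path, uint64_t *out)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); *out = 0; return 0; } /* cannot map 0 bytes */
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;
    /* Advisory only: tell the kernel we will walk the file front to back. */
    (void)madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);
    *out = consume(p, (size_t)st.st_size, 0);
    munmap(p, (size_t)st.st_size);
    return 0;
}
```

Both functions return 0 on success and yield the same sum for the same
file, so timing one against the other isolates the cost of the I/O path.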

That these theoretical advantages -- small or not -- are eaten by what
seem like practical implementation deficiencies, to the point that
using mmap is not only not faster but frequently slower wallclock-wise,
is in itself a serious shortcoming that stands between an OS and
perfection.

That other OSes have similar shortcomings simply gives us some breathing
room from an advocacy point of view. I hope my rhetoric will burn an
itch in someone capable of addressing it technically :-)

=    The heuristic does not try to read megabytes and megabytes ahead,
=    however...

Neither does the read-handling.

=    that might speed up this particular application a little, but it
=    would destroy performance for many other types of applications,
=    especially in a loaded environment.

I'm not asking mmap (page fault handling) to cache any more aggressively
than read-handling does.

=    Well now hold a second... the best you can do here is compare relative
=    differences between mmap and read.

This is all I am doing, actually. :-)

=    If you really want to compare operating systems, you have to run the
=    OS's and the tests on the same hardware.

I am comparing relative differences between read and mmap on
different OSes.

=:		4.8-stable on Pentium2-400MHz
=:	mmap: 21.507u 11.472s 1:27.53 37.6%   62+276k 99+0io 44736pf+0w
=:	read: 10.619u 23.814s 1:17.67 44.3%   62+274k 11255+0io 0pf+0w
=
=    mmap 12% slower than read.  12% isn't much.

Well, now we are venturing into the domain of humans' subjective
perception... I'd say 12% is plenty, actually. This is what some people
achieve by rewriting stuff in assembler -- and are proud when it works
:-)
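
For anyone checking the percentages in this thread: they come from the
elapsed (third) field of each time(1) line. Below is a small sketch of
that arithmetic in C. The helper names elapsed_seconds and percent_slower
are mine, purely for illustration. Fed the 4.8-stable pair above
("1:27.53" for mmap, "1:17.67" for read), it gives roughly 12.7%.

```c
#include <stdlib.h>

/* Convert an elapsed field such as "1:27.53" (or "1:02:03.5") to seconds. */
double elapsed_seconds(const char *s)
{
    double total = 0.0;
    char *end;

    for (;;) {
        double v = strtod(s, &end);     /* parse one colon-separated part */
        total = total * 60.0 + v;       /* shift earlier parts up by 60   */
        if (*end != ':')
            break;
        s = end + 1;
    }
    return total;
}

/* By what percentage run a took longer than run b (negative if faster). */
double percent_slower(const char *a, const char *b)
{
    return (elapsed_seconds(a) / elapsed_seconds(b) - 1.0) * 100.0;
}
```

Note that "A is x% slower than B" and "B is y% faster than A" are
different numbers for the same pair of runs, so it matters which run is
the baseline.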

=:		recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
=:	mmap: 12.482u 12.872s 2:28.70 17.0%   74+298k 23+0io 46522pf+0w
=:	read: 7.255u 16.366s 3:27.07 11.4%    70+283k 44437+0io 7pf+0w
=
=    mmap 39% faster.  That's a significant difference.
=
=    It kinda smells funny, actually... are you sure that you compiled
=    your FreeBSD-5 system with Witness turned off?

There are no "WITNESS" options in the kernel's config file (unlike in
NOTES). So, unless there has to be some sort of explicit "NOWITNESS", I
am sure.

=:		recent -current on a Centrino-laptop P4-1GHz (NO win at all)
=:	mmap: 4.197u 3.920s 2:07.57 6.3%      65+284k 63+0io 45568pf+0w
=:	read: 3.965u 4.265s 1:50.26 7.4%      67+291k 13131+0io 17pf+0w
=
=    mmap 15% slower.

=:		Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
=:	mmap: 2.280u 4.800s 1:13.39 9.6%      0+0k 0+0io 512434pf+0w
=:	read: 1.630u 2.820s 0:08.89 50.0%     0+0k 0+0io 396pf+0w
=    
=    mmap 821% slower on Linux?  With a different file?  So these numbers
=    can't be compared to anything else (over and above the fact that this
=    machine is three times faster than any of the others).

No: the file is different (as is the processor), so look at the relative
performance difference only. I was quite surprised myself. My fmd5
program does not show such a dramatic difference, but `fgrep --mmap' is
vastly slower on Linux than the regular `fgrep'. Here are the results of
the two new fgrep runs:

	mmap1: 1.450u 3.000s 0:46.00 9.6%      0+0k 0+0io 512439pf+0w
	read1: 1.830u 2.620s 0:09.51 46.7%     0+0k 0+0io 393pf+0w
	mmap2: 1.700u 4.040s 1:02.31 9.2%      0+0k 0+0io 512427pf+0w
	read2: 1.330u 3.150s 0:09.38 47.7%     0+0k 0+0io 396pf+0w

=    I'm not sure why you are complaining about FreeBSD.

Because I have much higher expectations for it :-) I thought I would be
able to use the powerful technique of presenting Linux's superiority in
some area to fire up rapid improvements in the same area in FreeBSD. Now
I'm back to fighting the "12% gain is not worth the effort" mentality.

=:Once mmap-handling is improved, all sorts of whole-file operations
=:(bzip2, gzip, md5, sha1) can be made faster...

=    Well, your numbers don't really say that. It looks like you might
=    eke out a 10-15% improvement, and while this is faster it really
=    isn't all that much faster. It certainly isn't something to write
=    home about, and certainly not significant enough to warrant major
=    codework.

Put it into perspective -- 10-15% is usually the difference between
the latest processor and the previous one. People are willing to pay
a premium of hundreds of dollars for that...

Besides, the differences can be higher. Here are the results of md5-ing
a 2097272832-byte file over NFS (on a Gigabit network, no jumbo frames).
The machine runs FreeBSD-current on a single P4 2GHz:

	mmap1: 17.115u 16.106s 2:20.84 23.5%   5+166k 0+0io 253421pf+0w
	read1: 19.468u 12.179s 1:27.80 36.0%   4+163k 0+0io 0pf+0w
	mmap2: 17.214u 13.265s 2:13.75 22.7%   5+165k 1+0io 204842pf+0w
	read2: 19.142u 11.576s 1:20.22 38.2%   4+162k 0+0io 4pf+0w

mmap is 60-67% slower (or read finishes in 38-40% less time)! According
to `systat -if', mmap was reading at about 13MB/s, while read was
consistently above 20MB/s.

If this mmap-associated penalty is removed, the applications can save
some memory by not using the BUFSIZ (or bigger) buffers, and the
systems can save the time and effort of shuffling the memory from
kernel buffers into user space (and flushing the instruction and data
caches). The difference can be big -- on a CPU-bound machine the sum
of user time and system time is much smaller with mmap. For example,
md5-ing a 16061698048-byte file on this Solaris box with a Sparc-900MHz
(FreeBSD behaves similarly on the P2 400MHz reported earlier):

	mmap: 215.290u 48.990s 7:18.81 60.2%  0+0k 0+0io 0pf+0w
	read: 184.240u 142.350s 5:46.31 94.3% 0+0k 0+0io 0pf+0w
		(264.28 vs. 326.59 CPU seconds)

but read manages to saturate the CPU better -- 94% vs. 60% -- and win
the "wall clock" race repeatedly...
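
The CPU-seconds and percentage figures for the Solaris runs are simple
to reproduce. Here is a sketch in C; struct run and the two helpers are
hypothetical names of mine, not part of any tool. CPU seconds are user
plus system time, and the percentage time(1) prints is that sum divided
by the elapsed time.

```c
/* One time(1) line reduced to the three fields used here, in seconds. */
struct run {
    double user;    /* user CPU time   */
    double sys;     /* system CPU time */
    double wall;    /* elapsed time    */
};

/* Total CPU time actually burned by the run. */
double cpu_seconds(struct run r)
{
    return r.user + r.sys;
}

/* Share of the elapsed time during which the CPU was busy, in percent. */
double cpu_percent(struct run r)
{
    return 100.0 * cpu_seconds(r) / r.wall;
}
```

With the mmap run entered as {215.290, 48.990, 438.81} and the read run
as {184.240, 142.350, 346.31} (7:18.81 and 5:46.31 converted to seconds),
this yields 264.28 vs. 326.59 CPU seconds and about 60.2% vs. 94.3%.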

Yours,

	-mi