read vs. mmap (or io vs. page faults)

Mon Jun 21 12:52:56 PDT 2004

:
:= pre-faulting is best done by a worker thread or child process, or it
:= will just slow you down..
:
:Read is also used for large files sometimes, and never tries to prefetch
:the whole file at once. Why can't the same smarts/heuristics be employed
:by the page-fault handling code -- especially, if we are so proud of our
:unified caching?

    Both read and mmap have a read-ahead heuristic.  The heuristic works.
    In fact, the mmap heuristic is so smart it can read-behind as well as
    read-ahead if it detects a backwards scan.  The heuristic does not try 
    to read megabytes and megabytes ahead, however... that might speed up
    this particular application a little, but it would destroy performance
    for many other types of applications, especially in a loaded environment.
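
    As a rough illustration of the two access paths being compared, here
    is a minimal Python sketch (not the fgrep code under discussion).  The
    function names and the CHUNK size are hypothetical.  The madvise()
    call is only a hint; whether and how far the kernel acts on it is
    system-dependent, and the call is skipped where Python does not
    expose it.

```python
import mmap

CHUNK = 64 * 1024  # hypothetical buffer size, chosen only for illustration


def contains_read(path, needle):
    """Sequential scan with read(); the kernel's read-ahead does the prefetching."""
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                return False
            buf = tail + chunk
            if needle in buf:
                return True
            # Keep the last len(needle)-1 bytes so a match that straddles
            # two chunks is still found on the next iteration.
            tail = buf[-(len(needle) - 1):] if len(needle) > 1 else b""


def contains_mmap(path, needle):
    """Scan through a mapping; pages are brought in by page faults.

    Note: mmap.mmap() raises ValueError for an empty file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Advisory only: tell the VM system we intend to scan forward.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            return mm.find(needle) != -1
```

    Both functions return the same answer for the same file; only the
    way the data reaches user space differs.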

:If anything mmap/madvise provide the kernel with _more_ information than
:read -- kernel just does not use it, it seems.
:
:According to my tests (`fgrep string /huge/file' vs. `fgrep --mmap
:string /huge/file') the total CPU time is much less with mmap. But
:sometimes the total "wall clock" time is longer with it because the CPU
:is underutilized, when using the mmap method.

    Well now hold a second... the best you can do here is compare relative
    differences between mmap and read.  All of these machines are different,
    with different cpus and different configurations.  For example, a
    dual-P2 is going to be horrendously bad doing SMP things because the P2's
    locked bus cycle instruction overhead is horrendous.  That is going to
    seriously skew the results.  There are major architectural differences
    between these cpus... cache size, memory bandwidth, MP operations
    overhead, not to mention raw megahertz.  Disk transfer rate and the
    disk bus interface and driver will also make a big difference here,
    as well as the contents of the file you are fgrep'ing.

    If you really want to compare operating systems, you have to run the
    OS's and the tests on the same hardware.

:		4.8-stable on Pentium2-400MHz
:	mmap: 21.507u 11.472s 1:27.53 37.6%   62+276k 99+0io 44736pf+0w
:	read: 10.619u 23.814s 1:17.67 44.3%   62+274k 11255+0io 0pf+0w

    mmap 12% slower than read.  12% isn't much.

:		recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
:	mmap: 12.482u 12.872s 2:28.70 17.0%   74+298k 23+0io 46522pf+0w
:	read: 7.255u 16.366s 3:27.07 11.4%    70+283k 44437+0io 7pf+0w

    mmap 39% faster.  That's a significant difference.

    It kinda smells funny, actually... are you sure that you compiled
    your FreeBSD-5 system with Witness turned off?

:		recent -current on a Centrino-laptop P4-1GHz (NO win at all)
:	mmap: 4.197u 3.920s 2:07.57 6.3%      65+284k 63+0io 45568pf+0w
:	read: 3.965u 4.265s 1:50.26 7.4%      67+291k 13131+0io 17pf+0w

    mmap 15% slower.

:		Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
:	mmap: 2.280u 4.800s 1:13.39 9.6%      0+0k 0+0io 512434pf+0w
:	read: 1.630u 2.820s 0:08.89 50.0%     0+0k 0+0io 396pf+0w

    mmap roughly 725% slower (over 8x the wall time) on Linux?  With a
    different file?  So these numbers can't be compared to anything else
    (over and above the fact that this machine is three times faster than
    any of the others).

    It kinda looks like either you wrote the Linux numbers down wrong,
    or Linux's mmap is much, much worse than FreeBSD's.  I'm not sure why
    you are complaining about FreeBSD.  If I were to assume the read time
    was 1:08.89 instead of 0:08.89 the difference would be 6.5%, which is
    narrower than 15% but not by all that much... a few seconds is nothing
    to quibble over.
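
    The percentages quoted for the fgrep runs come from the elapsed
    (wall clock) column of the time output.  A small sketch of that
    arithmetic, using the convention slower/faster - 1; the helper names
    to_seconds() and gap() are made up for this example, and the figures
    are the ones quoted above:

```python
def to_seconds(stamp):
    """Convert an elapsed time such as '1:27.53' (min:sec) to seconds."""
    secs = 0.0
    for part in stamp.split(":"):
        secs = secs * 60 + float(part)
    return secs


def gap(mmap_elapsed, read_elapsed):
    """Return (slower_method, percent) using slower/faster - 1."""
    m, r = to_seconds(mmap_elapsed), to_seconds(read_elapsed)
    if m == r:
        return ("tie", 0.0)
    slower = "mmap" if m > r else "read"
    return (slower, (max(m, r) / min(m, r) - 1) * 100)


# Elapsed times for the fgrep runs, copied from the quoted output.
RUNS = {
    "4.8-stable P2-400": ("1:27.53", "1:17.67"),
    "-current dual P2 Xeon-450": ("2:28.70", "3:27.07"),
    "-current Centrino laptop": ("2:07.57", "1:50.26"),
    "Linux 2.4.20 dual P4-3GHz": ("1:13.39", "0:08.89"),
}

if __name__ == "__main__":
    for name, (m, r) in RUNS.items():
        slower, pct = gap(m, r)
        print(f"{name}: {slower} is {pct:.1f}% slower")
```

    This only compares wall clock time; it says nothing about the user
    and system CPU columns, which tell a different story.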

:The attached md5-computing program is more CPU consuming than fgrep. It
:wins with mmap even on the "sceptical" Centrino-laptop -- presumably,
:because MD5_Update is not interrupted as much and remains in the
:instruction cache:
:
:	read: 22.024u 8.418s 1:28.44 34.4%    5+166k 10498+0io 4pf+0w
:	mmap: 21.428u 3.086s 1:23.88 29.2%    5+170k 40+0io 19649pf+0w

    mmap is about 5% faster than read here (1:23.88 vs. 1:28.44).

:Once mmap-handling is improved, all sorts of whole-file operations
:(bzip2, gzip, md5, sha1) can be made faster...
:
:	-mi

    Well, your numbers don't really say that.  It looks like you might
    eke out a 10-15% improvement, and while this is faster it really isn't
    all that much faster.  It certainly isn't something to write home about,
    and certainly not significant enough to warrant major codework.

    Though I personally have major issues with FreeBSD-5's performance
    in general, I don't really see that anything stands out in these tests
    except perhaps for FreeBSD-5's horrible MP performance with read() vs
    mmap() on the dual P2 (but I suspect that might be due to some other
    issue such as perhaps Witness being turned on).

    If you really want to get comparative results you have to run all of
    these tests on the same hardware with the same file.  In fact, I would
    run them over a suite of file sizes since a lot of this is going to
    depend on the buffer cache's KVA mappings.  I would still expect Linux
    to beat out FreeBSD-5 fairly handily, but the FreeBSD-4 vs Linux numbers
    would likely be a whole lot closer.
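
    A suite over several file sizes could be sketched along these lines.
    This is only an assumption about how such a harness might look: the
    names, the CHUNK size and the sizes in the example run are all
    hypothetical.  Note that the test files are freshly written and so
    are almost certainly still cached; this measures the cached path, and
    cold-cache runs need the cache flushed by some means outside this
    script.

```python
import mmap
import os
import tempfile
import time

CHUNK = 64 * 1024  # hypothetical buffer size


def count_newlines_read(path):
    """Scan the file with read() calls."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                return total
            total += buf.count(b"\n")


def count_newlines_mmap(path):
    """Scan the file through a read-only mapping (file must be non-empty)."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for off in range(0, len(mm), CHUNK):
                total += mm[off:off + CHUNK].count(b"\n")
    return total


def timed(fn, path):
    start = time.perf_counter()
    result = fn(path)
    return result, time.perf_counter() - start


def run_suite(sizes):
    """Return a list of (size, read_seconds, mmap_seconds), one per size."""
    rows = []
    line = b"x" * 63 + b"\n"
    for size in sizes:
        fd, path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, "wb") as f:
                f.write((line * (size // len(line) + 1))[:size])
            n_read, t_read = timed(count_newlines_read, path)
            n_mmap, t_mmap = timed(count_newlines_mmap, path)
            assert n_read == n_mmap  # both paths must see the same data
            rows.append((size, t_read, t_mmap))
        finally:
            os.unlink(path)
    return rows


if __name__ == "__main__":
    for size, t_read, t_mmap in run_suite([64 << 10, 1 << 20, 16 << 20]):
        print(f"{size:>10} bytes  read {t_read:.4f}s  mmap {t_mmap:.4f}s")
```

    To compare operating systems, the same script and the same sizes
    would have to be run on the same hardware under each OS.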

						-Matt