Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

Matthew Dillon dillon at apollo.backplane.com
Thu Mar 23 20:48:21 UTC 2006


:Actually, I cannot agree here -- quite the opposite seems true. When running
:locally (no NFS involved) my compressor with the `-1' flag (fast, least
:effective compression), the program easily compresses faster than it can
:read.
:
:The Opteron CPU is about 50% idle, *and so is the disk*, producing only 15MB/s.
:I guess, despite the noise I raised on this subject a year ago, reading via
:mmap continues to ignore MADV_SEQUENTIAL and has no other adaptability.
:
:Unlike read, which uses buffering, mmap-reading still does not pre-fault the
:file's pages efficiently :-(
:
:Although the program was written to compress files that are _likely_ still in
:memory, when used with regular files it exposes the lack of mmap
:optimization.
:
:This should be even more obvious, if you time searching for a string in a 
:large file using grep vs. 'grep --mmap'.
:
:Yours,
:
:	-mi
:
:http://aldan.algebra.com/~mi/mzip.c

    Well, I don't know about FreeBSD, but both grep cases work just fine on
    DragonFly.  I can't test mzip.c because I don't see the compression
    library you are calling (maybe that's a FreeBSD thing).  The results
    of the grep test ought to be similar for FreeBSD since the heuristic
    used by both OSes is the same.  If they aren't, something might have
    gotten nerfed accidentally in the FreeBSD tree.

    Here is the cache case test.  mmap is clearly faster (though I would
    again caution that this should not be an implicit assumption since
    VM fault overheads can rival read() overheads, depending on the
    situation).

    The 'x1' file in all tests below is simply /usr/share/dict/words
    concatenated over and over again to produce a large file.
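
    As a point of reference, here is a minimal sketch of my own (not part
    of the original test) of the two access paths being compared: a read()
    loop that copies the data through a small buffer, and an mmap() scan
    that relies on the VM system to fault the data in.  The buffer size
    and the default file name are arbitrary illustrative choices.

/*
 * Illustrative sketch only: scan a file once with read() and once
 * with mmap().  Roughly what grep and grep --mmap do internally.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long scan_read(const char *path)
{
    char buf[65536];                    /* arbitrary buffer size */
    unsigned long sum = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0) { perror(path); exit(1); }
    /* Each read() copies data from the kernel's cache into buf. */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += (unsigned char)buf[i];
    close(fd);
    return sum;
}

static unsigned long scan_mmap(const char *path)
{
    struct stat st;
    unsigned long sum = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) { perror(path); exit(1); }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    /* Touching the mapping takes VM faults instead of copying. */
    for (off_t i = 0; i < st.st_size; i++)
        sum += (unsigned char)p[i];
    munmap(p, st.st_size);
    close(fd);
    return sum;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "x1";   /* arbitrary default */
    printf("read: %lu\n", scan_read(path));
    printf("mmap: %lu\n", scan_mmap(path));
    return 0;
}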

crater# ls -la x1
-rw-r--r--  1 root  wheel  638228992 Mar 23 11:36 x1
[ machine has 1GB of ram ]

crater# time grep --mmap asdfasf x1
1.000u 0.117s 0:01.11 100.0%    10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.976u 0.132s 0:01.13 97.3%     10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.984u 0.140s 0:01.11 100.9%    10+41k 0+0io 0pf+0w

crater# time grep asdfasf x1
0.601u 0.781s 0:01.40 98.5%     10+42k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.507u 0.867s 0:01.39 97.8%     10+40k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.562u 0.812s 0:01.43 95.8%     10+41k 0+0io 0pf+0w

crater# iostat 1
[ while grep is running, in order to test the cache case and verify that
  no I/O is occurring once the data has been cached ]


    The disk I/O case, which I can test by unmounting and remounting the
    partition containing the file in question, then running grep, seems
    to be well optimized on DragonFly.  It should be similarly optimized
    on FreeBSD since the code that does this optimization is nearly the
    same.  In my test, it is clear that the page-fault overhead in the
    uncached case is greater than the copying overhead of a read(),
    though not by much.  And I would expect that, too.

test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.382u 0.351s 0:10.23 7.1%      55+141k 42+0io 4pf+0w
test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.390u 0.367s 0:10.16 7.3%      48+123k 42+0io 0pf+0w

test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.539u 0.265s 0:10.53 7.5%      36+93k 42+0io 19518pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.617u 0.289s 0:10.47 8.5%      41+105k 42+0io 19518pf+0w
test28# 

test28# iostat 1 during the test showed ~60MBytes/sec for all four tests

    Perhaps you should post specifics of the test you are running, as well
    as specifics of the results you are getting, such as the actual timing
    output instead of a human interpretation of the results.  For that
    matter, since this is an Opteron system, were you running the tests on
    a UP system or an SMP system?  grep is single-threaded, so on a 2-cpu
    system it will show 50% cpu utilization since one cpu will be
    saturated and the other idle.  With specifics, a FreeBSD person can
    try to reproduce your test results.

    A grep vs grep --mmap test is pretty straightforward and should be
    a good test of the VM read-ahead code, but there might always be some
    unknown circumstance specific to a machine configuration that is
    the cause of the problem.  Repeatability and reproducibility by
    third parties are important when diagnosing any problem.

    Insofar as MADV_SEQUENTIAL goes... you shouldn't need it on FreeBSD.
    Unless someone ripped it out since I committed it many years ago, which
    I doubt, FreeBSD's VM heuristic will figure out that the accesses
    are sequential and start issuing read-aheads.  It should pre-fault, and
    it should do read-ahead.  That isn't to say that there isn't a bug, just
    that everyone interested in the problem has to be able to reproduce it
    and help each other track down the source.  Just making an assumption
    and an accusation with regard to the cause of the problem doesn't solve
    it.
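
    For completeness, this is what asking for the hint explicitly looks
    like from a program (a minimal sketch of my own; the file name is an
    arbitrary placeholder).  As noted above, the call is advisory, and on
    FreeBSD the fault heuristic is expected to detect sequential access
    even without it.

/*
 * Illustrative sketch: map a file and hint sequential access.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;
    int fd = open("x1", O_RDONLY);          /* arbitrary file name */

    if (fd < 0 || fstat(fd, &st) < 0) { perror("x1"); return 1; }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Advisory only: declare the intent to read the mapping sequentially. */
    if (madvise(p, st.st_size, MADV_SEQUENTIAL) != 0)
        perror("madvise");

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)  /* sequential touch */
        sum += (unsigned char)p[i];
    printf("%lu\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}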

    The VM system is rather fragile when it comes to read-ahead because
    the only way to do read-ahead on mapped memory is to issue the
    read-ahead and then mark some prior (already cached) page as
    inaccessible in order to be able to take a VM fault and issue the
    NEXT read-ahead before the program exhausts the current cached data.
    It is, in fact, rather complex code, not as straightforward as you
    might expect.
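
    To make the mechanism concrete, here is a user-space analogy of my
    own (it is NOT the actual kernel code): the file is mapped PROT_NONE
    and a SIGSEGV handler opens one window of pages at a time while
    hinting the next window, so each still-inaccessible window boundary
    acts as the "tripwire" fault that lets more read-ahead be queued
    before the cached data runs out.  The window size and file name are
    arbitrary choices for illustration.

/*
 * User-space analogy of faulting-driven read-ahead, for illustration.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static char  *base;
static size_t maplen;
static size_t window;                  /* bytes made accessible per fault */

static void fault_handler(int sig, siginfo_t *si, void *uc)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    (void)uc;

    /* Only handle faults inside our mapping; anything else is a real crash. */
    if (addr < (uintptr_t)base || addr >= (uintptr_t)base + maplen) {
        signal(sig, SIG_DFL);
        return;
    }

    size_t start = (addr - (uintptr_t)base) & ~(window - 1);
    size_t len = (start + window > maplen) ? maplen - start : window;

    /* Open up the window containing the fault... */
    mprotect(base + start, len, PROT_READ);

    /* ...and hint the window after it so its I/O can start early.  That
     * next window stays PROT_NONE, which is the "tripwire" that brings
     * us back here before the already-cached data is exhausted. */
    if (start + window < maplen)
        madvise(base + start + window, window, MADV_WILLNEED);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "x1";   /* arbitrary default */
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) { perror(path); return 1; }
    if (st.st_size == 0) return 0;                  /* nothing to scan */
    maplen = (size_t)st.st_size;
    window = 16 * (size_t)sysconf(_SC_PAGESIZE);    /* arbitrary window */

    /* Map the file with no access at all; every first touch is a fault. */
    base = mmap(NULL, maplen, PROT_NONE, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);      /* some systems deliver SIGBUS here */

    unsigned long sum = 0;
    for (size_t i = 0; i < maplen; i++)             /* sequential scan */
        sum += (unsigned char)base[i];
    printf("scanned %zu bytes, checksum %lu\n", maplen, sum);

    munmap(base, maplen);
    close(fd);
    return 0;
}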

    But I can only caution you, again, against assuming that the
    operating system should optimize your particular test case intuitively,
    like a human would.   Operating systems generally optimize the most
    common cases, but it would be pretty dumb to actually try to make
    them optimize every conceivable case.  You would wind up with hundreds
    of thousands of lines of barely exercised and likely buggy code.

						-Matt


