Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

Matthew Dillon dillon at
Fri Mar 24 00:29:40 UTC 2006

:I thought one serious advantage to this situation for sequential read
:mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
:wait for the clock hands to reap them.  On a large Solaris box I used
:to have the non-pleasure of running the VM page scan rate was high, and
:I suggested to the app vendor that proper use of mmap might reduce that
:overhead.  Admittedly the files in question were much smaller than the
:available memory, but they were also not likely to be referenced again
:before the memory had to be reclaimed forcibly by the VM system.
:Is that not the case?  Is it better to let the VM system reclaim pages
:as needed?

    madvise() should theoretically have that effect, but it isn't quite
    so simple a solution.

    Let's say you have, oh, your workstation, with 1GB of RAM, and you
    run a program which runs several passes on a 900MB data set.
    Your X session, xterms, gnome, kde, etc etc etc all take around 300MB
    of working memory.

    Now that data set could fit into memory if portions of your UI were
    pushed out of memory.  The question is not only how much of that data
    set should the kernel fit into memory, but which portions of that data
    set should the kernel fit into memory and whether the kernel should
    bump out other data (pieces of your UI) to make it fit.

    Scenario #1:  If the kernel fits the whole 900MB data set into memory,
    the entire rest of the system would have to compete for the remaining
    100MB of memory.  Your UI would suck rocks.

    Scenario #2: If the kernel fits 700MB of the data set into memory, and
    the rest of the system (your UI, etc) is only using 300MB, and the kernel
    is using MADV_DONTNEED on pages it has already scanned, now your UI
    works fine but your data set processing program is continuously 
    accessing the disk for all 900MB of data, on every pass, because the
    kernel is always only keeping the most recently accessed 700MB of
    the 900MB data set in memory.

    Scenario #3: Now let's say the kernel decides to keep just the first
    700MB of the data set in memory, and not try to cache the last 200MB
    of the data set.  Now your UI works fine, and your processing program
    runs FOUR TIMES FASTER because it only has to access the disk for
    the last 200MB of the 900MB data set.


    Now, which of these scenarios does madvise() cover?  Does it cover
    scenario #1?  Well, no.  The madvise() call that the program makes has
    no clue whether you intend to play around with your UI every few minutes,
    or whether you intend to leave the room for 40 minutes.  If the kernel
    guesses wrong, we wind up with one unhappy user.  

    What about scenario #2?  There the program decided to call madvise(),
    and the system dutifully reuses the pages, and you come back an hour
    later and your data processing program has only done 10 passes out
    of the 50 passes it needs to do on the data and you are PISSED.

    Ok.  What about scenario #3?  Oops.  The program has no way of knowing
    how much memory you need for your UI to be 'happy'.  No madvise() call
    of any sort will make you happy.  Not only that, but the KERNEL has no
    way of knowing that your data processing program intends to make
    multiple passes on the data set, whether the working set is represented
    by one file or several files, and even the data processing program itself
    might not know (you might be running a script which runs a separate
    program for each pass on the same data set).

    So much for madvise().

    So, no matter what, there will ALWAYS be an unhappy user somewhere.  Let's
    take Mikhail's grep test as an example.  If he runs it over and over
    again, should the kernel be 'optimized' to realize that the same data
    set is being scanned sequentially, over and over again, ignore the
    localized sequential nature of the data accesses, and just keep a
    dedicated portion of that data set in memory to reduce long term
    disk access?  Should it keep the first 1.5GB, or the last 1.5GB,
    or perhaps it should slice the data set up and keep every other 256MB
    block?  How does it figure out what to cache and when?  What if the
    program suddenly starts accessing the data in a cacheable way?

    Maybe it should randomly throw some of the data away slowly in the hopes
    of 'adapting' to the access pattern, which would also require that it
    throw away most of the 'recently read' data far more quickly to make
    up for the data it isn't throwing away.  Believe it or not, that
    actually works for certain types of problems, except then you get hung
    up in a situation where two subsystems are competing with each other
    for memory resources (like mail server versus web server), and the
    system is unable to cope as the relative load factors for the competing
    subsystems change.  The problem becomes really complex really fast.

    This sort of problem is easy to consider in human terms, but virtually
    impossible to program into a computer with a heuristic or even with 
    specific madvise() calls.  The computer simply does not know what the
    human operator expects from one moment to the next.

    The problem Mikhail is facing is one where his human assumptions do not
    match the assumptions the kernel is making on data retention, assumed
    system load, and the many other factors that the kernel uses to decide
    what to keep and what to throw away, and when.


    Now, aside from the potential read-ahead issue, which could be a real
    issue for FreeBSD (but not one really worthy of insulting someone over),
    there is literally no way for a kernel programmer to engineer the
    'perfect' set of optimizations for a system.  There are a huge
    number of pits you can fall into if you try to over-optimize
    a system.  Each optimization adds that much more complexity to an already
    complex system, and has that much greater a chance to introduce yet 
    another hard-to-find bug.

    Nearly all operating systems that I know of tend to presume a certain
    degree of locality of reference for mmap()'d pages.  It just so happens
    that Mikhail's test has no locality of reference.  But 99.9% of the
    programs ever run on a BSD system WILL, so which should the kernel
    programmer spend all his time coding optimizations for?  The 99.9% of
    the time or the 0.1% of the time?
