Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
dillon at apollo.backplane.com
Fri Mar 24 00:29:40 UTC 2006
:I thought one serious advantage to this situation for sequential read
:mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
:wait for the clock hands to reap them. On a large Solaris box I used
:to have the non-pleasure of running the VM page scan rate was high, and
:I suggested to the app vendor that proper use of mmap might reduce that
:overhead. Admittedly the files in question were much smaller than the
:available memory, but they were also not likely to be referenced again
:before the memory had to be reclaimed forcibly by the VM system.
:Is that not the case? Is it better to let the VM system reclaim pages
madvise() should theoretically have that effect, but it isn't quite
so simple a solution.
Let's say you have, oh, your workstation, with 1GB of RAM, and you
run a program which runs several passes on a 900MB data set.
Your X session, xterms, gnome, kde, etc etc etc all take around 300MB
of working memory.
Now that data set could fit into memory if portions of your UI were
pushed out of memory. The question is not only how much of that data
set should the kernel fit into memory, but which portions of that data
set should the kernel fit into memory and whether the kernel should
bump out other data (pieces of your UI) to make it fit.
Scenario #1: If the kernel fits the whole 900MB data set into memory,
the entire rest of the system would have to compete for the remaining
100MB of memory. Your UI would suck rocks.
Scenario #2: If the kernel fits 700MB of the data set into memory, and
the rest of the system (your UI, etc) is only using 300MB, and the kernel
is using MADV_DONTNEED on pages it has already scanned, now your UI
works fine but your data set processing program is continuously
accessing the disk for all 900MB of data, on every pass, because the
kernel is always only keeping the most recently accessed 700MB of
the 900MB data set in memory.
Scenario #3: Now let's say the kernel decides to keep just the first
700MB of the data set in memory, and not try to cache the last 200MB
of the data set. Now your UI works fine, and your processing program
runs FOUR TIMES FASTER because it only has to access the disk for
the last 200MB of the 900MB data set.
Now, which of these scenarios does madvise() cover? Does it cover
scenario #1? Well, no. The madvise() call that the program makes has
no clue whether you intend to play around with your UI every few minutes,
or whether you intend to leave the room for 40 minutes. If the kernel
guesses wrong, we wind up with one unhappy user.
What about scenario #2? There the program decided to call madvise(),
and the system dutifully reuses the pages, and you come back an hour
later and your data processing program has only done 10 passes out
of the 50 passes it needs to do on the data and you are PISSED.
Ok. What about scenario #3? Oops. The program has no way of knowing
how much memory you need for your UI to be 'happy'. No madvise() call
of any sort will make you happy. Not only that, but the KERNEL has no
way of knowing that your data processing program intends to make
multiple passes on the data set, whether the working set is represented
by one file or several files, and even the data processing program itself
might not know (you might be running a script which runs a separate
program for each pass on the same data set).
So much for madvise().
So, no matter what, there will ALWAYS be an unhappy user somewhere. Let's
take Mikhail's grep test as an example. If he runs it over and over
again, should the kernel be 'optimized' to realize that the same data
set is being scanned sequentially, over and over again, ignore the
localized sequential nature of the data accesses, and just keep a
dedicated portion of that data set in memory to reduce long term
disk access? Should it keep the first 1.5GB, or the last 1.5GB,
or perhaps it should slice the data set up and keep every other 256MB
block? How does it figure out what to cache and when? What if the
program suddenly starts accessing the data in a cacheable way?
Maybe it should randomly throw some of the data away slowly in the hopes
of 'adapting' to the access pattern, which would also require that it
throw away most of the 'recently read' data far more quickly to make
up for the data it isn't throwing away. Believe it or not, that
actually works for certain types of problems, except then you get hung
up in a situation where two subsystems are competing with each other
for memory resources (like mail server versus web server), and the
system is unable to cope as the relative load factors for the competing
subsystems change. The problem becomes really complex really fast.
This sort of problem is easy to consider in human terms, but virtually
impossible to program into a computer with a heuristic or even with
specific madvise() calls. The computer simply does not know what the
human operator expects from one moment to the next.
The problem Mikhail is facing is one where his human assumptions do not
match the assumptions the kernel is making on data retention, assumed
system load, and the many other factors that the kernel uses to decide
what to keep and what to throw away, and when.
Now, aside from the potential read-ahead issue, which could be a real
issue for FreeBSD (but not one really worthy of insulting someone over),
there is literally no way for a kernel programmer to engineer the
'perfect' set of optimizations for a system. There are a huge
number of pits you can fall into if you try to over-optimize
a system. Each optimization adds that much more complexity to an already
complex system, and has that much greater a chance to introduce yet
another hard-to-find bug.
Nearly all operating systems that I know of tend to presume a certain
degree of locality of reference for mmap()'d pages. It just so happens
that Mikhail's test has no locality of reference. But 99.9% of the
programs ever run on a BSD system WILL, so which should the kernel
programmer spend all his time coding optimizations for? The 99.9% of
the time or the 0.1% of the time?