read vs. mmap (or io vs. page faults)
Matthew Dillon
dillon at apollo.backplane.com
Tue Jun 22 12:09:23 PDT 2004
(current@ removed, but I'm leaving this on questions@ since it contains
some useful information).
:This is, sort of, self-perpetuating -- as long as mmap is slower/less
:reliable, applications will be hesitant to use it, thus there will be
:little incentive to improve it. :-(
Well, again, this is an incorrect perception. Your use of mmap() to
process huge linear data sets is not what mmap() is best at doing, on
*any* operating system, and not what people use mmap() for most of the
time. There are major hardware-related overheads to the use of mmap(),
on *ANY* operating system, that cannot be circumvented. You have no
choice but to allocate the pages for a page table and populate them
with pte's, you must invalidate the pages in the tlb whenever you modify
a page table entry (e.g. the invlpg instruction on IA32, which on a P2 is
extremely expensive), and if you are processing huge data sets you also
have to remove the page table entry from the page table when the
underlying data page is reused because the data set is larger than
main memory. There are overheads related to each of these issues,
overheads related to the algorithms the operating system *MUST* use to
figure out which pages to remove (on the fly) when the data set does
not fit in main memory, and overheads related to the heuristics
the operating system employs to try to predict the memory usage pattern
and perform some read-ahead.
These are hardware and software issues that cannot simply be wished away.
No matter how much you want the concept of memory mapping to be 'free',
it isn't. Memory mapping and management are complex operations for
any operating system, always have been, and always will be.
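To make those costs concrete, here is a minimal C sketch (my own
illustration, not code from this thread) that sums a file's bytes through
a read-only mapping; the first touch of each page is where the fault,
pte installation, and TLB traffic described above get paid:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum every byte of a file via mmap().  Returns the sum, or -1 on
 * error.  Each 4K page is faulted in, and a pte installed, the first
 * time the loop touches it. */
long mmap_sum(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    const unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
                                  MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];            /* first touch of each page faults it in */
    munmap((void *)p, st.st_size);
    close(fd);
    return sum;
}
```

For a dataset larger than main memory, the kernel also has to tear those
pte's back down behind the loop, which is exactly the extra work the
read() path never incurs.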
:I'd rather call attention to my slower -- CPU-bound boxes. On them, the
:total CPU time spent computing md5 of a file is less with mmap -- by a
:noticeable margin. But because the CPU is underutilized, the elapsed "wall
:clock" time is higher.
:
:As far as the cache-using statistics, having to do a cache-cache copy
:doubles the cache used, stealing it from other processes/kernel tasks.
But it is also not relevant for this case because the L2 cache is
typically much larger (128K-2MB) than the 8-32K you might use for
your local buffer. What you are complaining about here is going
to wind up being mere microseconds over a multi-minute run.
It's really important, and I can't stress this enough, to not simply
assume what the performance impact of a particular operation will be
by the way it feels to you. Your assumptions are all skewed... you
are assuming that copying is always bad (it isn't), that copying is
always horrendously expensive (it isn't), that memory mapping is always
cheap (it isn't cheap), and that a small bit of cache pollution will have
a huge penalty in time (it doesn't necessarily, certainly not for a
reasonably sized user buffer).
I've already told you how to measure these things. Do me a favor and just
run this dd on all of your FreeBSD boxes:
dd if=/dev/zero of=/dev/null bs=32k count=8192
The resulting bytes/sec that it reports is a good guesstimate of the
cost of a memory copy (the actual copy rate will be faster since the
times include the read and write system calls, but it's still a reasonable
basis). So in the case of my absolute fastest machine
(an AMD64 3200+ tweaked up a bit):
268435456 bytes transferred in 0.058354 secs (4600128729 bytes/sec)
That means, basically, that it costs 1 second of cpu to copy 4.6 GBytes
of data. On my slowest box, a C3 VIA Samuel 2 cpu (roughly equivalent
to a P2/400MHz):
268435456 bytes transferred in 0.394222 secs (680924559 bytes/sec)
So the cost is 1 second to copy 680 MBytes of data on my slowest box.
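If you would rather measure the copy cost in-process than through dd,
here is a rough C sketch (mine, with illustrative sizes and a made-up
function name) that times memcpy() of a dd-style 32k buffer:

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time repeated memcpy() of a bufsize-byte buffer and return the copy
 * rate in bytes/sec.  This approximates what the dd test measures,
 * minus the read/write system call overhead. */
double copy_rate(size_t bufsize, int iterations)
{
    char *src = malloc(bufsize);
    char *dst = malloc(bufsize);
    if (src == NULL || dst == NULL)
        return 0.0;
    memset(src, 0, bufsize);            /* fault the pages in up front */

    clock_t t0 = clock();
    for (int i = 0; i < iterations; i++)
        memcpy(dst, src, bufsize);      /* the L2-resident copy */
    clock_t t1 = clock();

    free(src);
    free(dst);
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (double)bufsize * iterations / secs : 0.0;
}
```

Calling copy_rate(32 * 1024, 8192) moves the same 256 MBytes as the dd
line above, with both buffers staying hot in the L2 cache.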
:Here, again, is from my first comparison on the P2 400MHz:
:
: stdio: 56.837u 34.115s 2:06.61 71.8% 66+193k 11253+0io 3pf+0w
: mmap: 72.463u 7.534s 2:34.62 51.7% 5+186k 105+0io 22328pf+0w
Well, the cpu utilization is only 71.8% for the read case, so the box
is obviously I/O bound already.
The real question you should be asking is not why mmap is only using
51.7% of the cpu, but why stdio is only using 71.8% of the cpu. If
you want to make your processing program more efficient, 'fix' stdio
first. You need to:
(1) Figure out the rate at which your processing program reads data in
the best case. You can do this by timing it on a data set that fits
in memory (so no disk I/O is done). Note that it might be bursty,
so the average rate alone does not precisely quantify the amount of
buffering that will be needed.
(2) If your hard drive is faster than the data rate, then determine if
the overhead of doing double-buffering is worth keeping the
processing program populated with data on demand. The overhead
of doing double buffering is something akin to:
dd if=file bs=1m | dd bs=32k > /dev/null
(3) Figure out how much buffering is required to keep the processing
program supplied with data (achieving either 100% cpu utilization or
100% I/O utilization).
#!/bin/csh
#
# Vary the two dd buffer sizes (bs=1m and bs=32k here) to try
# to achieve 100% cpu utilization or 100% I/O utilization on
# the drive.
dd if=file bs=1m | dd bs=32k | your_processing_program

time ./scriptfile
(4) If this overhead is small enough (less than the 37% of available cpu
you have in the stdio case), then you can use it to front-end your
processing script and achieve an improvement, despite the extra
copying that it does.
(Again, in my last email I gave you the 'dd' lines that you can use
to determine exactly what the copying overhead for a dataset would be,
and gave you live examples showing that, usually, it's quite small
compared to the total run time of a typical processing program).
Don't just assume that copying is bad, or that extra stages are bad,
because the reality is that they might not be in an I/O bound situation.
You have to measure the actual overhead to see what the actual cost is.
My backup script uses dd to double buffer for precisely this reason,
though in my case I do it because 'dump' output is quite bursty and
sometimes it blocks waiting for gzip when, really, it shouldn't have to.
Here is a section out of my backup script:
ssh $host -l operator $sshopts "dump ${level}auCbf 32 64 - $i" | \
dd obs=1m | dd obs=1m | gzip -6 > $file.tmp
I would never, ever expect the operating system to buffer that much
data ahead of a program, nor should the OS do that, so I do it myself.
The cost is a pittance. I waste 1% of the cpu in order to gain about
18% in real time by allowing dump to more fully utilize the disk it is
dumping.
:Or is P2 400MHz not modern? May be, but the very modern Sparcs, on which
:FreeBSD intends to run are not much faster.
A 400 MHz P2 is 1/3 as fast as the LOWEST END AMD XP cpu you can buy
today, and 5-10 times slower than higher-end Intel and AMD cpus.
I would say that that makes it 'not modern'.
We aren't talking 15% here. We are talking 300%-1000%.
:= The mmap interface is not supposed to be more efficient, per se.
:= Why would it be?
:
:Puzzling question. Because the kernel is supplied with more information
:-- it knows, that I only plan to _read_ from the memory (PROT_READ),
:the total size of what I plan to read (mmap's len, optionally,
:madvise's len), and (optionally), that I plan to read sequentially
:(MADV_SEQUENTIAL).
Well, this is not correct. The kernel has just as much information
when you use read().
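As an aside on that point: the read() path can be handed the same
sequential-access hint that madvise() gives a mapping, via
posix_fadvise() on systems that provide it (POSIX.1-2001 defines it;
availability varies by OS and era). A sketch of the shape, entirely my
own illustration with a made-up function name:

```c
#include <fcntl.h>
#include <unistd.h>

/* Read a whole file sequentially through a small reusable buffer,
 * hinting the kernel first.  Returns total bytes read, or -1. */
long read_sequential(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
#ifdef POSIX_FADV_SEQUENTIAL
    /* The read()-side analogue of MADV_SEQUENTIAL: expect linear
     * access, so the kernel may read ahead more aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
    char buf[32 * 1024];        /* small buffer that stays hot in L2 */
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

The kernel also sees the sequential pattern on its own from the stream
of read() offsets, which is Matt's point: read() gives it just as much
to work with.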
Furthermore, you are making the assumption that the kernel should
read-ahead an arbitrary amount of data. It could very well be that
the burstiness of your processing program requires a megabyte or more
worth of read-ahead to keep the cpu saturated.
The kernel will never do this, because dedicating that much memory to
a single I/O stream is virtually guaranteed to be detrimental to the
rest of the system (everything else running on the system).
The kernel will not do this, but you certainly can, either by
double-buffering the stream or by following Julian's excellent suggestion
to fork() a helper thread to read that far ahead.
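A hedged sketch of that helper idea (the details here are mine, not
Julian's actual code): fork a child that reads ahead through a large
buffer and feeds a pipe, so the pipe plus the child's buffer become the
extra buffering the kernel will not dedicate on its own.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Return a read-side fd whose data is produced by a helper process
 * that reads 'path' ahead of the consumer.  Returns -1 on error.
 * (Sketch only: the child is not reaped here.) */
int readahead_fd(const char *path)
{
    int pfd[2];
    if (pipe(pfd) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                      /* child: the read-ahead helper */
        close(pfd[0]);
        int fd = open(path, O_RDONLY);
        char *buf = malloc(1024 * 1024); /* 1MB read-ahead buffer */
        if (fd >= 0 && buf != NULL) {
            ssize_t n;
            while ((n = read(fd, buf, 1024 * 1024)) > 0) {
                ssize_t off = 0;
                while (off < n) {        /* pipe writes may be partial */
                    ssize_t w = write(pfd[1], buf + off, n - off);
                    if (w <= 0)
                        _exit(1);
                    off += w;
                }
            }
        }
        _exit(0);
    }
    close(pfd[1]);                       /* parent consumes from pfd[0] */
    return pfd[0];
}
```

The consumer just read()s from the returned descriptor; the helper stays
a megabyte ahead of it, exactly like the dd front-end in the script above
but without the extra processes.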
:Mmap also needs no CPU data-cache to read. If the device is capable of
:writing to memory directly (DMA?), the CPU does not need to be involved
:at all, while with read the data still has to go from the DMA-filled
:kernel buffer to the application buffer -- there being two copies of it
:in cache instead of none for just storing or one copy for processing.
In most cases the CPU is not involved at all when you mmap() data until
you access it via the mmap(). However, that does not mean that the memory
subsystem is not involved. The CPU must still load the data you access
into the L1/L2 caches from main memory when you access it, so the memory
overhead is still there and still (typically) 5 times greater than the
additional memory overhead required to do a buffer copy in the read()
case. When you add in the overhead of processing the data, which is
typically 10-50 times the cost of reading it in the first place, then
the 'waste' from the extra buffer copy winds up being in the noise.
So, as I said in my previous email, it comes down to how much it costs
to do a local copy within the L2 cache (the read() case), versus how
much extra overhead is involved in the mmap case. And, as I stated
previously, L1 and L2 cache bandwidth is so high these days that it
really doesn't take all that much overhead to match (and then exceed)
the time it takes to do the local copy.
:Also, in case of RAM shortage, mmap-ed pages can be just dropped, while
:the too large buffer needs to be written into swap.
Huh? No, that isn't true. Your too-large buffer might still only be
a megabyte, whereas your mmap()'d data might be a gigabyte. Since you
are utilizing the buffer over and over again its pages are NOT likely
to ever be written to swap.
:And mmap requires no application buffers -- win, win, and win. Is there
:an inherent "lose" somewhere, I don't see? Like:
Again, you aren't listening to what I said about how the L1/L2 cache
works. You really have to listen. APPLICATION BUFFERS WHICH EASILY
FIT IN THE L2 CACHE COST VIRTUALLY NOTHING ON A MODERN CPU! I even
gave you a 'dd' test you could perform on FreeBSD to measure the cost.
It is almost impossible to beat 'virtually nothing'.
:A database, that returns results 15%, nay, even 5% faster is also a
:better database.
:...
:What are we arguing about? Who wouldn't take a 2.2GHz processor over a
:2GHz one -- other things being equal -- and they are?
:..
: -mi
Which is part of the problem. You are not taking into account cost
considerations when you say that. You are paying a premium to buy
a cpu that is only 15% faster. If it were free, or cost a pittance,
I would take the 2.2GHz cpu. But it isn't free, and for a high-end cpu
15% can be $400 (or more), which is why it generally isn't worth it for
a mere 15%. The money can be spent on other things that are just
as important: memory, another disk (double your disk throughput), a
GigE network card, or even a whole new machine, so you now have two
slightly slower machines (200%) rather than one slightly faster machine
(115%).
-Matt
Matthew Dillon
<dillon at backplane.com>