read vs. mmap (or io vs. page faults)

Matthew Dillon dillon at apollo.backplane.com
Tue Jun 22 12:09:23 PDT 2004


    (current removed, but I'm leaving this on questions@ since it contains
    some useful information).

:This is, sort of, self-perpetuating -- as long as mmap is slower/less
:reliable, applications will be hesitant to use it, thus there will be
:little incentive to improve it. :-(

    Well, again, this is an incorrect perception.  Your use of mmap() to
    process huge linear data sets is not what mmap() is best at doing, on
    *any* operating system, and not what people use mmap() for most of the
    time.  There are major hardware-related overheads to the use of mmap(),
    on *ANY* operating system, that cannot be circumvented.  You have no
    choice but to allocate the pages for a page table and populate them
    with pte's, you must invalidate the affected tlb entries whenever you
    modify a page table entry (e.g. the invlpg instruction on IA32, which
    on a P2 is extremely expensive), and if you are processing huge data
    sets you also have to remove each page table entry again when the
    underlying data page is reused due to the data set being larger than
    main memory.  There are overheads related to each of these issues,
    overheads related to the algorithms the operating system *MUST* use
    to figure out which pages to remove (on the fly) when the data set
    does not fit in main memory, and overheads related to the heuristics
    the operating system employs to try to predict the memory usage
    pattern and perform some read-ahead.

    These are hardware and software issues that cannot simply be wished away. 
    No matter how much you want the concept of memory mapping to be 'free',
    it isn't.  Memory mapping and management are complex operations for
    any operating system, always have been, and always will be.
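
    In code, the usage pattern being discussed is something akin to the
    following sketch (illustrative only -- the xor loop just stands in
    for whatever processing you do):

	#include <sys/types.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int
	main(int argc, char **argv)
	{
	    struct stat st;
	    unsigned char *base;
	    unsigned char sum = 0;
	    size_t i;
	    int fd;

	    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: mmapscan file\n");
		exit(1);
	    }
	    if (fstat(fd, &st) < 0 || st.st_size == 0)
		exit(1);

	    /*
	     * The mmap() itself is cheap.  The cost is paid below, one
	     * page fault at a time: pte's are created as pages are
	     * first touched, the tlb is updated, and for a file larger
	     * than main memory the pte's must be torn down again as the
	     * underlying pages are recycled.
	     */
	    base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED,
		fd, 0);
	    if (base == MAP_FAILED)
		exit(1);
	    madvise(base, (size_t)st.st_size, MADV_SEQUENTIAL);

	    for (i = 0; i < (size_t)st.st_size; ++i)
		sum ^= base[i];		/* first touch of a page faults */

	    printf("%02x\n", (unsigned)sum);
	    munmap(base, (size_t)st.st_size);
	    close(fd);
	    return (0);
	}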

:I'd rather call attention to my slower -- CPU-bound boxes. On them, the
:total CPU time spent computing md5 of a file is less with mmap -- by a
:noticeable margin. But because the CPU is underutilized, the elapsed "wall
:clock" time is higher.
:
:As far as the cache-using statistics, having to do a cache-cache copy
:doubles the cache used, stealing it from other processes/kernel tasks.

    But it is also not relevant for this case, because the L2 cache is
    typically much larger (128K-2MB) than the 8-32K you might use for
    your local buffer.  What you are complaining about here is going
    to wind up being mere microseconds over a multi-minute run.

    It's really important, and I can't stress this enough, not to simply
    assume what the performance impact of a particular operation will be
    based on how it feels to you.  Your assumptions are all skewed... you
    are assuming that copying is always bad (it isn't), that copying is
    always horrendously expensive (it isn't), that memory mapping is always
    cheap (it isn't cheap), and that a small bit of cache pollution will
    have a huge penalty in time (it doesn't necessarily, and certainly not
    for a reasonably sized user buffer).

    I've already told you how to measure these things.  Do me a favor and just
    run this dd on all of your FreeBSD boxes:

    dd if=/dev/zero of=/dev/null bs=32k count=8192

    The resulting bytes/sec that it reports (for 8192 x 32k = 256 MBytes
    copied) is a good guesstimate of the cost of a memory copy (the actual
    copy rate will be faster since the times include the read and write
    system calls, but it's still a reasonable basis).  So in the case of
    my absolute fastest machine (an AMD64 3200+ tweaked up a bit):

    268435456 bytes transferred in 0.058354 secs (4600128729 bytes/sec)

    That means, basically, that it costs 1 second of cpu to copy 4.6 GBytes
    of data.  On my slowest box, a C3 VIA Samuel 2 cpu (roughly equivalent
    to a P2/400MHz):

    268435456 bytes transferred in 0.394222 secs (680924559 bytes/sec)

    So the cost is 1 second to copy 680 MBytes of data on my slowest box.
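
    (To connect that back to the cache discussion above: even on the slow
    box, refilling a 32k user buffer at 680 MBytes/sec costs about 48
    microseconds per copy, which is why the extra copy disappears into a
    multi-minute run.)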

:Here, again, is from my first comparison on the P2 400MHz:
:
:	stdio: 56.837u 34.115s 2:06.61 71.8%   66+193k 11253+0io 3pf+0w
:	mmap:  72.463u  7.534s 2:34.62 51.7%   5+186k  105+0io   22328pf+0w

    Well, the cpu utilization is only 71.8% for the read case, so the box
    is obviously I/O bound already.

    The real question you should be asking is not why mmap is only using
    51.7% of the cpu, but why stdio is only using 71.8% of the cpu.  If
    you want to make your processing program more efficient, 'fix' stdio
    first.  You need to:

    (1) Figure out the rate at which your processing program reads data in
	the best case.  You can do this by timing it on a data set that fits
	in memory (so no disk I/O is done).  Note that it might be bursty,
	so the average rate alone does not precisely quantify the amount of
	buffering that will be needed.

    (2) If your hard drive is faster than the data rate, then determine
	whether the overhead of doing double-buffering is worth it to keep
	the processing program populated with data on demand.  The
	overhead of doing double buffering is something akin to:

	dd if=file bs=1m | dd bs=32k > /dev/null

    (3) Figure out how much buffering is required to keep the processing
	program supplied with data (achieving either 100% cpu utilization or
	100% I/O utilization).

	#!/bin/csh
	#
	dd if=file bs=1m | dd bs=32k | your_processing_program

	           ^^^^^      ^^^^^^   try different buffer sizes to try
	                               to achieve 100% cpu utilization or
	                               100% I/O utilization on the drive.

	time ./scriptfile

    (4) If this overhead is small enough (less than the ~28% of cpu left
	idle in the stdio case), then you can use it to front-end your
	processing script and achieve an improvement, despite the extra
	copying that it does.

	(Again, in my last email I gave you the 'dd' lines that you can use
	to determine exactly what the copying overhead for a dataset would be,
	and gave you live examples showing that, usually, it's quite small
	compared to the total run time of a typical processing program).

    Don't just assume that copying is bad, or that extra stages are bad, 
    because the reality is that they might not be in an I/O bound situation.
    You have to measure the actual overhead to see what the actual cost is.

    My backup script uses dd to double buffer for precisely this reason,
    though in my case I do it because 'dump' output is quite bursty and
    sometimes it blocks waiting for gzip when, really, it shouldn't have to.
    Here is a section out of my backup script:

	ssh $host -l operator $sshopts "dump ${level}auCbf 32 64 - $i" | \
		dd obs=1m | dd obs=1m | gzip -6 > $file.tmp

    I would never, ever expect the operating system to buffer that much 
    data ahead of a program, nor should the OS do that, so I do it myself.
    The cost is a pittance.  I waste 1% of the cpu in order to gain about
    18% in real time by allowing dump to more fully utilize the disk it
    is dumping from.

:Or is P2 400MHz not modern? Maybe, but the very modern Sparcs, on which
:FreeBSD intends to run, are not much faster.

    A 400 MHz P2 is 1/3 as fast as the LOWEST END AMD XP cpu you can buy
    today, and 5-10 times slower than higher-end Intel and AMD cpus.
    I would say that makes it 'not modern'.

    We aren't talking 15% here.  We are talking 300%-1000%.

:=     The mmap interface is not supposed to be more efficient, per se.
:=     Why would it be?
:
:Puzzling question. Because the kernel is supplied with more information
:-- it knows that I only plan to _read_ from the memory (PROT_READ),
:the total size of what I plan to read (mmap's len, optionally,
:madvise's len), and (optionally) that I plan to read sequentially
:(MADV_SEQUENTIAL).

    Well, this is not correct.  The kernel has just as much information
    when you use read().

    Furthermore, you are making the assumption that the kernel should
    read ahead an arbitrary amount of data.  It could very well be that
    the burstiness of your processing program requires a megabyte or more
    worth of read-ahead to keep the cpu saturated.

    The kernel will never do this, because dedicating that much memory to
    a single I/O stream is virtually guaranteed to be detrimental to the
    rest of the system (everything else running on the system). 

    The kernel will not do this, but you certainly can, either by
    double-buffering the stream or by following Julian's excellent suggestion
    to fork() a helper thread to read that far ahead.
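
    In code, that helper approach is something akin to the following
    sketch (the open_readahead() name and the 1MB read-ahead figure are
    mine and arbitrary -- tune them the same way you would tune the dd
    buffer sizes above):

	#include <sys/types.h>
	#include <sys/wait.h>
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define READAHEAD	(1024 * 1024)	/* how far to run ahead */

	/*
	 * Open 'path' and return a descriptor yielding the same data,
	 * but with a helper process doing the disk reads up to
	 * READAHEAD bytes (plus the pipe's own buffering) in front of
	 * the consumer -- the same job the dd stages do above.
	 */
	static int
	open_readahead(const char *path)
	{
	    static char buf[READAHEAD];
	    int pfd[2];
	    int fd;
	    ssize_t n;
	    pid_t pid;

	    if ((fd = open(path, O_RDONLY)) < 0 || pipe(pfd) < 0)
		return (-1);
	    if ((pid = fork()) < 0)
		return (-1);
	    if (pid == 0) {
		/* child: shovel file -> pipe in big chunks */
		close(pfd[0]);
		while ((n = read(fd, buf, sizeof(buf))) > 0) {
		    if (write(pfd[1], buf, (size_t)n) != n)
			_exit(1);
		}
		_exit(n < 0);
	    }
	    /* parent: consume from the pipe instead of the raw file */
	    close(pfd[1]);
	    close(fd);
	    return (pfd[0]);
	}

	int
	main(int argc, char **argv)
	{
	    char chunk[32768];
	    ssize_t n;
	    int fd;

	    if (argc != 2 || (fd = open_readahead(argv[1])) < 0)
		exit(1);
	    while ((n = read(fd, chunk, sizeof(chunk))) > 0)
		;		/* process each 32k chunk here */
	    close(fd);
	    wait(NULL);
	    return (0);
	}

    Whether this wins is exactly the measurement question above: the
    helper burns some cpu doing the extra copy, and it only pays off if
    the disk would otherwise sit idle while your program computes.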

:Mmap also needs no CPU data-cache to read. If the device is capable of
:writing to memory directly (DMA?), the CPU does not need to be involved
:at all, while with read the data still has to go from the DMA-filled
:kernel buffer to the application buffer -- there being two copies of it
:in cache instead of none for just storing or one copy for processing.

    In most cases the CPU is not involved at all when you mmap() data until
    you access it via the mmap().  However, that does not mean that the memory
    subsystem is not involved.  The CPU must still load the data you access
    into the L1/L2 caches from main memory when you access it, so the memory
    overhead is still there and still (typically) 5 times greater than the
    additional memory overhead required to do a buffer copy in the read() 
    case.  When you add in the overhead of processing the data, which is 
    typically 10-50 times the cost of reading it in the first place, then
    the 'waste' from the extra buffer copy winds up being in the noise.

    So, as I said in my previous email, it comes down to how much it costs
    to do a local copy within the L2 cache (the read() case), versus how
    much extra overhead is involved in the mmap case.  And, as I stated
    previously, L1 and L2 cache bandwidth is so high these days that it
    really doesn't take all that much overhead to match (and then exceed)
    the time it takes to do the local copy.
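
    To put numbers on that: at the ~4.6 GBytes/sec copy rate measured
    above, the extra read() copy for a full gigabyte of data costs
    roughly 0.22 seconds of cpu.  If processing the data costs 10-50
    times what reading it costs, those 0.22 seconds vanish into the
    noise.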

:Also, in case of RAM shortage, mmap-ed pages can be just dropped, while
:the too large buffer needs to be written into swap.

    Huh?  No, that isn't true.  Your too-large buffer might still only be
    a megabyte, whereas your mmap()'d data might be a gigabyte.  Since you
    are utilizing the buffer over and over again, its pages are NOT likely
    to ever be written to swap.

:And mmap requires no application buffers -- win, win, and win. Is there
:an inherent "lose" somewhere I don't see? Like:

    Again, you aren't listening to what I said about how the L1/L2 cache
    works.  You really have to listen.  APPLICATION BUFFERS WHICH EASILY 
    FIT IN THE L2 CACHE COST VIRTUALLY NOTHING ON A MODERN CPU!  I even
    gave you a 'dd' test you could perform on FreeBSD to measure the cost.
    It is almost impossible to beat 'virtually nothing'.

:A database that returns results 15%, nay, even 5% faster is also a
:better database.
:...
:What are we arguing about? Who wouldn't take a 2.2GHz processor over a
:2GHz one -- other things being equal -- and they are?
:..
:	-mi

    Which is part of the problem.  You are not taking into account cost
    considerations when you say that.  You are paying a premium to buy
    a cpu that is only 15% faster.  If it were free, or cost a pittance,
    I would take the 2.2GHz cpu.  But it isn't free, and for a high-end
    cpu 15% can be $400 (or more), which is why it generally isn't worth
    it for a mere 15%.  The money can be spent on other things that are
    just as important: memory, another disk (double your disk
    throughput), a GigE network card, even a whole new machine, so you
    then have two slightly slower machines (200%) rather than one
    slightly faster machine (115%).

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>


