weird bugs with mmap-ing via NFS

Matthew Dillon dillon at
Tue Mar 21 22:48:37 UTC 2006

:	[Moved from -current to -stable]
:On Tuesday, 21 March 2006 16:23, Matthew Dillon wrote:
:>     You might be doing just writes to the mmap()'d memory, but the system
:>     doesn't know that.
:Actually, it does. The program tells it that I don't care to read what's
:currently there, by specifying the PROT_WRITE flag only.

    That's an architectural flag.  Very few architectures actually support
    write-only memory maps.  IA32 does not.  It does not change the
    fact that the operating system must validate the memory underlying
    the page, nor does it imply that the system shouldn't.

:Sounds like a missed optimization opportunity :-(

    Even on architectures that did support write-only memory maps, the
    system would still have to fault in the rest of the data on the page,
    because the system would have no way of knowing which bytes in the 
    page you wrote to (that is, whether you wrote to all the bytes in the
    page or whether you left gaps).  The system does not take a fault for
    every write you issue to the page, only for the first one.  So, no 
    matter how you twist it, the system *MUST* validate the entire page
    when it takes the page fault.

:>     It kinda sounds like the buffer cache is getting blown out, but not
:>     having seen the program I can't really analyze it.

    I can't access this URL, it says 'not found'.

:>     It will always be more efficient to write to a file using write() than
:>     using mmap()
:I understand that write() is much better optimized at the moment, but the
:mmap interface carries some advantages, which may allow future OSes to
:optimize in new ways. The application can hint at its planned usage of the
:data via madvise(), for example.

    Yes, but those advantages are limited by the way memory mapping hardware
    works.  Some things simply cannot be optimized, for lack of sufficient
    information.

    Reading via mmap() is very well optimized.  Making modifications via
    mmap() is optimized under the expectation that the data will be read,
    modified, and written back.  It is not possible to optimize for the
    expectation that the data will only be written through the mmap, for
    the reasons described above.  The hardware simply does not provide
    sufficient information to the operating system to optimize the
    write-only case.

:Unfortunately, my problem, so far, is with it not writing _at all_...

    Not sure what is going on since I can't access the program yet, but
    I'd be happy to take a look at the code.

    The most common mistake people make when trying to write to a file via
    mmap() is that they forget to ftruncate() the file to the proper length
    first.  Mapped memory beyond the file's EOF is ignored within the last
    page, and the program will take a page fault if it tries to write to
    mapped pages that lie entirely beyond the file's current EOF.  Writing
    to mapped memory does *not* extend the size of a file.  Only 
    ftruncate() or write() can extend the size of a file.

    The second most common mistake is to forget to specify MAP_SHARED
    in the mmap() call.

:Yes, this is an example of how a well-implemented mmap can be better than
:write. Without explicit writes by the application and without doubling the 
:memory requirements, the data can be written in the most optimal way.
:Thanks for your help. Yours,
:	-mi

    I don't think mmap()-based writing will EVER be more efficient than
    write() except in the case where the entire data set fits into memory
    and has been entirely cached by the system.  In that one case writing via
    mmap will be faster.  In all other cases the system will be taking as
    many VM faults on the pages as it would be taking system call faults
    to execute the write()'s.

    You are making a classic mistake by assuming that the copying overhead
    of a write() into the file's backing store, versus directly mmap()ing
    the file's backing store, represents a large chunk of the overhead for
    the operation.  In fact, the copying overhead represents only a small
    chunk of the related overhead.  The vast majority of the overhead is
    always going to be the disk I/O itself.

    I/O must occur even in the cached/delayed-write case, so on a busy system
    it still represents the greatest overhead from the point of view of
    system load.  On a lightly loaded system nobody is going to care about
    a few milliseconds of improved performance here and there since, by 
    definition, the system is lightly loaded and thus has plenty of idle
    cpu and I/O cycles to spare.
					Matthew Dillon 
					<dillon at>

More information about the freebsd-stable mailing list