kern/165927: msync reports success after a failed pager flush

Sun Mar 11 11:20:11 UTC 2012

>Number:         165927
>Category:       kern
>Synopsis:       msync reports success after a failed pager flush
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Mar 11 11:20:10 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator:     Joel Ray Holveck <joelh at juniper.net>
>Release:        FreeBSD 8.3-PRERELEASE i386
>Organization:
Juniper Networks, Inc.
>Environment:
System: FreeBSD thor.piquan.org 8.3-PRERELEASE FreeBSD 8.3-PRERELEASE #2: Sat Feb 25 15:52:16 PST 2012 root at thor.piquan.org:/usr/obj/usr/src/sys/THOR i386

>Description:
When a process is writing to an mmap-backed file, under certain common
circumstances, changes to data might not be properly flushed.
Nevertheless, msync may report success.

The bug is most easily demonstrated using NFS, so much of this
description refers to NFS-based errors.  However, these are only
examples; the bug can apply to many other filesystems as well.

If a process has an NFS-backed file mmapped in and dirties the data,
there are several common circumstances under which it might not be
properly flushed.  The bug in kern/165923 is one situation, in which
the backing file is written with the wrong uid, leading to a return of
NFSERR_ACCES.  Another client might delete the file, making the server
return NFSERR_STALE.

Formerly (e.g., in 8.2-RELEASE), this would cause the client's VM
subsystem to go into an infinite loop: the client would attempt to
flush to the server, the server would return an error, the client
would leave the pages on the dirty list but still need to flush them,
and so on ad infinitum.

In r223054 (on stable/8; MFC r222586), this behavior was changed: the
VM system marks the pages as clean to avoid this type of loop.
However, this comes with its own set of problems.

As an example, consider the case where a process is gathering data
into an mmap-backed datastore.  The process gathers some data into the
datastore.  While this is happening, another client changes the
ownership or mode of the file.  Next, the syncer daemon attempts to
flush the datastore; the flush fails, so the pages are marked as
clean.  The data-gathering process later runs msync, and since the
pages are "clean" (according to the client's VM system), msync returns
success.  However, the data has never been written to disk.
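
For illustration, the write path of such a data-gathering process
might look roughly like the sketch below.  This is not the program
from kern/165923; the helper name store_record and its interface are
hypothetical.  The point is that the return value of msync is the only
place where the process can learn that a flush failed, so a
successful return after a failed pager flush leaves it with no
indication of data loss.

```c
#define _POSIX_C_SOURCE 200809L	/* expose ftruncate() under strict -std= modes */

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration): copy len bytes
 * into a shared file mapping and ask msync() to write them out
 * synchronously.  Returns 0 if every call reported success, -1
 * otherwise (errno is preserved from the failing call).
 */
int
store_record(const char *path, const void *data, size_t len)
{
	int fd, rc = -1, saved;
	void *map;

	assert(path != NULL && data != NULL && len > 0);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd == -1)
		return (-1);
	if (ftruncate(fd, (off_t)len) == -1)
		goto out;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	memcpy(map, data, len);		/* dirties the mapped pages */

	/*
	 * The process relies on this return value.  If the pages were
	 * already marked clean after a failed background flush, msync()
	 * has nothing to write and reports success.
	 */
	if (msync(map, len, MS_SYNC) == 0)
		rc = 0;

	saved = errno;
	(void)munmap(map, len);
	errno = saved;
out:
	saved = errno;
	(void)close(fd);
	errno = saved;
	return (rc);
}
```

A caller would treat a return of 0 as "the record is on stable
storage", which is exactly the assumption that this bug breaks.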

>How-To-Repeat:
See the program in the "How-To-Repeat" section of kern/165923.

If kern/165923 has not yet been fixed, then that program will
demonstrate the bug by itself using the instructions in that PR: note
that the pages are not written, but msync returns success.

Alternatively, the program can still demonstrate the bug, but with more
effort.  Make sure that both WAIT_FOR_SYNC and DO_MMAP are turned on.
While the client program sleeps during the WAIT_FOR_SYNC interval, run
"chflags uchg backing-store" on the server.  (A chmod won't be sufficient
on a FreeBSD 8.2 server, but might be on others.)  Be quick; you have
to do this before the client's syncer flushes the file, which will
happen within 0-30 seconds.  (If kern/165923 has not been fixed, then
you don't have to hurry; the syncer can't save the file.)  Either wait
for the sleep to return, or press ^C (which will stop the sleep and
continue with the call to msync).

Observe (using "od -X" or similar) that the file's contents have not
changed, even though the msync succeeded.

This indicates that msync(2) is a necessary, but NOT sufficient, way
for a process to verify that mapped files are flushed.  That it is
necessary is contrary to the documentation in the msync(2) and
mmap(2) man pages.  That it is not sufficient, which is the subject of
the present PR, is contrary to POSIX's assertion that msync may be used
"for synchronized I/O data integrity completion" (and to more explicit
verbiage in the informative sections; cf.
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/msync.html>).
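
The manual check of the file's contents described above can also be
scripted.  The following is a minimal sketch, assuming a hypothetical
helper named file_matches that compares the first len bytes of a file
against the data the client believes it wrote.  It reads through
pread(2) rather than a mapping, and it is probably best run on the
server (or on another host), since a read on the writing client may be
satisfied from that client's own cached pages.

```c
#define _POSIX_C_SOURCE 200809L	/* expose pread() under strict -std= modes */

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration): compare the
 * first len bytes of the file at path with expected.  Returns 1 if
 * they match, 0 if they differ or the file is shorter than len, and
 * -1 if the file cannot be opened or read.
 */
int
file_matches(const char *path, const void *expected, size_t len)
{
	const unsigned char *want = expected;
	unsigned char buf[4096];
	size_t off = 0;
	int fd, result = 1;

	assert(path != NULL && expected != NULL);

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);

	while (off < len) {
		size_t chunk = len - off < sizeof(buf) ? len - off : sizeof(buf);
		ssize_t n = pread(fd, buf, chunk, (off_t)off);

		if (n == -1) {
			result = -1;	/* read error */
			break;
		}
		if (n == 0 || memcmp(buf, want + off, (size_t)n) != 0) {
			result = 0;	/* short file or differing bytes */
			break;
		}
		off += (size_t)n;
	}

	(void)close(fd);
	return (result);
}
```

With this bug present, one would expect file_matches to return 0 for
the backing store even though the client's msync returned success.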

>Fix:
The VM system currently (as of r222586) marks pages that cannot be
written as clean.  Instead, the VM object should be made unavailable
(unmapped, set VM_PROT_NONE, or similar), so that later memory
accesses raise a SIGSEGV and msyncs return EINVAL (or ENOMEM according
to POSIX.1-2008).  While this means that the program will almost
certainly exit with an error, that is appropriate, since its write did
fail.  (This is also similar to what happens if a swap drive fails.)

This bug is most visible in conjunction with kern/165923, since that
bug causes the sort of failure that triggers the bug currently under
discussion.  However, they are independent.  The failure can also be
triggered without kern/165923, as described in How-To-Repeat, and an
analogous situation can arise with NFSERR_STALE.
>Release-Note:
>Audit-Trail:
>Unformatted: