kern/165923: Writing to NFS-backed mmapped files fails if flushed automatically

Sun Mar 11 11:10:10 UTC 2012

>Number:         165923
>Category:       kern
>Synopsis:       Writing to NFS-backed mmapped files fails if flushed automatically
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Mar 11 11:10:09 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator:     Joel Ray Holveck <joelh at juniper.net>
>Release:        FreeBSD 8.3-PRERELEASE i386
>Organization:
Juniper Networks, Inc.
>Environment:
System: FreeBSD thor.piquan.org 8.3-PRERELEASE FreeBSD 8.3-PRERELEASE #2: Sat Feb 25 15:52:16 PST 2012 root at thor.piquan.org:/usr/obj/usr/src/sys/THOR i386

>Description:
When a process is using mmap to write to an NFS-backed file, under
certain common circumstances, changes to data may not be properly
flushed.  (Note that presently, msync will report success in these
conditions; this will be addressed in a separate PR.)

If a process has an NFS-backed file mmapped in and dirties the data,
the uid used for the RPC depends on what prompted the flush.  If the
process initiated the flush with msync (or sync), then the process's
euid is used; this would be the expected behavior.  However, if the
syncer daemon initiates the flush, then uid 0 is used.

Since it is common for NFS servers to use -maproot=-2:-2 or something
similar to map accesses by uid 0 to a different (non-privileged) uid,
the server behaves as if a different user was writing to the file than
the process that's using mmap.  Assuming the file is mode 0644 or more
restrictive, the server will return NFSERR_ACCES.

Formerly (e.g., in 8.2-RELEASE), this would cause the client's VM
subsystem to go into an infinite loop: the client would attempt to
flush to the server, the server returns NFSERR_ACCES, the client
leaves the pages on the dirty list but still needs to flush them,
repeat ad infinitum.

In r223054 (on stable/8; MFC r222586), this behavior was changed: the
VM system marks the pages as clean to avoid this type of loop.
However, this comes with its own set of problems.

As an example, consider the case where a process is gathering data
into an mmap-backed datastore.  The process gathers some data into the
datastore.  Before the process has a chance to msync, the syncer
causes the datastore to be flushed, but since it fails, the pages are
marked as clean.  After this, even if the data-gathering process calls
msync, the data will never be written to disk.

As another example, the process may not explicitly call msync, and
simply trust the VM system to ensure that the data is written.  While
such a process SHOULD call msync to verify that its data is
successfully written, this is akin to checking the return value of
close(2): important for data integrity but often skipped.  If a
process doesn't ever call msync, then the data will never be flushed.
This is contrary to the mmap(2) and msync(2) man pages, which both
state that "[t]he msync(2) system call is obsolete since BSD
implements a coherent file system buffer cache."  (A separate PR will
be opened to update these man pages to consider the present bug,
as well as the error-reporting usefulness of msync.)

>How-To-Repeat:
The trivial reproduction program would mmap a file over NFS as a user
other than root, when the server has -maproot=-2:-2 (or similar), with
a mode that would prevent uid -2 from writing to it.  Then, change a
byte, and exit.

The program attached here demonstrates a few different aspects of this
bug.  As provided, this demonstrates a case in which the data is not
properly written, but msync reports success.  (The present bug is only
about the data not being written; the failure of msync to report an
error will be addressed in a separate PR.)  See also "NOTES ON
ALTERNATE SCENARIO" below.

To use:

1. Set up an NFS server and a filesystem to export.  Do not set
   -maproot (alias -r) or -mapall in exports(5) for this filesystem.
2. Set up an NFS client using FreeBSD 8.2-STABLE after svn commit
   223054 (2011-06-13), 8.3-PRERELEASE, etc.  Mount the NFS filesystem
   on /mnt.  (IMPORTANT: Using 8.2 from before r223054 may have adverse
   effects on your network; see "EFFECTS ON 8.2-RELEASE", below.)
3. Pick a user other than root.  (Your usual uid is fine.)
4. Create a file /mnt/backing-store owned by this user, mode 0644.
5. Compile the attached program.
6. (Optional): Start netstat -iw1 -I$YOUR_NFS_INTERFACE
7. Run the program as the user from step 3.  Note the value that
   it expects to write.
8. (Optional): Once you see a 500k burst in the netstat from step 6,
   interrupt the program (froms step 7) with ^C.  (This is just to
   save you from waiting.)
9. On the NFS server, view backing-store using "od -X" or similar.
   Note that the value from step 7 has not been properly written.

* NOTES ON ALTERNATE SCENARIO

The #defines WAIT_FOR_SYNC and DO_MSYNC are provided to explore other
scenarios.  Notably, if WAIT_FOR_SYNC is not defined, then the
DO_MSYNC define allows the demonstration of the underlying bug: the
data will be properly flushed if and only if DO_MSYNC is enabled.
More fundamentally, the uid used to flush the data is the process's
euid if DO_MSYNC is enabled, but root if it is not.

This is in contradiction to the intended design of the VM system,
which is intended to make msync obsolete.  This obsolescence is
documented in the msync(2) and mmap(2) man pages (although the
error-reporting properties of msync(2) still make it useful; this will
be addressed in a separate doc PR.)

Note that, even if WAIT_FOR_SYNC is not defined, it still possible for
the syncer daemon to attempt to flush data between when a process
writes to a mapped page and when it calls msync; the call to sleep(3)
in the attached program is just to force that race.  (I hypothesize
that this would also be the case if the page were evicted due to
memory pressure, rather than flushed by the syncer.)

This indicates that msync(2) is a necessary, but NOT sufficient, way
for a process to verify that mapped files are flushed.  The idea that
it's necessary is contrary to the VM system's intent, as documented in
the msync(2) and mmap(2) man pages.  The fact that it's not sufficient
is contrary to POSIX's assertion that msync may be used "for
synchronized I/O data integrity completion" (and more explicit verbage
in the informative sections).

* EFFECTS ON 8.2-RELEASE

If this program is run on a FreeBSD 8.2 client prior to r223054
(2011-06-13), then a different bug will appear: the kernel will enter
an infinite loop in which it tries to flush the data, fails, and
retries indefinitely.  In extreme (but realistic and indeed observed)
circumstances, this can significantly lower the client's
responsiveness.  It also can put significant load on both the network
and the NFS server.

If you find yourself in this situation (which you can identify from
high traffic with netstat -i or, more reliably, using tcpdump to
inspect NFS traffic), chmod the backing-store file to mode 0666, wait
a moment for the client to flush the data, then chmod it back to 0644.
Do NOT remove the backing-store file before it's written!  If the
backing-store file is removed during this loop, the error will change
from NFSERR_ACCES to NFSERR_STALE, and the loop cannot (AFAIK) be
stopped without rebooting the client.

--- mmap-repro.c begins here ---
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/sysctl.h>

#include <assert.h>
#include <err.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

#define WAIT_FOR_SYNC
#define DO_MSYNC
#define FILENAME "/mnt/backing-store"
#define SIZE 0x100000 /* Large enough to see the sync when watching netstat */

#ifdef WAIT_FOR_SYNC
static void
sig_just_interrupt(int sig __unused)
{
	return;
}
#endif

int
main(void)
{
	int rv;
	int fd;
	pid_t *region;
	pid_t value;
	unsigned int i;
#ifdef WAIT_FOR_SYNC
	int sleeptime;
	size_t sleeptime_size = sizeof(sleeptime);
#endif

	/*
	 * Setup the mapped region.
	 */
	fd = open(FILENAME, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		err(1, "%s", FILENAME);
	rv = ftruncate(fd, SIZE);
	if (rv < 0)
		err(1, "%s: ftruncate", FILENAME);
	region = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (region == NULL)
		err(1, "%s: mmap", FILENAME);
	rv = close(fd);
	if (rv != 0)
		err(1, "%s: close", FILENAME);

	/*
	 * Fill the region with our pid.
	 */
	value = getpid();
	printf("Writing %08x\n", value);
	for (i = 0; i < SIZE / sizeof(value); i++)
		region[i] = value;

#ifdef WAIT_FOR_SYNC
	/*
	 * Optionally, give the syncer time to flush the page.  The
	 * bug is that this will be written with uid 0, not our euid.
	 */
	rv = sysctlbyname("kern.filedelay", &sleeptime, &sleeptime_size,
			  NULL, 0);
	if (rv != 0)
		err(1, "kern.filedelay");
	assert(sleeptime_size == sizeof(sleeptime));

	/* Add a little slack for the sync operation to complete. */
	sleeptime += 2;

	/*
	 * Set a handler for SIGINT while waiting.  This gives the
	 * user an opportunity to press ^C during this sleep to end it
	 * early, if they have reason to believe that the syncer has
	 * run (for instance, by seeing activity in netstat -iw1).
	 */
	printf("Waiting %i seconds for the syncer to run.\n", sleeptime);
	signal(SIGINT, sig_just_interrupt);
	sleep(sleeptime);
	signal(SIGINT, SIG_DFL);
#endif	/* WAIT_FOR_SYNC */

#ifdef DO_MSYNC
	/*
	 * Call msync with MS_SYNC, which should ensure that the
	 * mapped region gets properly written, and let us know if
	 * there's a problem.
	 * 
	 * If WAIT_FOR_SYNC is set, this call will reveal a second
	 * aspect to the bug: msync will return success, even though
	 * the changes were actually discarded.
	 * 
	 * If WAIT_FOR_SYNC is not set, this will demonstrate a
	 * difference between using msync and waiting for the pager:
	 * the msync will cause the dirty data to be properly written.
	 * This is at odds with the msync and mmap man pages, which
	 * specify that "[t]he msync() system call is obsolete since
	 * BSD implements a coherent file system buffer cache."
	 * 
	 * Note that calling msync here has the same end result as
	 * running sync(8) with the same euid as this process.
	 */
	rv = msync(region, SIZE, MS_SYNC);
	if (rv)
		err(1, "msync");
#endif	/* DO_MSYNC */

	/* This munmap is not needed, but is here for completeness. */
	rv = munmap(region, SIZE);
	if (rv)
		err(1, "munmap");

	return 0;
}
--- mmap-repro.c ends here ---

>Fix:
As a workaround, applications using mmap-backed files should call
msync explicitly.  Note that there is a race condition still present;
the syncer daemon may attempt to flush the mapped file before the
process calls msync.  However, calling msync soon after changing the
mapped region does lower the probability of failure.

I haven't implemented a fix, but my current thoughts are as follows.
In sys/vm/vm_object.c:vm_object_set_writeable_dirty, the vm_object_t
should be marked with the credentials of the process that is writing.
These credentials should be passed to VOP_PUTPAGES and used for the
eventual NFS RPC in sys/nfsclient/nfs_bio.c:nfs_putpages.  (Note that
the line "cred = curthread->td_ucred;" in
sys/nfsclient/nfs_bio.c:nfs_putpages is flagged with /* XXX */.)

If I understand correctly, this parameter can be added without
breaking source or binary compatibility with third-party filesystems.
On the caller side, it seems unlikely that any third-party code calls
VOP_PUTPAGES, but this change would break both source and binary
compatibility if it did.  There are uglier ways to implement this
while maintaining compatibility, but these are not described here.
>Release-Note:
>Audit-Trail:
>Unformatted: