misc/64816: mmap and/or ftruncate does not work correctly on
nfs mounted file systems
Peter Edwards
peadar at freebsd.org
Mon Apr 19 17:10:11 PDT 2004
The following reply was made to PR misc/64816; it has been noted by GNATS.
From: Peter Edwards <peadar at freebsd.org>
To: freebsd-gnats-submit at FreeBSD.org, patrick at spacesurfer.com
Cc: Stephen McKay <smckay at internode.on.net>
Subject: Re: misc/64816: mmap and/or ftruncate does not work correctly on
nfs mounted file systems
Date: Tue, 20 Apr 2004 01:05:13 +0100
This is a multi-part message in MIME format.
--------------060505050408010706000408
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
I'm pretty sure I know what's going on here.
There's an optimisation in nfs_getpages() that returns partially valid
pages on the assumption that the partially _invalid_ parts are only
those beyond the end of the file. (In the test program, for example,
assuming the file doesn't exist initially, this happens when the code
hits this line:
> map[SZ-1]=1;
The first 1K of the file is valid, but the rest isn't. (because it
doesn't exist.)
However, once a page is mapped, and dragged in for a fault, if requested
page in nfs_getpages() isn't totally valid on return, the invalid parts
are zeroed and marked as valid by vm_fault().
Later, at the ftruncate(), we call nfs_flush(). While writing out the
dirty data, it gets marked as no longer "valid" (vfs_busy_pages(),
called eventually from bwrite()). However, the end of the page doesn't
go through the bwrite() interface (because it's not present in the
file), so it's not caught by this.
For our case, by the time we finish ftruncate(), we now have the first
two DEV_BSIZE chunks of the first page marked invalid and not clean, and
the remaining 6 marked as valid and clean (assuming intel architecture).
This is the exact opposite of the assumption that the optimisation
makes: that the only clean parts happen to be the "real" parts of the file.
The attached patch addresses the problem, and it's run through a
buildworld over NFS. It simply ensures that a partially valid page meets
the previously assumed criteria.
I'm awaiting a review from a somewhat more experienced hacker that I've
been pestering before committing this, but in the meantime, I thought
I'd share the analysis and patch. Any feedback appreciated.
--------------060505050408010706000408
Content-Type: text/plain;
name="patchnfs_getpages.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="patchnfs_getpages.txt"
Index: sys/nfsclient/nfs_bio.c
===================================================================
RCS file: /usr/cvs/FreeBSD-CVS/src/sys/nfsclient/nfs_bio.c,v
retrieving revision 1.130
diff -u -r1.130 nfs_bio.c
--- sys/nfsclient/nfs_bio.c 14 Apr 2004 23:23:55 -0000 1.130
+++ sys/nfsclient/nfs_bio.c 19 Apr 2004 23:19:33 -0000
@@ -40,6 +40,7 @@
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/kernel.h>
+#include <sys/sysctl.h>
#include <sys/mount.h>
#include <sys/proc.h>
#include <sys/resourcevar.h>
@@ -113,6 +114,7 @@
struct nfsmount *nmp;
vm_object_t object;
vm_page_t *pages;
+ struct nfsnode *np;
GIANT_REQUIRED;
@@ -120,6 +122,7 @@
td = curthread; /* XXX */
cred = curthread->td_ucred; /* XXX */
nmp = VFSTONFS(vp->v_mount);
+ np = VTONFS(vp);
pages = ap->a_m;
count = ap->a_count;
@@ -137,26 +140,48 @@
npages = btoc(count);
/*
- * If the requested page is partially valid, just return it and
- * allow the pager to zero-out the blanks. Partially valid pages
- * can only occur at the file EOF.
+ * If the requested page is partially valid, return it if all
+ * the invalid parts are beyond the end of the file.
+ * Partially valid pages can only occur at the file EOF.
+ *
+ * Note:
+ * The bits beyond the file may be marked as valid if the file is
+ * mapped: vm_fault() calls vm_page_zero_invalid() for the sections
+ * of the page that the pager doesn't validate, (for NFS, those
+ * beyond the end of the file). Unlike the other bits in the page,
+ * these will not get cleared by vinvalbuf(), so we can end up in the
+ * situation where the part of the page representing real data in the
+ * file is not valid, while the part extending past the end is marked
+ * as such.
*/
{
+ int resid;
+ u_long mask; /* Largest type for "valid" in vm_page. */
vm_page_t m = pages[ap->a_reqpage];
VM_OBJECT_LOCK(object);
vm_page_lock_queues();
if (m->valid != 0) {
- /* handled by vm_fault now */
- /* vm_page_zero_invalid(m, TRUE); */
- for (i = 0; i < npages; ++i) {
- if (i != ap->a_reqpage)
- vm_page_free(pages[i]);
+ off_t pgoff = IDX_TO_OFF(m->pindex);
+ off_t fileoff = np->n_size;
+ resid = fileoff - pgoff;
+ KASSERT(fileoff - pgoff < PAGE_SIZE,
+ ("partially valid page not at end of file"));
+ mask = 1;
+ for (mask = 1; m->valid & mask; mask <<= 1) {
+ resid -= DEV_BSIZE;
+ if (resid <= 0) {
+ /* Page is valid enough, return it */
+ for (i = 0; i < npages; ++i) {
+ if (i != ap->a_reqpage)
+ vm_page_free(pages[i]);
+ }
+ vm_page_unlock_queues();
+ VM_OBJECT_UNLOCK(object);
+ return(0);
+ }
}
- vm_page_unlock_queues();
- VM_OBJECT_UNLOCK(object);
- return(0);
}
vm_page_unlock_queues();
VM_OBJECT_UNLOCK(object);
@@ -229,8 +254,6 @@
*/
m->valid = 0;
vm_page_set_validclean(m, 0, size - toff);
- /* handled by vm_fault now */
- /* vm_page_zero_invalid(m, TRUE); */
} else {
/*
* Read operation was short. If no error occured
--------------060505050408010706000408--
More information about the freebsd-bugs
mailing list