Fixes to allow write clustering of NFS writes from a FreeBSD
NFS client
John Baldwin
jhb at freebsd.org
Thu Aug 25 21:09:31 UTC 2011
On Thursday, August 25, 2011 3:24:15 pm Bruce Evans wrote:
> On Thu, 25 Aug 2011, John Baldwin wrote:
>
> > I was doing some analysis of compiles over NFS at work recently and noticed
> > from 'iostat 1' on the NFS server that all my NFS writes were always 16k
> > writes (meaning that writes were never being clustered). I added some
>
> Did you see the old patches for this by Bjorn Gronwall? They went through
> many iterations. He was mainly interested in the !async case and I was
> mainly interested in the async case...
Ah, no, I had not seen these, thanks.
> > and moved it into a function to compute a sequential I/O heuristic that
> > could be shared by both reads and writes. I also updated the sequential
> > heuristic code to advance the counter based on the number of 16k blocks
> > in each write instead of just doing ++ to match what we do for local
> > file writes in sequential_heuristic() in vfs_vnops.c. Using this did
> > give me some measure of NFS write clustering (though I can't peg my
> > disks at MAXPHYS the way a dd to a file on a local filesystem can). The
>
> I got close to it. The failure modes were mostly burstiness of i/o, where
> the server buffer cache seemed to fill up so the client would stop sending
> and stay stopped for too long (several seconds; enough to reduce the
> throughput by 40-60%).
Hmm, I can get writes up to around 40-50k, but not 128k. My test is to just
dd from /dev/zero to a file on the NFS client using a blocksize of 64k or so.
> > patch for these changes is at
> > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
> >
> > (This also fixes a bug in the new NFS server in that it wasn't actually
> > clustering reads since it never updated nh->nh_nextr.)
> >
> > Combining the two changes together gave me about a 1% reduction in wall
> > time for my builds:
> >
> > +------------------------------------------------------------------------------+
> > |+ + ++ + +x++*x xx+x x x|
> > | |___________A__|_M_______|_A____________| |
> > +------------------------------------------------------------------------------+
> > N Min Max Median Avg Stddev
> > x 10 1869.62 1943.11 1881.89 1886.12 21.549724
> > + 10 1809.71 1886.53 1869.26 1860.706 21.530664
> > Difference at 95.0% confidence
> > -25.414 +/- 20.2391
> > -1.34742% +/- 1.07305%
> > (Student's t, pooled s = 21.5402)
> >
> > One caveat: I tested both of these patches on the old NFS client and server
> > on 8.2-stable. I then ported the changes to the new client and server and
> > while I made sure they compiled, I have not tested the new client and server.
>
> Here is the version of Bjorn's patches that I last used (in 8-current in
> 2008):
>
> % Index: nfs_serv.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v
> % retrieving revision 1.182
> % diff -u -2 -r1.182 nfs_serv.c
> % --- nfs_serv.c 28 May 2008 16:23:17 -0000 1.182
> % +++ nfs_serv.c 1 Jun 2008 05:52:45 -0000
> % @@ -107,14 +107,18 @@
> % #define MAX_COMMIT_COUNT (1024 * 1024)
> %
> % -#define NUM_HEURISTIC 1017
> % +#define NUM_HEURISTIC 1031 /* Must be prime! */
> % +#define MAX_REORDERED_RPC 4
> % +#define HASH_MAXSTEP 0x3ff
> % #define NHUSE_INIT 64
> % #define NHUSE_INC 16
> % #define NHUSE_MAX 2048
> % +#define NH_TAG(vp) ((uint32_t)((uintptr_t)vp / sizeof(struct vnode)))
> % +CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1));
Hmm, aside from 1017 not being prime (it is 3 * 339), I'm not sure what the
reasons for the rest of these changes are.
> %
> % static struct nfsheur {
> % - struct vnode *nh_vp; /* vp to match (unreferenced pointer) */
> % - off_t nh_nextr; /* next offset for sequential detection */
> % - int nh_use; /* use count for selection */
> % - int nh_seqcount; /* heuristic */
> % + off_t nh_nextoff; /* next offset for sequential detection */
> % + uint32_t nh_tag; /* vp tag to match */
> % + uint16_t nh_use; /* use count for selection */
> % + uint16_t nh_seqcount; /* in units of logical blocks */
> % } nfsheur[NUM_HEURISTIC];
Hmm, not sure why this only stores the tag and uses uint16_t values here.
The size difference is a few KB at best, and I'd rather store the full
vnode to avoid oddities from hash collisions.
> % @@ -131,8 +135,14 @@
> % static int nfs_commit_blks;
> % static int nfs_commit_miss;
> % +static int nfsrv_cluster_writes = 1;
> % +static int nfsrv_cluster_reads = 1;
> % +static int nfsrv_reordered_io;
> % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, "");
> % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, "");
> % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, "");
> %
> % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, "");
> % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, "");
> % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, "");
> % struct nfsrvstats nfsrvstats;
> % SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW,
> % @@ -145,4 +155,73 @@
> %
> % /*
> % + * Detect sequential access so that we are able to hint the underlying
> % + * file system to use clustered I/O when appropriate.
> % + */
> % +static int
> % +nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp)
> % +{
> % + struct nfsheur *nh;
> % + unsigned hi, step;
> % + int try = 8;
> % + int nblocks, lblocksize;
> % +
> % + /*
> % + * Locate best nfsheur[] candidate using double hashing.
> % + */
> % +
> % + hi = NH_TAG(vp) % NUM_HEURISTIC;
> % + step = NH_TAG(vp) & HASH_MAXSTEP;
> % + step++; /* Step must not be zero. */
> % + nh = &nfsheur[hi];
I can't speak to whether using a variable step makes an appreciable
difference. I have not examined that in detail in my tests.
> % +
> % + while (try--) {
> % + if (nfsheur[hi].nh_tag == NH_TAG(vp)) {
> % + nh = &nfsheur[hi];
> % + break;
> % + }
> % + if (nfsheur[hi].nh_use > 0)
> % + --nfsheur[hi].nh_use;
> % + hi = hi + step;
> % + if (hi >= NUM_HEURISTIC)
> % + hi -= NUM_HEURISTIC;
> % + if (nfsheur[hi].nh_use < nh->nh_use)
> % + nh = &nfsheur[hi];
> % + }
> % +
> % + if (nh->nh_tag != NH_TAG(vp)) { /* New entry. */
> % + nh->nh_tag = NH_TAG(vp);
> % + nh->nh_nextoff = uio->uio_offset;
> % + nh->nh_use = NHUSE_INIT;
> % + nh->nh_seqcount = 1; /* Initially assume sequential access. */
> % + } else {
> % + nh->nh_use += NHUSE_INC;
> % + if (nh->nh_use > NHUSE_MAX)
> % + nh->nh_use = NHUSE_MAX;
> % + }
> % +
> % + /*
> % + * Calculate heuristic
> % + */
> % +
> % + lblocksize = vp->v_mount->mnt_stat.f_iosize;
> % + nblocks = howmany(uio->uio_resid, lblocksize);
This is similar to what I pulled out of sequential_heuristic() except
that it doesn't hardcode 16k. There is a big comment above the hard-coded
16384 saying it isn't about the blocksize though, so I'm not sure which is
more correct. I imagine we'd want to use the same strategy in both places
though. Comment from vfs_vnops.c:
/*
* f_seqcount is in units of fixed-size blocks so that it
* depends mainly on the amount of sequential I/O and not
* much on the number of sequential I/O's. The fixed size
* of 16384 is hard-coded here since it is (not quite) just
* a magic size that works well here. This size is more
* closely related to the best I/O size for real disks than
* to any block size used by software.
*/
fp->f_seqcount += howmany(uio->uio_resid, 16384);
> % + if (uio->uio_offset == nh->nh_nextoff) {
> % + nh->nh_seqcount += nblocks;
> % + if (nh->nh_seqcount > IO_SEQMAX)
> % + nh->nh_seqcount = IO_SEQMAX;
> % + } else if (uio->uio_offset == 0) {
> % + /* Seek to beginning of file, ignored. */
> % + } else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
> % + MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) {
> % + nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */
Ah, this is a nice touch! I had noticed reordered I/O's resetting my
clustered I/O count. I should try this extra step.
> % + } else
> % + nh->nh_seqcount /= 2; /* Not sequential access. */
Hmm, this is a bit different as well. sequential_heuristic() just
drops all clustering (seqcount = 1) here so I had followed that. I do
wonder if this change would be good for "normal" I/O as well? (Again,
I think it would do well to have "normal" I/O and NFS generally use
the same algorithm, but perhaps with the extra logic to handle reordered
writes more gracefully for NFS.)
> % +
> % + nh->nh_nextoff = uio->uio_offset + uio->uio_resid;
Interesting. So this assumes the I/O never fails.
> % @@ -1225,4 +1251,5 @@
> % vn_finished_write(mntp);
> % VFS_UNLOCK_GIANT(vfslocked);
> % + bwillwrite(); /* After VOP_WRITE to avoid reordering. */
> % return(error);
> % }
Hmm, this seems to be related to avoiding overloading the NFS server's
buffer cache?
> % @@ -1492,4 +1519,6 @@
> % }
> % if (!error) {
> % + if (nfsrv_cluster_writes)
> % + ioflags |= nfsrv_sequential_access(uiop, vp);
> % error = VOP_WRITE(vp, uiop, ioflags, cred);
> % /* XXXRW: unlocked write. */
> % @@ -1582,4 +1611,5 @@
> % }
> % splx(s);
> % + bwillwrite(); /* After VOP_WRITE to avoid reordering. */
> % return (0);
> % }
Ah, this code is no longer present in 8 (but is in 7).
> % @@ -3827,5 +3857,9 @@
> % for_ret = VOP_GETATTR(vp, &bfor, cred, td);
> %
> % - if (cnt > MAX_COMMIT_COUNT) {
> % + /*
> % + * If count is 0, a flush from offset to the end of file
> % + * should be performed according to RFC 1813.
> % + */
> % + if (cnt == 0 || cnt > MAX_COMMIT_COUNT) {
> % /*
> % * Give up and do the whole thing
This appears to be a separate, standalone fix.
> % @@ -3871,5 +3905,5 @@
> % bo = &vp->v_bufobj;
> % BO_LOCK(bo);
> % - while (cnt > 0) {
> % + while (!error && cnt > 0) {
> % struct buf *bp;
> %
> % @@ -3894,5 +3928,5 @@
> % bremfree(bp);
> % bp->b_flags &= ~B_ASYNC;
> % - bwrite(bp);
> % + error = bwrite(bp);
> % ++nfs_commit_miss;
I think you can just do a 'break' here without having to modify the while
condition. This also seems to be an unrelated bugfix.
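That is, something in the shape of the following (pseudocode sketched after
the quoted loop, not a compilable extract of nfs_serv.c):

```c
	while (cnt > 0) {
		...
		bremfree(bp);
		bp->b_flags &= ~B_ASYNC;
		error = bwrite(bp);
		if (error != 0)
			break;		/* stop flushing on first error */
		++nfs_commit_miss;
		...
	}
```

which keeps the loop condition untouched and makes the early-exit on a
write error explicit.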
> % } else
> % Index: nfs_syscalls.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v
> % retrieving revision 1.119
> % diff -u -2 -r1.119 nfs_syscalls.c
> % --- nfs_syscalls.c 30 Jun 2008 20:43:06 -0000 1.119
> % +++ nfs_syscalls.c 2 Jul 2008 07:12:57 -0000
> % @@ -86,5 +86,4 @@
> % int nfsd_waiting = 0;
> % int nfsrv_numnfsd = 0;
> % -static int notstarted = 1;
> %
> % static int nfs_privport = 0;
> % @@ -448,7 +447,6 @@
> % procrastinate = nfsrvw_procrastinate;
> % NFSD_UNLOCK();
> % - if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
> % - nd->nd_procnum == NFSPROC_WRITE &&
> % - procrastinate > 0 && !notstarted))
> % + if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
> % + procrastinate > 0))
> % error = nfsrv_writegather(&nd, slp,
> % nfsd->nfsd_td, &mreq);
This no longer seems to be present in 8.
> I have forgotten many details but can dig them out from large private
> mails about this if you care. It looks like most of the above is about
> getting the seqcount write; probably similar to what you did. -current
> is still missing the error check for the bwrite() -- that's the only
> bwrite() in the whole server. However, the fix isn't good -- one obvious
> bug is that a later successful bwrite() destroys the evidence of a
> previous bwrite() error.
One thing I had done was to use a separate set of heuristics for reading vs
writing. However, that is possibly dubious (and we don't do it for local
I/O), so I can easily drop that feature if desired.
--
John Baldwin