silly write caching in nfs3
Bruce Evans
brde at optusnet.com.au
Sat Feb 27 04:21:16 UTC 2016
On Fri, 26 Feb 2016, Bruce Evans wrote:
> nfs3 is slower than in old versions of FreeBSD. I debugged one of the
> reasons today.
> ...
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
or commit on close. I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc(). This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
>
> newnfs has quite different code for nfs_writerpc(). Most of it was
> moved to another function in another file. I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.
This quick fix seems to give the same behaviour as in oldnfs. It also
fixes some bugs in comments in nfs_fsync() (where I tried to pass a
non-null cred, but none is available; the ARGSUSED bug is in many
other functions):
X Index: nfs_clvnops.c
X ===================================================================
X --- nfs_clvnops.c	(revision 296089)
X +++ nfs_clvnops.c	(working copy)
X @@ -1425,6 +1425,23 @@
X  	}
X  	if (DOINGASYNC(vp))
X  		*iomode = NFSWRITE_FILESYNC;
X +	if (error == 0 && NFS_ISV3(vp)) {
X +		/*
X +		 * Break seeing concurrent changes by other clients,
X +		 * since without this the next nfs_open() would
X +		 * invalidate our write buffers. This is worse than
X +		 * useless unless the write is committed on close or
X +		 * fsynced, since otherwise NMODIFIED remains set so
X +		 * the next nfs_open() will still invalidate the write
X +		 * buffers. Unfortunately, this cannot be placed in
X +		 * ncl_flush() where NMODIFIED is cleared since
X +		 * credentials are unavailable there for at least
X +		 * calls by nfs_fsync().
X +		 */
X +		mtx_lock(&(VTONFS(vp))->n_mtx);
X +		VTONFS(vp)->n_mtime = nfsva.na_mtime;
X +		mtx_unlock(&(VTONFS(vp))->n_mtx);
X +	}
X  	if (error && NFS_ISV4(vp))
X  		error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
X  	return (error);
X @@ -2613,9 +2630,8 @@
X  }
X  
X  /*
X - * fsync vnode op. Just call ncl_flush() with commit == 1.
X + * fsync vnode op.
X   */
X -/* ARGSUSED */
X  static int
X  nfs_fsync(struct vop_fsync_args *ap)
X  {
X @@ -2622,8 +2638,12 @@
X  
X  	if (ap->a_vp->v_type != VREG) {
X  		/*
X +		 * XXX: this comment is misformatted (after fixing its
X +		 * internal errors) and misplaced.
X +		 *
X  		 * For NFS, metadata is changed synchronously on the server,
X -		 * so there is nothing to flush. Also, ncl_flush() clears
X +		 * so the only thing to flush is data for regular files.
X +		 * Also, ncl_flush() clears
X  		 * the NMODIFIED flag and that shouldn't be done here for
X  		 * directories.
X  		 */
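For context, the open-time check that discards the buffers has roughly the
following shape (a paraphrased sketch from memory, not the exact code in
nfs_clvnops.c): if the attributes fetched at open time show an mtime that
differs from the cached n_mtime, the client invalidates its buffers and only
then records the new mtime.

	/* Sketch only: approximate shape of the check in nfs_open(). */
	if (NFS_TIMESPEC_COMPARE(&np->n_mtime, &vattr.va_mtime)) {
		/* Server mtime changed since we last looked: drop buffers. */
		error = ncl_vinvalbuf(vp, V_SAVE, ap->a_td, 1);
		if (error == 0) {
			mtx_lock(&np->n_mtx);
			np->n_mtime = vattr.va_mtime;
			mtx_unlock(&np->n_mtx);
		}
	}

The patch above makes the client's cached n_mtime match the post-write mtime
returned by the server, so this comparison no longer fires merely because of
the client's own writes.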
> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order. This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order. I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?). I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization. Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear. For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active. I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K. The server fs block size is 16K, and
> using a smaller block size usually helps by giving some delayed
> writes which can be clustered better. (The non-nfs parts of the
> server could be smarter and do this intentionally. The out-of-order
> buffers look like random writes to the server.) 16K i/o sizes
> otherwise work OK, but 32K i/o sizes are much slower for unknown
> reasons.
Size 16K seems to work better now.
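For anyone reproducing this, the tuning above corresponds roughly to the
following knobs (a sketch only; exact names and defaults vary between
FreeBSD versions, so check nfsd(8), mount_nfs(8) and sysctl -a, and the
server path and mount point here are placeholders):

	# server /etc/rc.conf: limit nfsd threads to 4
	nfs_server_flags="-u -t -n 4"

	# client: limit nfsiod threads (the nfsiodmax varied below)
	sysctl vfs.nfs.iodmax=4

	# client mount with 16K i/o sizes
	mount -t nfs -o rsize=16384,wsize=16384 server:/export /mnt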
I also use:
- turn off most interrupt moderation. This reduces (ping) latency from
~125 usec to ~75 usec for em on PCIe (after already turning off interrupt
moderation on the server to reduce it from 150-200 usec). 75 usec
is still a lot, though it is about 3 times lower than the default
misconfiguration. Downgrading to older lem on PCI/33 reduces it to
52 usec. Downgrading to DEVICE_POLLING reduces it to about 40. The
downgrades are upgrades :-(. Not using a switch reduces it by about
another 20.
Low latency is important for small i/o's. I was surprised that it also
helps a lot for large i/o's. Apparently it changes the timing enough
to reduce the out-of-order buffers significantly.
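The interrupt moderation setting is driver-specific. For em(4) the knobs are
approximately the following loader tunables (names from memory and dependent
on the driver version, so treat them as assumptions; 0 means no delay, i.e.
no moderation):

	# /boot/loader.conf (em(4) interrupt delay tunables)
	hw.em.rx_int_delay=0
	hw.em.rx_abs_int_delay=0
	hw.em.tx_int_delay=0
	hw.em.tx_abs_int_delay=0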
The default misconfiguration with 20 nfsiod's is worse than I expected
(on an 8 core system). For (old) "iozone auto" which starts with a file
size of 1MB, the write speed is about 2MB/sec with 20 nfsiod's and
22 MB/sec with 1 nfsiod. 2-4 nfsiod's work best. They give 30-40MB/sec
for most file sizes. Apparently, with 20 nfsiod's the write of 1MB is
split up into almost twenty pieces of 50K each (6 or 7 8K buffers each),
and the final order is perhaps even worse than random. I think it is
basically sequential with about <number of nfsiods> seeks for all file
sizes between 1MB and many MB.
I also use:
- no PREEMPTION and no IPI_PREEMPTION on SMP systems. This limits context
switching.
- no SCHED_ULE. HZ = 100. This also limits context switching.
With more or fairer context switching, all nfsiods are more likely to run,
causing more damage.
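In kernel-config terms this is roughly the following (a sketch; the config
name is made up, and HZ can equally be set as kern.hz=100 in
/boot/loader.conf):

	# kernel config fragment on top of GENERIC
	include		GENERIC
	ident		NFSTEST
	nooptions	PREEMPTION	# IPI_PREEMPTION is already off by default
	nooptions	SCHED_ULE
	options		SCHED_4BSD
	options		HZ=100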
More detailed results for "iozone 1 65536" with nfsiodmax=64, oldnfs, and
mostly the best known other tuning:
- first run write speed 2MB/S (probably still using 20 nfsiod's)
(all rates use disk marketing MB)
- second run 9MB/S
- after repeated runs, 250MB/S
- the speed kept mostly dropping, and reached 21K/S
- server stats for next run at 29K/S: 139 blocks tested and order of
24 fixed (the server has an early version of what is in -current,
with more debugging)
with nfsiodmax=20:
- most runs 2-2.2MB/S; one at 750K/S
- server stats for a run at 2.2MB/S: 135 blocks tested and 86 fixed
with nfsiodmax=4:
- 5.8-6.5MB/S
- server stats for a run at 6.0MB/S: 135 blocks tested and 0 fixed
with nfsiodmax=2:
- 4.8-5.2MB/S
- server stats for a run at 5.1MB/S: 138 blocks tested and 0 fixed
with nfsiodmax=1:
- 3.4MB/S
- server stats: 138 blocks tested and 0 fixed
For iozone 512 65536:
with nfsiodmax=1:
- 34.7MB/S
- server stats: 65543 blocks tested and 0 fixed
with nfsiodmax=2:
- 45.9MB/S (this is close to the drive's speed and faster than writing
directly on the server. It is faster because the clustering accidentally
works better)
- server stats: 65550 blocks tested and 578 fixed
with nfsiodmax=4:
- 45.6MB/S
- server stats: 65550 blocks tested and 2067 fixed
with nfsiodmax=20:
- 21.4MB/S
- server stats: 65576 blocks tested and 12057 fixed
(it is easy to see how 7 nfsiods could give 1/7 = 14% of blocks
out of order. The server is fixing up almost 20%, but that is
not enough)
with nfsiodmax=64 (this run caused the server to stop responding):
- test aborted at 500+MB
- server stats: about 10000 blocks fixed
with nfsiodmax=64 again:
- 9.6MB/S
- server stats: 65598 blocks tested and 14034 fixed
The nfsiod's get scheduled almost equally.
Bruce