NFS reads vs. writes

Bruce Evans brde at optusnet.com.au
Mon Jan 4 08:30:20 UTC 2016


On Sun, 3 Jan 2016, Rick Macklem wrote:

> Mikhail T. wrote:
>> On 03.01.2016 02:16, Karli Sjöberg wrote:
>>>
>>> The difference between "mount" and "mount -o async" should tell you if
>>> you'd benefit from a separate log device in the pool.
>>>
>> This is not a ZFS problem. The same filesystem is being read in both
>> cases. The same data is being read from and written to the same
>> filesystems. For some reason, it is much faster to read via NFS than to
>> write to it, however.
>>
> This issue isn't new. It showed up when Sun introduced NFS in 1985.

nfs writes are slightly faster than reads in most configurations for me.
This is because writes are easier to stream and most or all configurations
don't do a very good job of trying to stream reads.

> NFSv3 did change things a little, by allowing UNSTABLE writes.

Of course I use async mounts (and ffs) if I want writes to be fast.  Both
the server and the client fs should be mounted async.  This is most important
for the client.

> Here's what an NFSv3 or NFSv4 client does when writing:

nfs also has a badly designed sysctl vfs.nfsd.async which does something
more hackish for nfsv2 and might have undesirable side effects for nfsv3+.
Part of its bad design is that it is global.  It affects all clients.
This might be a feature if the clients don't support async mounts.  I never
use this.
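
(For anyone who wants to see how that knob is set on a server, it is an
ordinary sysctl; sysctl(8) from the shell does the same thing, but here is
a minimal sketch in C, assuming the OID name vfs.nfsd.async as above:)

	/* Print the current value of vfs.nfsd.async on the server. */
	#include <sys/types.h>
	#include <sys/sysctl.h>

	#include <stdio.h>

	int
	main(void)
	{
		int val;
		size_t len = sizeof(val);

		if (sysctlbyname("vfs.nfsd.async", &val, &len, NULL, 0) == -1) {
			perror("sysctlbyname");
			return (1);
		}
		printf("vfs.nfsd.async = %d\n", val);
		return (0);
	}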

> - Issues some # of UNSTABLE writes. The server need only have these in server
>  RAM before replying NFS_OK.
> - Then the client does a Commit. At this point the NFS server is required to
>  store all the data written in the above writes and related metadata on stable
>  storage before replying NFS_OK.

async mounts in the FreeBSD client are implemented by 2 lines of code
(and "async" in the list of supported options) that seem to work by
pretending that UNSTABLE writes are FILESYNC, so the Commit step is a no-op.
Thus everything except possibly metadata is async and unstable, but the
client doesn't know this.
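
The protocol side of this is easy to see from the NFSv3 stability levels.
A minimal sketch of the idea (the stable_how values are from RFC 1813; the
helper and its names are only illustrative, not the actual client code):

	#include <stdio.h>

	/* NFSv3 write stability levels, from RFC 1813. */
	enum stable_how {
		UNSTABLE  = 0,	/* server may ack from RAM; a Commit must follow */
		DATA_SYNC = 1,	/* data on stable storage, metadata maybe not */
		FILE_SYNC = 2	/* data and metadata on stable storage */
	};

	/*
	 * What an async client mount effectively does: treat every write
	 * as if the server had done FILE_SYNC, so the later Commit becomes
	 * a no-op and the client never waits for stable storage.
	 */
	static enum stable_how
	effective_stability(int mnt_async, enum stable_how server_reply)
	{
		return (mnt_async ? FILE_SYNC : server_reply);
	}

	int
	main(void)
	{
		printf("default mount: %d, async mount: %d\n",
		    effective_stability(0, UNSTABLE),
		    effective_stability(1, UNSTABLE));
		return (0);
	}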

If the server fs is mounted with inconsistent async flags, or the async
flags give inconsistent policies, some async writes may turn into sync
writes and vice versa.  The worst inconsistency is a default (delayed
Commit) client with an async (non-soft-updates) server.  Then async breaks
the Commits: the data is written sync, but the metadata is still written
async.  My version has partial fixes (it syncs inodes but not directories
in fsync() for async mounts).

>  --> This is where the "sync" vs "async" is a big issue. If you use "sync=disabled"
>      (I'm not a ZFS guy, but I think that is what the ZFS option looks like) you
>      *break* the NFS protocol (ie. violate the RFC) and put your data at some risk,
>      but you will typically get better (often much better) write performance.

Is zfs really as broken as ffs with async mounts?  It takes ignoring FSYNC/
IO_SYNC flags when mounted async to get full brokenness.  async for ffs was
originally a hack to do something like that.  I think it now honors the
sync flags for everything except inodes and directories.

Syncing everything is too slow to use for everything, but the delayed
Commit should make it usable, depending on how long the delay is.  Perhaps
it can interact badly with the server fs's delays.  Something like a
pipeline stall on a CPU -- to satisfy a synchronization request for 1 file,
it might be necessary to wait for many MB of i/o for other files first.

> Also, the NFS server was recently tweaked so that it could handle 128K rsize/wsize,
> but the FreeBSD client is limited to MAXBSIZE and this has not been increased
> beyond 64K. To do so, you have to change the value of this in the kernel sources

Larger i/o sizes give negative benefits for me.  Changes in the default
sizes give confusing performance differences, with larger sizes mostly
worse, but there are too many combinations to test and I never figured out
the details, so I now force small sizes at mount time.  This depends on
having a fast network.  With a really slow network, the i/o sizes must be
very large or the streaming must be good.

> and rebuild your kernel. (The problem is that increasing MAXBSIZE makes the kernel
> use more KVM for the buffer cache and if a system isn't doing significant client
> side NFS, this is wasted.)
> Someday, I should see if MAXBSIZE can be made a TUNABLE, but I haven't done that.
> --> As such, unless you use a Linux NFS client, the reads/writes will be 64K, whereas
>    128K would work better for ZFS.

Not for ffs with 16K-blocks.  Clustering usually turns these into
128K-blocks, but nfs clients see little difference and may even work
better with 8K-blocks.
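
(If anyone does want to experiment with larger sizes anyway, the limit Rick
refers to is the compile-time constant MAXBSIZE; from memory it lives in
sys/sys/param.h, so raising it means editing that and rebuilding the
kernel, at the cost of more KVM for the buffer cache as he notes.  A
trivial program shows the stock value on a FreeBSD system:)

	/* Print the compile-time limit on client rsize/wsize. */
	#include <sys/param.h>

	#include <stdio.h>

	int
	main(void)
	{
		printf("MAXBSIZE = %d\n", (int)MAXBSIZE);
		return (0);
	}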

Bruce

