NFS reads vs. writes

Rick Macklem rmacklem at uoguelph.ca
Mon Jan 4 13:43:31 UTC 2016


Mikhail T. wrote:
> On 03.01.2016 20:37, Rick Macklem wrote:
> > This issue isn't new. It showed up when Sun introduced NFS in 1985.
> > NFSv3 did change things a little, by allowing UNSTABLE writes.
> Thank you very much, Rick, for the detailed explanation.
> > If you use "sync=disabled"
> >       (I'm not a ZFS guy, but I think that is what the ZFS option looks
> >       likes) you
> >       *break* the NFS protocol (ie. violate the RFC) and put your data at
> >       some risk,
> >       but you will typically get better (often much better) write
> >       performance.
> Yes, indeed. Disabling sync got the write throughput all the way up to
> about 86Mb/s... I still don't fully understand why local writes are
> able to achieve this speed without async and without being considered
> dangerous.
The risk of data loss when using "sync=disabled" goes like this:
- Client does a series of writes, followed by a Commit.
- Server replies OK to the Commit, but hasn't yet actually committed the data
  to stable storage.
- Server crashes/reboots just after sending the Commit reply and before getting
  the data onto stable storage, losing some of the recently written data.
- Client flushes its cache when the OK reply to the Commit is received, because
  it assumes (per the RFC) that the data is "safely stored". (The client never
  crashes/reboots.)
After this, it might be days/weeks/months before someone notices that the file
data is stale/corrupted. (The problem is that the client has no reason to think
that the data might be corrupt. It didn't crash/reboot and it doesn't know that
the server crashed/rebooted. All it saw was a period of slow response while the
server crashed/rebooted.)
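
For concreteness, the knob in question is the per-dataset ZFS "sync" property.
A minimal sketch of flipping it and putting it back (the dataset name
"tank/export" is just a placeholder, not anything from this thread):

  # Check the current setting (the default is "standard").
  zfs get sync tank/export

  # Trade durability for NFS write speed (this is what violates the RFC).
  zfs set sync=disabled tank/export

  # Restore the protocol-conformant behaviour afterwards.
  zfs set sync=standard tank/export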

For a local write, the process will be killed by the crash/reboot (or at least
it will be known that the machine crashed just as the process was exiting).
On POSIX-type systems, data loss is expected in this case.

However, as you can see, the risk of data loss only lasts for a short period
of time right after the server replies to a Commit, and if you have a reliable
server with a good UPS etc., the risk may be acceptable.
(To me the important part is that people be aware of the risk so that they
 can make a judgement call.)

Btw, in a trivial test I did here (don't take this as a benchmark), with two
partitions on the same drive (one UFS and one a zpool) and everything else the
same, I see a write rate for ZFS of about 25% of what I see for UFS.
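
If anyone wants to try something similar, a comparison along these lines can
be done with dd (the mount points and sizes below are placeholders for
illustration, not exactly what I ran):

  # Sequential ~1GB write to the UFS partition...
  dd if=/dev/zero of=/ufs/testfile bs=64k count=16384

  # ...and the same to a dataset on the zpool.
  dd if=/dev/zero of=/zpool/testfile bs=64k count=16384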

Perhaps others with experience using ZFS can chime in w.r.t. how they handle
NFS write performance for ZFS?
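(For anyone following along: the "SSD dedicated to the ZIL" mentioned in the
quoted text below means a separate log vdev on the pool. Roughly, with the
pool and device names here being placeholders:

  # Add a dedicated log device (SLOG) to the pool.
  zpool add tank log /dev/ada1p4

  # Remove it again, which is what Mikhail did for his test.
  zpool remove tank /dev/ada1p4
)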

rick
ps: What I said w.r.t. MAXBSIZE wasn't accurate for 10.2 and later.
    For those releases the constants you have to change are MAXBCACHESIZE
    and BKVASIZE.
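pps: Related to the rsize/wsize discussion quoted below: the I/O sizes a
     FreeBSD client actually uses are whatever is requested at mount time,
     clamped to the client's maximum, so asking for more than that has no
     effect. A sketch, with server and path names as placeholders:

       # Request 64K NFSv3 read/write sizes on a FreeBSD client.
       mount -t nfs -o nfsv3,rsize=65536,wsize=65536 server:/export /mnt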


> > Also, the NFS server was recently tweaked so that it could handle 128K
> > rsize/wsize, but the FreeBSD client is limited to MAXBSIZE and this has
> > not been increased beyond 64K.
> I just tried lowering ZFS' recordsize to 64k to match MAXBSIZE, but that
> didn't help NFS writing (unless sync is disabled, that is).
> > If this SSD is dedicated to the ZIL and is one known to have good write
> > performance, it should help, but in your case the SSD seems to be the
> > bottleneck.
> It is a chunk of an older SSD that also houses the OS. But it is
> usually idle, because executables and libraries are cached in the
> abundant RAM. I've seen it do 90+Mb/s (sequential)...
> 
> I just tried removing the ZIL from the receiving pool -- to force direct
> writes -- but it didn't help the case where the writes go over NFS.
> However, the local writes -- with reads from NFS -- went from the 56Mb/s
> I was seeing earlier to 90Mb/s!
> 
> There has got to be a better way to do this -- preferably, some
> self-tuning smarts... Thanks again. Yours,
> 
>     -mi
> 
> 

