NFS reads vs. writes

Tue Jan 5 05:19:36 UTC 2016

On Mon, 4 Jan 2016, Tom Curry wrote:

> On Mon, Jan 4, 2016 at 12:34 AM, Mikhail T. <mi+thun at aldan.algebra.com>
> wrote:
>
>> On 03.01.2016 20:37, Rick Macklem wrote:
>> ...
>> I just tried lowering ZFS' recordsize to 64k to match MAXBSIZE, but that
>> didn't help NFS-writing (unless sync is disabled, that is).
>>> If this SSD is dedicated to the ZIL and is one known to have good write
>>> performance, it should help, but in your case the SSD seems to be the
>>> bottleneck.
>> It is a chunk of an older SSD, that also houses the OS. But it is
>> usually idle, because executables and libraries are cached in the
>> abundant RAM. I've seen it do 90+Mb/s (sequential)...

Please be more careful with units (but don't use MiB's; I should killfile
that).  90 Mbits/s is still slow.

>> I just tried removing ZIL from the receiving pool -- to force direct
>> writes -- but it didn't help the case, where the writes go over NFS.
>
> I assume you mean you removed the SLOG from the pool, in which case you
> most definitely still have a ZIL; it's now located on the pool itself.
> Assuming you still have sync=standard I would venture a guess that writes
> over NFS would now be measured in KB/s.
>
>> However, the local writes -- with reads from NFS -- went from the 56Mb/s
>> I was seeing earlier to 90Mb/s!..

56 to 90 is not a large difference.  I think you mentioned factor of 10
differences earlier.

>> There has got to be a better way to do this -- preferably, some
>> self-tuning smarts... Thanks again. Yours,
>>
> There is no getting around the performance impact of a synchronous
> operation, whether it's NFS or a database log. If you don't believe me hop
> on your favorite Windows box, bring up the device manager and disable the
> write cache on its drive then run some benchmark supporting sync writes.
> One way to lessen the performance impact is to decrease the latency of
> writes, which is why SSD SLOGs help so much. Which brings me to my next
> point..

But nfs doesn't do sync writes.  As pointed out earlier in this thread,
it does cached writes that are not very different from what other file
systems do.  It writes up to wcommitsize bytes per file and then commits
them.

The default value for wcommitsize is undocumented but according
to the source code it is sqrt(hibufspace) * 256.  This gives about 2.5MB
on i386 with 1GB RAM and 17MB on amd64 with 24GB RAM.  This is not very
large, unless it is actually per-file and there is a backlog of many
files with this much uncommitted data -- then it is too large.
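
The formula can be checked numerically.  What follows is a minimal sketch
in Python, not the kernel code.  The two hibufspace values are hypothetical:
they were back-computed so that the formula reproduces the two figures
quoted above, and real values depend on how the kernel sizes its buffer
space.  The integer square root and the assumption that the result is in
bytes are also guesses.

```python
import math

MIB = 1024 ** 2

def default_wcommitsize(hibufspace):
    """sqrt(hibufspace) * 256, with an integer square root (an assumption)."""
    return math.isqrt(hibufspace) * 256

# Hypothetical hibufspace values in bytes, chosen only to match the text.
examples = (
    ("i386, 1GB RAM", 100 * MIB),
    ("amd64, 24GB RAM", 4624 * MIB),
)

for label, hibufspace in examples:
    size = default_wcommitsize(hibufspace)
    print(f"{label}: hibufspace {hibufspace // MIB}MB -> wcommitsize {size / MIB:.1f}MB")
```

Because of the square root, the default grows slowly: about 46 times more
buffer space in this sketch gives less than 7 times the commit size.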

In most file systems, the corresponding limit is per-fs or per-system.
On freefall, vfs.zfs.dirty_data_max is 2.5GB and vfs.hidirtybuffers
is 26502.  2.5GB seems too high to me.  It would take 25 seconds to
drain if it is for a single disk that can do 100MB/s.  26502 is too
high.  It is 1.6GB with the maximum block size of 64K, and it can
easily be for a single disk that is much slower than 100MB/s.  I often
see buffer cache delays of several seconds for backlogs of just a few
MB on a slow (DVD) disk.
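
The drain-time arithmetic above, as a short Python sketch.  The two limits
and the 100MB/s disk are the figures from the text.  The 10MB/s disk is a
hypothetical stand-in for "much slower than 100MB/s", and the model assumes
a steady write rate with nothing else competing for the disk.

```python
MB = 10 ** 6   # decimal units, as in "100MB/s"
GB = 10 ** 9

def drain_seconds(backlog_bytes, disk_bytes_per_sec):
    """Time to write out a backlog at a steady disk rate."""
    return backlog_bytes / disk_bytes_per_sec

dirty_data_max = 2.5 * GB            # vfs.zfs.dirty_data_max on freefall
hidirty_bytes = 26502 * 64 * 1024    # vfs.hidirtybuffers, all at the 64K maximum

print(f"dirty_data_max: {drain_seconds(dirty_data_max, 100 * MB):.0f}s at 100MB/s")
print(f"hidirtybuffers: {hidirty_bytes / 2 ** 30:.1f}GB if every buffer is 64K")
print(f"hidirtybuffers: {drain_seconds(hidirty_bytes, 100 * MB):.0f}s at 100MB/s")
# Hypothetical slow disk, not a measured value.
print(f"hidirtybuffers: {drain_seconds(hidirty_bytes, 10 * MB):.0f}s at 10MB/s")
```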

When nfs commits the data, it has to do a sync write.  Since wcommitsize
is large, this shouldn't be very slow unless the file is small, so that it
never gets anywhere near wcommitsize.
   (nfs apparently suffers from the same design errors as the buffer cache.
   Everything is per-file or per-vnode, so there is no way to combine reads
   or writes even if reads are ahead and writes are long delayed.  Caching
   in drives makes this problem not as large as it was 20-30 years ago, but
   it takes extra i/o's for the small i/o's and some drives have too low
   an i/o rate for their caching to help much).
The implementation might still be fairly stupid and wait for the sync
write to complete.  This is what seems to happen with ffs for the server
fs.  With most mistunings, I get about half of the server speed for nfs
(25MB/s).  The timing with wcommitsize = 25MB might be: accumulate 25MB
and send it to the server at line rate.  My network can only do about
70MB/sec so this takes 0.35 seconds.  Then wait for the server to do
a sync write.  My server can only do about 47MB/s so this takes 0.53
seconds.  Stall writes on the client waiting for the server to confirm
the commit.  Total time 0.88 seconds or 28MB/s.  Observed throughput
more like 25MB/s.  With everything async, I get 39MB/s today and 44MB/s
with slightly different configurations on other days.
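
The timing estimate above can be written as a small model.  This is only a
sketch of the accumulate, send, then wait-for-commit sequence described in
the text.  It assumes the two phases never overlap and ignores protocol
overhead; the function name is made up for illustration.

```python
def stop_and_wait(chunk_mb, net_mb_s, disk_mb_s):
    """Send one chunk at line rate, then stall until the server commits it."""
    send = chunk_mb / net_mb_s      # seconds on the wire
    commit = chunk_mb / disk_mb_s   # seconds for the server's sync write
    return send, commit, chunk_mb / (send + commit)

# Figures from the text: 25MB chunks, 70MB/s network, 47MB/s server disk.
send, commit, rate = stop_and_wait(25, 70, 47)
print(f"send {send:.2f}s + commit {commit:.2f}s = {send + commit:.2f}s, {rate:.1f}MB/s")

# The chunk size cancels out of this model.
for chunk_mb in (0.008, 25, 200):
    print(chunk_mb, round(stop_and_wait(chunk_mb, 70, 47)[2], 2))
```

In this model the rate is net * disk / (net + disk) whatever the chunk
size, so wcommitsize alone does not change the predicted throughput.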

2 interesting points turned up or were confirmed in my tests today:
- async on the server makes little difference for large files.  It was
   slightly slower if anything.  This is because the only i/o that I
   tested today was a case that I am usually not interested in -- large
   writes to a single file.  In this case, almost all of the writes are
   sync for the commit.  The possible reasons for async being slightly
   slower for committing are:
   - a larger backlog
   - bugs in vfs clustering -- some of its async conditions seem to be
     backwards.
- when the server is mounted fully sync, writing on the client is faster
   than on the server, even with the small application buffer size of
   512 on the client and a larger but not maximal buffer size on the
   server!  This is because writes on the client are basically cached.
   They are combined on the server up to a big wcommitsize and done
   with a big sync write, while on the server if the application writes
   512 at a time it gets sync writes 512 at a time (plus pre-reads of
   the fs block size at a time, but only 1 of these per multiple
   512-writes).
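
The second point comes down to how many sync writes the server's disk
sees.  A rough Python count, assuming a 128MB file and a 25MB wcommitsize
(both figures appear elsewhere in this message, but combining them here is
only for illustration) and assuming every commit covers a full
wcommitsize:

```python
def sync_write_count(file_bytes, bytes_per_sync_write):
    """Number of sync writes needed to cover the file (ceiling division)."""
    return -(-file_bytes // bytes_per_sync_write)

MIB = 1024 ** 2
FILE_BYTES = 128 * MIB

# Application writing 512 bytes at a time directly on a sync-mounted fs.
local = sync_write_count(FILE_BYTES, 512)

# Same application on an nfs client: writes are cached and committed in
# chunks of up to wcommitsize (25MB is an assumed value).
via_nfs = sync_write_count(FILE_BYTES, 25 * MIB)

print(f"local: {local} sync writes, via nfs: {via_nfs} commits")
```

The count ignores the pre-reads mentioned above and says nothing about how
long each sync write takes.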

It is easy to have a stupider implementation.  E.g., when nfs commits,
on the server don't give this any priority and get around to it 5-30
seconds later.  Or give it some priority but put it behind the local
backlog of 2.5GB or so.  Give this priority too, but it still takes a
long time since it is so large.  Don't tell the client about the
progress you are making (I think the nfs protocol doesn't have any
partially-committed states).  Maybe zfs is too smart about caching and
it interacts badly with nfs, and ffs interacts better because it is
not so smart.  (I don't even use ffs with soft updates because they
are too smart.)

It is not so easy to have a better implementation, though protocols like
zmodem and tcp have had one for 30-40 years.  Just stream small writes
to the server as fast as you can and let it nack them as fast as it
prefers to commit them to stable storage (never ack, but negative ack
for a problem).  Then if you want to commit a file, tell the server to
give the blocks for that file priority but don't wait for it to finish
before writing more.  Give some priority hints to minimize backlogs.
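
A rough way to see what streaming could buy, as a Python sketch.  It
compares the stall-on-commit behaviour with an idealized pipeline in which
sending and committing overlap completely, so the slower of the two sets
the rate.  Perfect overlap is an assumption that no real implementation
would quite reach; the 70MB/s and 47MB/s figures are the network and
server disk rates given earlier in this message.

```python
def stop_and_wait_rate(net_mb_s, disk_mb_s):
    """Client stalls for each commit: the two phases add up."""
    return net_mb_s * disk_mb_s / (net_mb_s + disk_mb_s)

def pipelined_rate(net_mb_s, disk_mb_s):
    """Idealized streaming: full overlap, limited by the slower side."""
    return min(net_mb_s, disk_mb_s)

net, disk = 70, 47
print(f"stop-and-wait: {stop_and_wait_rate(net, disk):.1f}MB/s")
print(f"pipelined:     {pipelined_rate(net, disk):.1f}MB/s")
```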

Changing wcommitsize between 8K and 200MB for testing with a 128MB file
made surprisingly little difference here.

> SSDs are so fast for three main reasons: low latency, large dram buffers,
> and parallel workloads. Only one of these is of any benefit (latency) as a
> SLOG. Unfortunately that particular metric is not usually advertised in
> consumer SSDs where the benchmarks they use to tout 90,000 random write
> iops consist of massively concurrent, highly compressible, short lived
> bursts of data. Add that drive as a SLOG and the onboard dram may as well
> not even exist, and queue depths count for nothing. It will be lucky to
> pull 2,000 IOPS. Once you start adding in ZFS features like checksums and
> compression, or network latency in the case of NFS that 2,000 number starts
> to drop even more.

Latency seems to be unimportant for a big commit.  It is important for
lots of smaller commits if the client (kernel or application) needs to
wait for just one of them.

Bruce