Fixes to allow write clustering of NFS writes from a FreeBSD
brde at optusnet.com.au
Sat Aug 27 11:21:29 UTC 2011
On Thu, 25 Aug 2011, Rick Macklem wrote:
> John Baldwin wrote:
>> That converted all the FILESYNC write RPCs from my builds into
>> UNSTABLE write RPCs. The patch for that is at
>> However, even with this change I was still not getting clustered
>> writes on
>> the NFS server (all writes were still 16k). After digging around in
>> the code for a bit I found that ffs will only cluster writes if the passed
>> 'ioflags' to ffs_write() specify a sequential hint. I then noticed
>> the NFS server has code to keep track of sequential I/O heuristics for
>> reads, but not writes. I took the code from the NFS server's read op
>> and moved it into a function to compute a sequential I/O heuristic that
>> could be shared by both reads and writes. I also updated the
>> heuristic code to advance the counter based on the number of 16k blocks
>> in each write instead of just doing ++ to match what we do for local
>> file writes in sequential_heuristic() in vfs_vnops.c. Using this did
>> give me some measure of NFS write clustering (though I can't peg my
>> disks at MAXPHYS the way a dd to a file on a local filesystem can).
>> The patch for these changes is at
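The heuristic described in the quoted text can be sketched roughly as
follows. This is an illustrative userland model, not the actual FreeBSD
identifiers; the struct, function name, and the SEQ_MAX clamp are all
made up here, but the advance-by-16k-units behaviour matches what is
described for sequential_heuristic() in vfs_vnops.c:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a shared sequential-I/O heuristic. Each file
 * handle remembers where the next request is expected to start; a
 * request that begins there advances a clamped "sequential score" by
 * one point per 16k transferred, and any other request resets it.
 */
struct seq_heur {
	uint64_t next_off;	/* expected start of the next request */
	int	 seqcount;	/* clamped sequential-access score */
};

#define SEQ_MAX	127		/* illustrative clamp */

/* Returns the updated sequential score for a request [off, off+len). */
int
seq_heuristic(struct seq_heur *sh, uint64_t off, uint64_t len)
{
	if (off == sh->next_off) {
		/* Sequential: credit one point per 16k, minimum 1. */
		int units = (int)((len + 16383) / 16384);
		sh->seqcount += units > 0 ? units : 1;
		if (sh->seqcount > SEQ_MAX)
			sh->seqcount = SEQ_MAX;
	} else {
		/* Non-sequential: reset the score. */
		sh->seqcount = 1;
	}
	sh->next_off = off + len;
	return (sh->seqcount);
}
```

The score would then be passed to VOP_WRITE() as the sequential hint in
'ioflags', the same way the read path already does.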
> The above says you understand this stuff and I don't. However, I will note
I only know so much about this part (I once actually understood it).
> that the asynchronous case, which starts the write RPC now, makes clustering
> difficult and limits what you can do. (I think it was done in the bad old
async as opposed to delayed is bad, but is mostly avoided anyway, at
least at the ffs and vfs levels on the server. This was a major optimization
by dyson about 15 years ago. I don't understand the sync/async/
delayed writes on the client at the nfs level. At least the old
nfsclient doesn't even call bawrite(), but it might do the equivalent
using a flag. On the server, nfs doesn't use any of bwrite/bawrite/bdwrite().
It just uses VOP_WRITE() which does whatever the server file system
does. Most file systems in FreeBSD use cluster_write() in most cases.
This is from 4.4BSD-Lite. It replaces an unconditional bawrite() in
Net/2 in the most usual case where the write is of exactly 1 fs-block
(usually starting with a larger write that is split up into fs-blocks
and a possible sub-block at the beginning and end only). cluster_write()
also has major optimizations by dyson. In the usual case it turns into
bdwrite(), to give a chance for a full cluster to accumulate, and in
most cases there would be little difference in the effects if the callers
were simplified to call bdwrite() directly. (The difference is just that
with cluster_write(), a write will occur as soon as a cluster forms,
while with bdwrite() a write will not occur until the next sync unless
the buffer cache is very dirty. bawrite() used to be used instead of
bdwrite() mainly to reduce pressure on the buffer cache. It was thought
that the end of a block was a good time to start writing. That was when
16 buffers containing 4K each was a lot of data :-). The next and last
major optimization in this area was to improve VOP_FSYNC() to handle a
large number of delayed writes better. It was changed to use
vfs_bio_awrite() where in 4.4BSD it used bawrite(). vfs_bio_awrite()
is closer to the implementation and has a better understanding of
clustering than bawrite(). I forget why bawrite() wasn't just replaced
by the internals of vfs_bio_awrite().
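The cluster_write()-vs-bdwrite() difference described above can be shown
with a toy model (nothing here is the real vfs_cluster.c logic; the
block and cluster sizes are just illustrative): delay each dirty block,
and issue one big write as soon as a full cluster has formed.

```c
#define FSBLK		16384		/* illustrative fs block size */
#define MAXCLUSTER	(128 * 1024)	/* illustrative cluster limit */

static int pending_blks;	/* contiguous delayed-dirty blocks */

/*
 * Toy model of cluster_write(): returns the number of bytes written
 * now, or 0 if the block was merely delayed (i.e., it behaved like a
 * plain bdwrite()). With plain bdwrite() the return would always be 0
 * and the data would wait for the next sync.
 */
int
toy_cluster_write(void)
{
	pending_blks++;
	if (pending_blks * FSBLK >= MAXCLUSTER) {
		int nbytes = pending_blks * FSBLK;
		pending_blks = 0;	/* one big write as soon as it forms */
		return (nbytes);
	}
	return (0);			/* just delay, like bdwrite() */
}
```

With 16k blocks and a 128k cluster limit, the first 7 writes are merely
delayed and the 8th triggers one 128k write.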
sync writes from nfs and O_SYNC from userland tend to defeat all of the
bawrite()/bdwrite() optimizations, by forcing a bwrite(). nfs defaults
to sync writes, so all it can do to use the optimizations is to do very
large sync writes which are split up into smaller delayed ones in a way
that doesn't interfere with clustering. I don't understand the details
of what it does.
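The split mentioned above (a large write decomposing into whole
fs-blocks plus possible partial sub-blocks at the beginning and end,
with only the whole blocks being candidates for clustering) can be
sketched as follows; the function and its interface are illustrative,
not a real kernel API:

```c
#include <stdint.h>

#define FSBLK	16384	/* illustrative fs block size */

/*
 * Decompose a write [off, off+len) into at most one partial block at
 * the front, whole fs-blocks in the middle, and one partial block at
 * the end. Returns the number of whole blocks and fills in the byte
 * counts of the partial head and tail pieces.
 */
int
split_write(uint64_t off, uint64_t len, uint64_t *head, uint64_t *tail)
{
	uint64_t end = off + len;
	uint64_t first_whole = (off + FSBLK - 1) / FSBLK * FSBLK; /* round up */
	uint64_t last_whole = end / FSBLK * FSBLK;		  /* round down */

	if (first_whole >= last_whole) {	/* no whole block inside */
		*head = len;
		*tail = 0;
		return (0);
	}
	*head = first_whole - off;
	*tail = end - last_whole;
	return ((int)((last_whole - first_whole) / FSBLK));
}
```

E.g. a 64k write starting at offset 1000 yields a 15384-byte head
piece, 3 whole 16k blocks, and a 1000-byte tail piece.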
> days to avoid flooding the buffer cache and then having things pushing writes
> back to get buffers. These days the buffer cache can be much bigger and it's
> easy to create kernel threads to do write backs at appropriate times. As such,
> I'd lean away from asynchronous (as in start the write now) and towards delayed
On FreeBSD servers, this is mostly handled already by using
cluster_write(). Buffer cache pressure is still difficult to handle
though. I saw it having bad effects mainly in my silly benchmark for
this nfs server clustering optimization, of writing 1GB. The buffer
cache would fill up with dirty buffers which take too long to write
(1000-2000 dirty ones out of 8000. 2000 of size 16K each is 32MB.
These take 0.5-1 seconds to write). While they were being written,
the nfsclient has to stop sending (it shouldn't stop until the buffer
cache is completely full but it does). Any stoppage gives under-utilization
of the network, and my network has just enough bandwidth to keep up
with the disk. Stopping for a short time wouldn't be bad, but for
some reason it didn't restart soon enough to keep the writes streaming.
I didn't see this when I repeated the benchmark yesterday. I must
have done some tuning to reduce the problem, but forget what it was.
I would start looking for it near the buf_dirty_count_severe() test
in ffs_write(). This defeats clustering and may be too aggressive or
mistuned. What I don't like about this is that when severe buffer
cache pressure develops, using bawrite() instead of cluster_write()
tends to increase the pressure, by writing new dirty buffers at half
the speed. I never saw any problems from the buffer cache pressure
with local disks (except for writing to DVDs, writes often stall near
getblk() for several seconds).
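The drain-time figures quoted above check out under an assumed disk
streaming rate in the 32-64 MB/s range (the rate is my assumption here,
implied by the 0.5-1 second figure, not stated in the original):

```c
/*
 * Back-of-envelope check: milliseconds to write nbuf dirty buffers of
 * bufsz bytes each at rate_mbps megabytes per second.
 */
int
drain_time_ms(int nbuf, int bufsz, int rate_mbps)
{
	long total_mb = (long)nbuf * bufsz / (1024 * 1024);

	return ((int)(total_mb * 1000 / rate_mbps));
}
```

2000 buffers of 16K each is about 31 MB, i.e. roughly 0.5 s at 64 MB/s
and roughly 1 s at 32 MB/s, matching the stall lengths described.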
> If the writes are delayed "bdwrite()" then I think it is much easier
> to find contiguous dirty buffers to do as one write RPC. However, if
> you just do bdwrite()s, there tends to be big bursts of write RPCs when
> the syncer does its thing, unless kernel threads are working through the
> cache doing write backs.
It might not matter a lot (except on large-latency links) what the client
does. MTUs of only 1500 are still too common, so there is a lot of
reassembly of blocks at the network level. A bit more at the RPC and
(both client and server) block level won't matter provided you don't
synchronize after every piece.
Hmm, those bursts on the client aren't so good, and may explain why
the client stalled in my tests. At least the old nfs client never
uses either cluster_write() or vfs_bio_awrite() (or bawrite()). I
don't understand why, but if it uses bdwrite() when it should use
cluster_write() then it won't have the advantage of cluster_write()
over bdwrite() -- of writing as soon as a cluster forms. It does use
B_CLUSTEROK. I think this mainly causes clustering to work when all
the delayed-write buffers are written eventually. Now I don't see
much point in using either delayed writes or clustering on the client.
Clustering is needed for non-solid-state disks mainly because their seek
time is so large. Larger blocks are only good for their secondary
effects of reducing overheads and latency.
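Why clustering matters so much for seek-bound disks follows from a
simple model; the 8 ms positioning time and 64 MB/s media rate below
are illustrative numbers of mine, not measurements from the benchmark:

```c
/*
 * Effective throughput in KB/s when each write costs a fixed
 * positioning time (seek_us microseconds) plus transfer time at
 * media_kbps KB/s.
 */
int
eff_throughput_kbps(int io_kb, int seek_us, int media_kbps)
{
	long t_us = seek_us + (long)io_kb * 1000000 / media_kbps;

	return ((int)((long)io_kb * 1000000 / t_us));
}
```

With an 8 ms seek and a 64 MB/s media rate, 16k writes achieve only
about 1.9 MB/s, while 128k clustered writes achieve about 12.9 MB/s: a
roughly 6.6x win from an 8x larger I/O. On a solid-state disk seek_us
is near zero and the difference mostly disappears, leaving only the
secondary per-I/O overhead savings.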
> Since there are nfsiod threads, maybe these could scan for contiguous
> dirty buffers and start big write RPCs for them? If there was some time
> limit set for how long the buffer sits dirty before it gets a write started
> for it, that would avoid a burst caused by the syncer.
One of my tunings was to reduce the number of nfsiod's.
> Also, if you are lucky w.r.t. doing delayed writes for temporary files, the
> file gets deleted before the write-back.
In ffs, this is another optimization by dyson. Isn't it defeated by
sync writes from ffs? Is it possible for a file written on the client
to never reach the server? Even if the data doesn't, I think the
directory and inode creation should. Even for ffs mounted async, I
think there are writes of some metadata for deleted files, because
although the data blocks are dead, some metadata blocks like ones for
inodes are shared with other files and must have been dirtied by create
followed by delete, so they remain undead but are considered dirty
although their accumulated changes should be null. The writes are
just often coalesced by the delay, so instead of 1000 writes to the
same place for an inode that is created and deleted 500 times, you get
just 1 write for null changes at the end. My version of ffs_update()
has some optimizations to avoid writing null changes, but I think this
doesn't help here since it still sees the changes in-core as they
accumulate.
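The null-change idea can be sketched as: only push the in-core copy to
disk if it actually differs from the last on-disk image. The struct and
function below are hypothetical stand-ins, not the real ffs_update()
interface or the real dinode layout:

```c
/* Hypothetical, heavily trimmed stand-in for an on-disk inode. */
struct toy_dinode {
	int	nlink;
	long	size;
	long	mtime;
};

/*
 * Sketch of "avoid writing null changes": returns 1 if a write was
 * issued (modelled by copying incore over ondisk), 0 if the
 * accumulated changes cancelled out and the write could be skipped.
 */
int
toy_update(const struct toy_dinode *incore, struct toy_dinode *ondisk)
{
	if (incore->nlink == ondisk->nlink &&
	    incore->size == ondisk->size &&
	    incore->mtime == ondisk->mtime)
		return (0);	/* accumulated changes are null: skip */
	*ondisk = *incore;	/* otherwise write the block */
	return (1);
}
```

As the text notes, a check like this only helps if the comparison is
made against the on-disk image; comparing in-core state against itself
always sees the intermediate changes.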