(in)appropriate uses for MAXBSIZE

Bruce Evans brde at optusnet.com.au
Wed Apr 14 04:40:50 UTC 2010


On Sun, 11 Apr 2010, Rick Macklem wrote:

> On Sun, 11 Apr 2010, Bruce Evans wrote:
>
>> Er, the maximum size of buffers in the buffer cache is especially
>> irrelevant for nfs.  It is almost irrelevant for physical disks because
>> clustering normally increases the bulk transfer size to MAXPHYS.
>> Clustering takes a lot of CPU but doesn't affect the transfer rate much
>> unless there is not enough CPU.  It is even less relevant for network
>> i/o since there is a sort of reverse-clustering -- the buffers get split
>> up into tiny packets (normally 1500 bytes less some header bytes) at
>> the hardware level.  ...
>
> I've done a simple experiment on Mac OS X 10, where I tried different
> sizes for the read and write RPCs plus different amounts of
> read-ahead/write-behind, and found that the I/O rate increased linearly
> up to the max allowed by Mac OS X (MAXBSIZE == 128K) without
> read-ahead/write-behind.  With read-ahead/write-behind, the performance
> didn't increase at all until the RPC read/write size was reduced.
> (Solaris 10 now uses 256K by default and allows up to 1MB for the
> read/write RPC size, so they seem to think that large values work well.)
>
> When you start using a WAN environment, large read/write RPCs really
> help, from what I've seen, since they help fill the TCP pipe
> (bandwidth * latency between client<->server).
>
> I care much more about WAN performance than LAN performance w.r.t. this.

Indeed, I was only considering a LAN environment.  Especially on LANs
optimized for latency (50-100 us), nfs performance is poor for small
files, at least for the old nfs client, mainly because close-to-open
consistency defeats caching; it is not a problem for bulk transfers.
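
The pipe-filling arithmetic is worth spelling out.  Here is a
back-of-the-envelope sketch in C; the link numbers in it are
assumptions for illustration, not measurements from this thread:

#include <stdio.h>

int
main(void)
{
	/* Assumed example WAN link: 100 Mbit/s, 50 ms RTT. */
	double bw_bps = 100e6;			/* bandwidth in bits/second */
	double rtt_s = 0.050;			/* round-trip time in seconds */
	double bdp = bw_bps * rtt_s / 8;	/* bytes needed in flight */

	printf("pipe holds %.0f KB\n", bdp / 1024);	/* ~610 KB */
	/*
	 * With 32K RPCs, about 19 must be outstanding to fill the
	 * pipe; a 1MB RPC size fills it with a single outstanding
	 * RPC.  At LAN latencies of 50-100 us the product is only a
	 * few KB, which is why the RPC size barely matters there.
	 */
	return (0);
}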

> I am not sure what you were referring to w.r.t. clustering, but if you
> meant that the NFS client can easily do an RPC with a larger I/O size
> than the size of the buffer handed it by the buffer cache, I'd like to
> hear how that's done? (If not, then a bigger buffer from the buffer
> cache is what I need to do a larger I/O size in the RPC.)

Clustering is currently only for the local file system, at least for
the old nfs server.  nfs just does a VOP_READ() into its own buffer,
with ioflag set to indicate nfs's idea of the sequentialness.  (User
reads are similar, except that their uio destination is UIO_USERSPACE
instead of UIO_SYSSPACE, and their sequentialness is set generically
and thus not so well; the nfs setting isn't very good either.)  The
local file system then normally does a clustered read into a larger
buffer, with the sequentialness affecting mainly startup (per-file),
and virtually copies the results to the local file system's smaller
buffers.  VOP_READ() completes by physically copying the results to
nfs's buffer (using bcopy() for UIO_SYSSPACE and copyout() for
UIO_USERSPACE).  nfs can't easily get at the larger clustering buffers
or even the local file system's buffers.  It can more easily benefit
from a larger MAXBSIZE.
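
For concreteness, here is a condensed sketch of that read path.  It is
modeled loosely on ffs_read() but is not the real code; the function
name and block size are invented, while bread(), cluster_read(),
uiomove() and IO_SEQSHIFT are the real interfaces (check the tree for
the exact signatures):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <sys/ucred.h>
#include <sys/uio.h>
#include <sys/vnode.h>

/* Hypothetical local fs read loop; error handling is minimal. */
static int
toyfs_read(struct vnode *vp, struct uio *uio, int ioflag, u_quad_t filesize)
{
	struct buf *bp;
	daddr_t lbn;
	long bsize = 16 * 1024;		/* assumed fs block size */
	long blkoffset, n;
	int error, seqcount;

	/* The caller's sequentialness hint is encoded in ioflag. */
	seqcount = ioflag >> IO_SEQSHIFT;

	while (uio->uio_resid > 0 && (u_quad_t)uio->uio_offset < filesize) {
		lbn = uio->uio_offset / bsize;
		blkoffset = uio->uio_offset % bsize;
		n = bsize - blkoffset;
		if (n > uio->uio_resid)
			n = uio->uio_resid;
		if ((u_quad_t)uio->uio_offset + n > filesize)
			n = filesize - uio->uio_offset;
		if (seqcount > 1)
			/*
			 * Read through a large cluster buffer (up to
			 * MAXPHYS) and virtually copy into this file
			 * system's per-block buffers.
			 */
			error = cluster_read(vp, filesize, lbn, bsize,
			    NOCRED, uio->uio_resid, seqcount, &bp);
		else
			error = bread(vp, lbn, bsize, NOCRED, &bp);
		if (error != 0)
			return (error);
		/*
		 * The physical copy: uiomove() does bcopy() for a
		 * UIO_SYSSPACE destination (the nfs server's buffer)
		 * or copyout() for UIO_USERSPACE (a user read()).
		 */
		error = uiomove((char *)bp->b_data + blkoffset, n, uio);
		brelse(bp);
		if (error != 0)
			return (error);
	}
	return (0);
}
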
There is still the bcopy(), which takes a lot of CPU and memory bus
resources, but that is insignificant compared with WAN latency.  But
as I said in a related thread, even the current MAXBSIZE is too large
to use routinely, due to buffer cache fragmentation causing significant
latency problems, so any increase in MAXBSIZE and/or routine use of
buffers of that size needs to be accompanied by avoiding the
fragmentation.  Note that the fragmentation is avoided for the larger
clustering buffers by allocating them from a different pool.
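
That different pool is the pbuf pool.  From memory (much shortened;
check sys/kern/vfs_cluster.c for the current form), cluster_rbuild()
obtains its transfer buffer roughly like this:

	/*
	 * The cluster transfer buffer comes from the preallocated
	 * pbuf pool, not from buffer cache KVA, so building a
	 * MAXPHYS-sized cluster can't fragment the buffer cache's
	 * address space.
	 */
	bp = trypbuf(&cluster_pbuf_freecnt);
	if (bp == NULL)
		return (tbp);	/* fall back to the single block buffer */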

Bruce

