(in)appropriate uses for MAXBSIZE

Bruce Evans brde at optusnet.com.au
Sun Apr 11 02:56:07 UTC 2010


On Fri, 9 Apr 2010, Andriy Gapon wrote:

> on 09/04/2010 16:53 Rick Macklem said the following:
>>
>>
>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
>>
>>>
>>> Nowadays several questions could be asked about MAXBSIZE.
>>> - Will we have to consider increasing MAXBSIZE?  Provided ever
>>> increasing media
>>> sizes, typical filesystem sizes, typical file sizes (all that
>>> multimedia) and
>>> even media sector sizes.
>>
>> I would certainly like to see a larger MAXBSIZE for NFS. Solaris10
>> currently uses 128K as a default I/O size and allows up to 1MB.

Er, the maximum size of buffers in the buffer cache is especially
irrelevant for nfs.  It is almost irrelevant for physical disks because
clustering normally increases the bulk transfer size to MAXPHYS.
Clustering takes a lot of CPU but doesn't affect the transfer rate much
unless there is not enough CPU.  It is even less relevant for network
i/o since there is a sort of reverse-clustering -- the buffers get split
up into tiny packets (normally 1500 bytes less some header bytes) at
the hardware level.  Again a lot of CPU is involved doing the (reverse)
clustering, and again this doesn't affect the transfer rate much.
However, 1500 is so tiny that the reverse-clustering ratio of MAXBSIZE
to the packet size (65536/1500) is much larger than the normal
clustering ratio of MAXPHYS to MAXBSIZE (131072/65536), and the extra
CPU is more significant for network i/o.  (These aren't the actual
normal ratios, but the limits of those attainable by varying only the
block sizes under the file system's control.)  However, increasing the
network i/o size can make little difference to this problem -- it can
only increase the already-too-large reverse-clustering ratio, while
possibly reducing other reverse-clustering ratios (the others are for
assembling the nfs buffers from local file system buffers; the local
file system buffers are normally disassembled from pbuf size (MAXPHYS)
to file system block size (normally 16K); then conversion to nfs
buffers involves either a sort of clustering or reverse clustering
depending on the relative sizes of the buffers).  Larger gains would
come from increasing the network frame size rather than the i/o size:
tcp allows larger buffers at intermediate levels, but they still get
split up into frames at the hardware level, and only some networks
allow jumbo frames.
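To make the ratios above concrete, here is a trivial userland sketch of
the arithmetic.  The constants are the stock values discussed in this
thread (MAXBSIZE 64K, MAXPHYS 128K, a 1500-byte Ethernet payload),
hardcoded for illustration only, not read from a real kernel:

/*
 * Sketch of the clustering and reverse-clustering ratios.  The numbers
 * are assumed stock values; MAXPHYS and the MTU are configuration-
 * dependent on a real system.
 */
#include <stdio.h>

int
main(void)
{
	int maxbsize = 65536;	/* MAXBSIZE: largest buffer-cache buffer */
	int maxphys = 131072;	/* MAXPHYS: largest physical transfer */
	int mtu_payload = 1500;	/* typical Ethernet frame payload */

	/* Disk side: clustering glues buffers together up to MAXPHYS. */
	printf("clustering ratio:         %d/%d = %d\n",
	    maxphys, maxbsize, maxphys / maxbsize);

	/* Network side: buffers are chopped down into packets. */
	printf("reverse-clustering ratio: %d/%d = ~%d\n",
	    maxbsize, mtu_payload, maxbsize / mtu_payload);
	return (0);
}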

>> Using
>> larger I/O sizes for NFS is a simpler way to increase bulk data transfer
>> rate than more buffers and more aggressive read-ahead/write-behind.

I'm not sure about that.  Read-ahead and write-behind are already very
aggressive but seem not to be working right.  I use some patches by
Bjorn Groenwald (?) which make it work better for the old nfs implementation
(I haven't tried the experimental one).  The problems seem to be mainly
timing ones.  vfs clustering makes the buffer sizes almost irrelevant for
physical disks, but there are latency problems for the network i/o.
The latency problems seem to be larger for reads than for writes.  I
get best results by using the same size for network buffers as for local
buffers (16K).  This avoids 1 layer of buffer size changing (see above)
and using 16K-buffers avoids buffer kva fragmentation (see below).  I
saw little difference from changing the user buffer size, except that
small buffers tend to work better and the smallest (512-byte) buffers
may actually have worked best, I think by reducing latencies.
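For what it's worth, the 16K network buffer size above can be requested
at mount time on the client; a hypothetical invocation (server name and
paths are placeholders) would look something like:

	mount_nfs -r 16384 -w 16384 server:/export /mnt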

> I have lightly tested this under qemu.
> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
> I removed size > MAXBSIZE check in getblk (see a parallel thread "panic: getblk:
> size(%d) > MAXBSIZE(%d)").

Did you change the other known things that depend on this?  There is the
b_pages limit of MAXPHYS bytes which should be checked for in another
way, and the soft limits for hibufspace and lobufspace which only matter
under load conditions.
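As a rough illustration of the b_pages constraint, here is some
back-of-the-envelope arithmetic, assuming 4K pages and the stock 128K
MAXPHYS; the real limit lives in sys/buf.h and is enforced in the
kernel, this only shows why a 4*MAXBSIZE buffer needs more page
pointers than the array can hold:

/*
 * Page-count arithmetic for the b_pages[] limit (assumed constants).
 */
#include <stdio.h>

int
main(void)
{
	long page_size = 4096;		/* assumed page size */
	long maxphys = 131072;		/* assumed stock MAXPHYS */
	long bufsize = 4 * 65536;	/* the 4*MAXBSIZE bread in the test */

	long capacity = maxphys / page_size;
	long needed = (bufsize + page_size - 1) / page_size;

	printf("b_pages[] capacity: %ld pages; buffer needs: %ld pages\n",
	    capacity, needed);
	if (needed > capacity)
		printf("=> such a buffer cannot be described without "
		    "also growing MAXPHYS\n");
	return (0);
}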

> And I bumped MAXPHYS to 1MB.
>
> Some results.
> I got no panics, data was read correctly and system remained stable, which is good.
> But I observed reading process (dd bs=1m on avgfs) spending a lot of time sleeping
> on needsbuffer in getnewbuf.  needsbuffer value was VFS_BIO_NEED_ANY.
> Apparently there was some shortage of free buffers.
> Perhaps some limits/counts were incorrectly auto-tuned.

This is not surprising, since even 64K is 4 times too large to work
well.  Buffer sizes of larger than BKVASIZE (16K) always cause
fragmentation of buffer kva.  Recovering from fragmentation always
takes a lot of CPU, and if you are unlucky it will also take a lot of
real time (stalling waiting for free buffer kva).  Buffer sizes larger
than BKVASIZE also reduce the number of available buffers significantly
below the number of buffers configured.  This mainly takes a lot of
CPU to reconstitute buffers.  BKVASIZE being less than MAXBSIZE is a
hack to reduce the amount of kva statically allocated for buffers for
systems that cannot support enough kva to work right (mainly i386's).
It only works well when it is not actually used (when all buffers have
size <= BKVASIZE = 16K, as would be enforced by reducing MAXBSIZE to
BKVASIZE).  This hack and the complications to support it are bogus on
systems that support enough kva to work right.
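The shrinkage is easy to see with some toy arithmetic; nbuf below is a
made-up value (the real one is autotuned at boot) and fragmentation,
which makes things worse still, is ignored, so treat this only as a
sketch of the bookkeeping:

/*
 * Effective buffer count when buffers map more kva than BKVASIZE.
 */
#include <stdio.h>

int
main(void)
{
	long bkvasize = 16384;		/* BKVASIZE */
	long maxbsize = 65536;		/* MAXBSIZE */
	long nbuf = 8192;		/* hypothetical buffer count */

	long bufkva = nbuf * bkvasize;	/* kva reserved assuming BKVASIZE each */

	printf("buffers that fit at BKVASIZE each: %ld\n", bufkva / bkvasize);
	printf("buffers that fit at MAXBSIZE each: %ld\n", bufkva / maxbsize);
	printf("shrink factor: %ld\n", maxbsize / bkvasize);
	return (0);
}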

nfs buffers larger than 16K would exceed BKVASIZE.  This may have been
why nfs buffer sizes of 32K gave negative benefits.

Bruce

