svn commit: r251282 - head/sys/kern

Bruce Evans brde at optusnet.com.au
Sat Jun 15 19:56:05 UTC 2013


On Sat, 15 Jun 2013, Konstantin Belousov wrote:

> On Tue, Jun 04, 2013 at 06:14:49PM +1000, Bruce Evans wrote:
>> On Tue, 4 Jun 2013, Konstantin Belousov wrote:
>>
>>> On Mon, Jun 03, 2013 at 02:24:26AM -0700, Alfred Perlstein wrote:
>>>> On 6/3/13 12:55 AM, Konstantin Belousov wrote:
>>>>> On Sun, Jun 02, 2013 at 09:27:53PM -0700, Alfred Perlstein wrote:
>>>>>> Hey Konstaintin, shouldn't this be scaled against the actual amount of
>>>>>> KVA we have instead of an arbitrary limit?
>>>>> The commit changes the buffer cache to scale according to the available
>>>>> KVA, making the scaling less dumb.
>>>>>
>>>>> I do not understand what exactly do you want to do, please describe the
>>>>> algorithm you propose to implement instead of my change.
>>>>
>>>> Sure, how about deriving the hardcoded "32" from the maxkva a machine
>>>> can have?
>>>>
>>>> Is that possible?
>>> I do not see why this would be useful. Initially I thought about simply
>>> capping nbuf at 100000 without referencing any "memory". Then I realized
>>> that this would somewhat conflict with (unlikely) changes to the value
>>> of BKVASIZE due to "factor".
>>
>> The presence of BKVASIZE in 'factor' is a bug.  My version never had this
>> bug (see below for a patch).  The scaling should be to maximize nbuf,
>> subject to non-arbitrary limits on physical memory and kva, and now an
>> arbitrary limit of about 100000 / (BKVASIZE / 16384) on nbuf.  Your new
>> limit is arbitrary so it shouldn't affect nbuf depending on BKVASIZE.
>
> I disagree with the statement that the goal is to maximize nbuf. The
> buffer cache currently is nothing more than a header and i/o record for
> the set of the wired pages. For non-metadata on UFS, buffers do not map
> the pages into KVA, so it becomes purely an array of pointers to page
> and some additional bookkeeping.

Er, since dyson and I designed BKVASIZE with that goal, I know what its
goal is.

> I want to eventually break the coupling between size of the buffer map
> and the nbuf. Right now, typical population of the buffer map is around
> 20%, which means that we waste >= 100MB of KVA on 32bit machines, where
> the KVA is precious. I would also consider shrinking the nbufs much
> lower, but the cost of wiring and unwiring the pages for the buffer
> creation and reuse is the blocking point.

Yes, "some additional bookkeeping" is "a lot of additional bookkeeping"
when nbufs is low relative to the number of active disk blocks.  Small
block sizes expand the number of active disk blocks by a large factor.
E.g., 64 for ffs's default block size of 32K relative to msdosfs's
smallest block size of 512.

This reminds me that I tried to get dyson to implement a better kva
allocation scheme.  At a cost of dividing the nominal number of
buffers by a factor of about 5, but with the gain of avoiding all
fragmentation and all kva allocation overheads, small block sizes
down to PAGE_SIZE can have as much space allocated for them
(space = number of buffers of this size times block size) as large
blocks.  Use a power of 2 method.  Start with a desired value of nbuf
and sacrifice a large fraction of it; the numbers with NOMBSIZE = 16K
and PAGE_SIZE = 4K are:

     statically allocate kva for  nbuf/4 buffers of kvasize 64K each
     statically allocate kva for  nbuf/2 buffers of kvasize 32K
     statically allocate kva for  nbuf/1 buffers of kvasize 16K
     statically allocate kva for  2*nbuf buffers of kvasize 8K
     statically allocate kva for  4*nbuf buffers of kvasize 4K

Total allocations: 7.75*nbuf buffers with a total kvasize of 5*nbuf*16K.  To
avoid expanding total kvasize, reduce nbuf by a factor of 5.  This
doesn't work so well for fs block sizes of < 4K.  Allocate many more
than 4*nbuf buffers of size 4K to support them.  Expanding nbuf would
waste kva, but currently, expanding nbuf wastes 4 times as much kva
and also messes up secondary variables like the dirty buffer watermarks.
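
To make the tier arithmetic concrete, here is a minimal userland sketch
(NTIERS and the variable names are mine, made up for illustration; this
is not the buffer cache code).  It just tallies the 5 static allocations
above for NOMBSIZE = 16K and confirms the 7.75*nbuf and 5*nbuf*16K totals:

#include <stdio.h>

#define NOMBSIZE	(16 * 1024)	/* nominal buffer kva size from above */
#define NTIERS		5		/* 64K, 32K, 16K, 8K, 4K */

int
main(void)
{
	long nbuf = 10000;		/* example starting value */
	long kvasize, bufs, total_bufs = 0, total_kva = 0;
	int i;

	/*
	 * Tier i holds buffers of kvasize 4*NOMBSIZE / 2^i and gets
	 * (nbuf * 2^i) / 4 of them: nbuf/4 of 64K down to 4*nbuf of 4K.
	 */
	for (i = 0; i < NTIERS; i++) {
		kvasize = (4 * NOMBSIZE) >> i;
		bufs = (nbuf << i) / 4;
		total_bufs += bufs;
		total_kva += bufs * kvasize;
		printf("%7ld buffers of kvasize %3ldK\n", bufs, kvasize / 1024);
	}
	printf("total: %ld buffers (7.75*nbuf), %ldK of kva (5*nbuf*16K)\n",
	    total_bufs, total_kva / 1024);
	return (0);
}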

There is still the cost of mapping buffers into the allocated kva, but
with more buffers of smaller sizes there is less thrashing of the buffers
and so fewer remappings.

When dyson implemented BKVASIZE in 1996, the whole i386 kernel only had
256MB of kva, so fitting enough buffers into it was even harder than it
is now.  The i386
kernel kva size wasn't increased to its current 1GB until surprisingly
recently (1999).

> ...
>> BKVASIZE was originally 8KB.  I forget if nbuf was halved by not modifying
>> the scale factor when it was expanded to 16KB.  Probably not.  I used to
>> modify the scale factor to get twice as many as the default nbuf, but
>> once the default nbuf expanded to a few thousand it became large enough
>> for most purposes so I no longer do this.
> Now, with the default UFS block size being 32KB, it is effectively halved
> once more.

Yes, in a bad way for ffs.  When most block sizes are 32K, it is only
possible to use half of nbuf.  Fragmentation occurs if there are mixtures
of 32K-blocks and other block sizes.  Fragmentation wastes time (also
space, but no more than is already wasted statically).  BKVASIZE
should have been doubled to match the doubling of the default block size
   (its comment still hasn't caught up with the previous doubling of the
   default ffs block size, and still says that BKVASIZE is "2x the block
   size used by a normal UFS [sic] file system", and warns about the danger
   of making it too small),
but then file systems with smaller block sizes would be penalized.  The
result is similar to that given by my power of 2 method with 2 buffer
sizes:
     statically allocate kva for  nbuf/2 buffers of kvasize 32K
     statically allocate kva for  nbuf/1 buffers of kvasize 16K
except it uses 2/3 as many buffers and 1/2 as much kva as my method, at
a cost of complexity and fragmentation.
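
To spell out that arithmetic: the 2-size variant uses nbuf/2 + nbuf =
1.5*nbuf buffers and nbuf/2*32K + nbuf*16K = 2*nbuf*16K of kva, while
the current BKVASIZE = 16K scheme uses nbuf buffers and nbuf*16K of kva,
i.e., 2/3 of the buffers and 1/2 of the kva.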

Also note that with BKVASIZE = 32K, it is only a factor of 2 away from
MAXBSIZE = 64K (until that is increased), so you could increase BKVASIZE
by another factor of 2 and only halve nbuf again.  The complexity and
fragmentation then go away.

Increasing MAXBSIZE would cause interesting problems.  Fragmentation would
be severe if some block sizes are many more factors of 2 larger than
BKVASIZE.  If MAXBSIZE is really large (say 1MB), then you can't increase
BKVASIZE to it without wasting a really large amount of kva or reducing
nbuf really significantly, so dynamic sizing becomes necessary again, perhaps
even on 64-bit arches.  Neither MAXBSIZE nor BKVASIZE is a kernel option.
BKVASIZE should have been one from the beginning.  An optional MAXBSIZE
has much wider scope.  For example, systems with a larger MAXBSIZE can
create ffs file systems that cannot be mounted on systems with the
historical MAXBSIZE.

Bruce

