BKVASIZE for large block-size filesystems

Bruce Evans bde at zeta.org.au
Wed May 25 17:38:58 PDT 2005


On Wed, 25 May 2005, Sven Willenberger wrote:

> [originally posted to freebsd-stable, realized that some amd64-specific
> info may be needed here too]

It's not really amd64-specific, except through bugs: BKVASIZE and the
algorithms that use it are tuned for i386's, which gives mistuning on
arches that have more kernel virtual address space.

> FreeBSD 5.4-Stable amd64 on a dual-opteron system with LSI-Megaraid 400G+
> partition. The filesystem was created with: newfs -b 65536 -f 8192 -e
> 15835 /dev/amrd2s1d
>
> This is the data filesystem for a PostgreSQL database; as the default
> page size (files) is 8k, the above newfs scheme has 8k fragments which
> should fit nicely with the PostgreSQL page size. Now by default param.h

Fragments don't work very well.  It might be better to fit the files to
the block size.  If all files had size 8K, then -b 8192 -f 8192 would work
best (slightly better than -b 8192 -f 1024, and that slightly better than
the current defaults, and all much better than -b 65536 -f 8192).

> defines BKVASIZE as 16384 (which has been pointed out in other posts as
> being *not* twice the default blocksize of 16k). I have modified it to
> be set at 32768 but still see a high and increasing value of
> vfs.bufdefragcnt which makes sense given the blocksize of the major
> filesystem in use.

Yes, a block size larger than BKVASIZE will cause lots of fragmentation
of the buffer kva (hence the increasing vfs.bufdefragcnt).  I'm not sure
if this is still a large pessimization.

> My question is are there any caveats about increasing BKVASIZE to 65536?
> The system has 8G of RAM and I understand that nbufs decreases with
> increasing BKVASIZE;

The decrease in nbuf is a bug.  It defeats half of the point of increasing
BKVASIZE: if most buffers have size 64K, then increasing BKVASIZE from 16K
to 64K gives approximately nbuf/4 buffers, all of size 64K, instead of nbuf
buffers of which nbuf/4 have size 64K and the other 3*nbuf/4 are unusable.
Thus it avoids some resource wastage, at the cost of possibly not devoting
enough resources to effective caching.  However, little is lost if most
buffers have size 64K, since then the reduced nbuf consumes all of the kva
resources that we are willing to allocate.  The problem is when file
systems are mixed and the ones with a block size of 64K are used little or
not at all.  The worst case is when all blocks have size 512, which can
happen for msdosfs.  Then up to (BKVASIZE - 512) / BKVASIZE of the kva
resource is wasted (> 99% for BKVASIZE = 65536, but only about 97% for
BKVASIZE = 16384).
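
To put numbers on that, here is a throwaway userland program (a sketch
only; the baseline nbuf of 7000 is borrowed from the example figure later
in this message) that computes the usable 64K-buffer count and the
worst-case kva waste for both BKVASIZE values:

#include <stdio.h>

/*
 * Illustrative only: compare usable 64K buffers and worst-case kva
 * waste for BKVASIZE = 16K vs. 64K.  The baseline nbuf is assumed.
 */
int
main(void)
{
	long nbuf = 7000;		/* assumed baseline for BKVASIZE = 16K */
	long sizes[2] = { 16384, 65536 };

	for (int i = 0; i < 2; i++) {
		long bkvasize = sizes[i];
		/* The bug: nbuf scales inversely with BKVASIZE. */
		long scaled_nbuf = nbuf * 16384 / bkvasize;
		/* A 64K buffer consumes 64K/BKVASIZE slots of buffer kva. */
		long usable_64k = scaled_nbuf * bkvasize / 65536;
		/* Worst case: every block has size 512 (msdosfs). */
		double waste = (double)(bkvasize - 512) / bkvasize;

		printf("BKVASIZE %5ld: nbuf %4ld, usable 64K buffers %4ld, "
		    "worst-case kva waste %.1f%%\n",
		    bkvasize, scaled_nbuf, usable_64k, 100.0 * waste);
	}
	return (0);
}

Both configurations end up with the same nbuf/4 usable 64K buffers; what
differs is how much kva is at risk when small blocks dominate.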

To fix the bug, change BKVASIZE to 16384 in kern_vfs_bio_buffer_alloc()
(in sys/kern/vfs_bio.c) and consider adjusting the maxbcache tunable (see
below).
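
A minimal sketch of what that edit looks like (a fragment, with the
surrounding sizing code paraphrased rather than quoted from the 5.4
sources; only the change of divisor matters):

/*
 * In kern_vfs_bio_buffer_alloc(), nbuf is clamped using BKVASIZE.
 * Before: nbuf shrinks as BKVASIZE grows.
 */
if (maxbcache && nbuf > maxbcache / BKVASIZE)
	nbuf = maxbcache / BKVASIZE;

/* After: size nbuf as if BKVASIZE were still 16384. */
if (maxbcache && nbuf > maxbcache / 16384)
	nbuf = maxbcache / 16384;

Note that each of those nbuf buffers still reserves BKVASIZE bytes of kva,
so with the fix the total reservation grows by a factor of BKVASIZE/16384;
that is why maxbcache may need attention too.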

> how can I either determine if the resulting nbufs
> will be sufficient or calculate what is needed based on RAM and system
> usage?

nbuf is not directly visible except using a debugger, but vfs.maxbufspace
gives it indirectly -- divide the latter by BKVASIZE to get nbuf.  A few
thousand for it is plenty.
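
For example, a tiny program using sysctlbyname(3) does the division (a
sketch; vfs.maxbufspace is read into a long here, with a fallback in case
the kernel exports it as an int):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

#define BKVASIZE 16384		/* must match the kernel's value */

int
main(void)
{
	long maxbufspace = 0;
	size_t len = sizeof(maxbufspace);

	if (sysctlbyname("vfs.maxbufspace", &maxbufspace, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(vfs.maxbufspace)");
	if (len == sizeof(int))	/* int-sized export; little-endian assumed */
		maxbufspace = *(int *)&maxbufspace;
	printf("maxbufspace = %ld => nbuf = %ld\n",
	    maxbufspace, maxbufspace / BKVASIZE);
	return (0);
}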

I used to use BKVASIZE = 65536, and fixed the bug as above, and also doubled
nbuf in kern_vfs_bio_buffer_alloc(), and also configured VM_BCACHE_SIZE_MAX
to 512M so that the elevated nbuf was actually used, but the need for
significantly increasing the default nbuf (at least with BKVASIZE = 16384)
went away many years ago when memory sizes started exceeding 256M or so.
My doubling of nbuf broke a few years later when memory sizes started
exceeding 1GB.  i386's just don't have enough virtual address space to use
a really large nbuf, so when there is enough physical memory the default
nbuf is as large as possible.  I was only tuning BKVASIZE and
VM_BCACHE_SIZE_MAX to benchmark file systems with large block sizes, but
the performance with large block sizes was poor even with this tuning so
I lost interest in it.  Now I just use the defaults and the bug fix
reduces to a spelling change.  nbuf defaults to about 7000 on my machines
with 1GB of memory.  This is plenty.  With BKVASIZE = 64K and without the
fix, it would be 1/4 as much, which seems a little low.

nbuf is also limited by kernel virtual memory.  amd64's have more (I'm not
sure how much), and they should have so much more that the bcache part
is effectively infinite, but it is (or was) actually only twice as much
as on i386's (default VM_BCACHE_SIZE_MAX = 200MB on i386's and 400MB
on amd64's).  Even i386's can spare more, provided the memory is not
needed for other things, e.g., networking.  The default of 400MB on
amd64's combined with BKVASIZE = 64K gives a limit on nbuf of 400MB/64K =
6400, which is plenty, so you shouldn't need to change the maxbcache
tunable.
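
The same arithmetic, spelled out (the VM_BCACHE_SIZE_MAX values are the
assumed defaults from above):

#include <stdio.h>

/* nbuf ceilings implied by VM_BCACHE_SIZE_MAX with BKVASIZE = 64K. */
int
main(void)
{
	long bkvasize = 65536;

	printf("i386:  nbuf limit = %ld\n", 200L * 1024 * 1024 / bkvasize);
	printf("amd64: nbuf limit = %ld\n", 400L * 1024 * 1024 / bkvasize);
	return (0);	/* prints 3200 and 6400 */
}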

> Also, will increasing BKVASIZE require a complete make buildworld or, if
> not, how can I remake the portions of system affected by BKVASIZE?

It's not a properly supported option, so the way to change it is to
edit it in the sys/param.h source file.  After changing it there,
everything will be rebuilt as necessary by makeworld and/or
rebuilding kernels.  Unfortunately, almost everything will be rebuilt
because too many things depend on sys/param.h.  When testing
changes to BKVASIZE, I used to cheat by preserving the timestamp of
sys/param.h and manually recompiling only the necessary things.  Very
little depends on BKVASIZE.  IIRC, there used to be 2 object files
per kernel, but now there is only 1 (vfs_bio.o).

Bruce

