question about sb->st_blksize in src/sys/kern/vfs_vnops.c

Thierry Herbelot thierry.herbelot at laposte.net
Sat Oct 25 15:05:45 UTC 2008


On Saturday 25 October 2008, Bruce Evans wrote:
> On Fri, 24 Oct 2008, Thierry Herbelot wrote:
> > the [SUBJ] file contains the following extract (around line 705) :
> >
> >     * Default to PAGE_SIZE after much discussion.
> >     * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct.
> >     */
> >
> >    sb->st_blksize = PAGE_SIZE;
> >
> > which arrived around four years ago, with revision 1.211 (see
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)
>
> Indeed, this was completely broken long ago (in 1.211).  Before then, and
> after 1.128, some cases worked as intended if not perfectly:
> - regular files: file systems still set va_blksize to their idea of the
>    best i/o size (normally to the file system block size, which is
>    normally larger than PAGE_SIZE and probably better in all cases) and
>    this was used here.  However, for regular files, the fs block size
>    and the application's i/o size are almost irrelevant in most cases
>    due to vfs clustering.  Most large i/o's are done physically with
>    the cluster size (which due to a related bug suite ends up being
>    hard-coded to MAXPHYS (128K) at a minor cost when this is different
>    from the best size).
> - disk files: non-broken device drivers set si_iosize_best to their idea
>    of the best i/o size (normally to the max i/o size, which is normally
>    better than PAGE_SIZE) and this was used here.  The bogus default
>    of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it
>    was for the buffer cache implementation for block devices which no
>    longer exist and was too small for them anyway).
> - non-disk character-special files: the default of PAGE_SIZE was used.
>    The comment about defaulting to PAGE_SIZE was added in 1.128 and is
>    mainly for this case.  Now the comment is nonsense since the value is
>    fixed, not a default.
> - other file types (fifos, pipes, sockets, ...): these got the default of
>    PAGE_SIZE too.
>
> In rev.1.1, st_blksize was set to va_blksize in all cases.  So file systems
> were supposed to set va_blksize reasonably in all cases, but this is not
> easy and they did nothing good except for regular files.

agreed, anyway the comment by phk about using ioctl(DIOCGSECTORSIZE) applies.
>
> Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS
> (64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for
> disks. This gave nonsense like 64K buffers for slow tty devices (keyboards)
> and 2K buffers for fast disks.  At least for programs that trust st_blksize
> to be reasonable.  Fortunately, st_blksize is rarely used...
>
> > the net effect of this change is to decrease the block buffer size used
> > in libc/stdio from 16 kbytes (derived from the underlying ufs partition)
> > to PAGE_SIZE == 4 kbytes (fixed value), and consequently the I/O bandwidth
> > is lowered (this is on a slow Flash).
>
> ... except it is used by stdio.  (Another mess here is that stdio mostly
> doesn't use its own BUFSIZ.  It trusts st_blksize if fstat() to determine

This is indeed what I saw, meandering between the libc and the vfs part of the 
kernel.

In fact, I was essentially wondering whether st_blksize was used *elsewhere*,
and whether bumping the value could break some memory allocation ...

> st_blksize works.  Of course, the existence of BUFSIZ is a related
> historical mistake -- no fixed size can work best for all cases.  But
> when BUFSIZ is used, it is an even worse default than PAGE_SIZE.)

(as it is even smaller ?)
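
To compare the two values on a given system, a minimal user-space sketch
follows. It only reports what fstat() returns for a path next to the
compile-time BUFSIZ; it does not reproduce what libc does internally. The
helper names path_blksize() and print_blksize() are made up for this
illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return the st_blksize that fstat() reports for `path`,
 * or -1 if the path cannot be opened or fstat() fails.
 * (Hypothetical helper, not part of any library.)
 */
static long
path_blksize(const char *path)
{
	struct stat sb;
	int fd, error;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);
	error = fstat(fd, &sb);
	close(fd);
	if (error != 0)
		return (-1);
	return ((long)sb.st_blksize);
}

/* Print the kernel's hint next to stdio's compile-time BUFSIZ. */
static int
print_blksize(const char *path)
{
	long blksize = path_blksize(path);

	if (blksize < 0)
		return (-1);
	printf("%s: st_blksize=%ld BUFSIZ=%d\n", path, blksize, (int)BUFSIZ);
	return (0);
}
```

Calling print_blksize() on a regular file, a disk device and a tty shows
which value the kernel hands out for each file type on the running system.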
>
> It's interesting that you can see the difference.  Clustering is especially
> good for hiding slowness on slow devices.  Maybe you are using a
> configuration that makes clustering ineffective.  Mounting the file system
> with -o sync or equivalently, doing a sync after every (too-small) write
> would do it. Otherwise, writes are normally delayed until the next cluster
> boundary.

My use case is for small (buffered) writes to a file between 4 kbytes and
16 kbytes.

For example, writing a 16-kbyte file with an st_blksize of 4k is twice as slow
as with 16k (220 ms compared to 110 ms). The penalty is smaller for an 8-kbyte
file (105 ms vs 66 ms).
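
For an application that cannot wait for a kernel change, the stdio buffer
size can also be chosen per stream with setvbuf(). The sketch below is only
an illustration: open_buffered() and flushes_needed() are hypothetical
helpers, and the flush count is a rough model (one flush per full buffer),
not a prediction of the timings above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Rough model: number of buffer flushes needed to push `len` bytes
 * through a fully buffered stream with a `bufsize`-byte buffer
 * (ceiling division).  Real stdio implementations may behave differently.
 */
static size_t
flushes_needed(size_t len, size_t bufsize)
{
	assert(bufsize > 0);
	return ((len + bufsize - 1) / bufsize);
}

/*
 * Open `path` for writing with a caller-supplied, fully buffered
 * stdio buffer.  An explicit buffer is passed because, with a NULL
 * buffer, the C standard only says the size "may" be honoured.
 * `buf` must stay valid until the stream is closed.
 */
static FILE *
open_buffered(const char *path, char *buf, size_t bufsize)
{
	FILE *fp = fopen(path, "w");

	if (fp == NULL)
		return (NULL);
	/* setvbuf() must come before any other operation on the stream. */
	if (setvbuf(fp, buf, _IOFBF, bufsize) != 0) {
		fclose(fp);
		return (NULL);
	}
	return (fp);
}
```

Under this model a 16-kbyte file goes out in four flushes with a 4k buffer
and in one flush with a 16k buffer; an 8-kbyte file needs two flushes with a
4k buffer.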
>
> > I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE,
> > to revert to the block size previously used), and the kernel and world
> > seem to be running fine.
> >
> > Seeing the XXX comment above, I'm a bit worried about keeping this new
> > st_blksize value.
> >
> > are there any drawbacks with running with this bigger buffer size value ?
>
> Mostly it doesn't matter, since buffering (clustering) hides the
> differences.

(as seen before, mostly)

> Without clustering, 16K is a much better default for disks 
> than 4K, though not as good as the non-default va_blksize for regular
> files.  Newer disks might prefer 32K or 64k, but then the fs block size
> should also be increased from 16K.  Otherwise, increasing the block size
> usually reduces performance, by thrashing caches or increasing latencies. 
> With modern cache sizes and disk speeds, you won't see these effects for a
> block size of 64K, so defaulting to 64K would be reasonable for disks.  It
> would be silly for keyboards, but with modern memory sizes you would notice
> this even less than when it was 64K in old versions.

OK, thanks for the answer: I will subject the change to more stress tests and
hope to shake out any problems before putting it into production.

	TfH
>
> Bruce



