question about sb->st_blksize in src/sys/kern/vfs_vnops.c
Thierry Herbelot
thierry.herbelot at laposte.net
Sat Oct 25 15:05:45 UTC 2008
On Saturday 25 October 2008, Bruce Evans wrote:
> On Fri, 24 Oct 2008, Thierry Herbelot wrote:
> > the [SUBJ] file contains the following extract (around line 705) :
> >
> > * Default to PAGE_SIZE after much discussion.
> > * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct.
> > */
> >
> > sb->st_blksize = PAGE_SIZE;
> >
> > which arrived around four years ago, with revision 1.211 (see
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)
>
> Indeed, this was completely broken long ago (in 1.211). Before then, and
> after 1.128, some cases worked as intended if not perfectly:
> - regular files: file systems still set va_blksize to their idea of the
> best i/o size (normally to the file system block size, which is
> normally larger than PAGE_SIZE and probably better in all cases) and
> this was used here. However, for regular files, the fs block size
> and the application's i/o size are almost irrelevant in most cases
> due to vfs clustering. Most large i/o's are done physically with
> the cluster size (which due to a related bug suite ends up being
> hard-coded to MAXPHYS (128K) at a minor cost when this is different
> from the best size).
> - disk files: non-broken device drivers set si_iosize_best to their idea
> of the best i/o size (normally to the max i/o size, which is normally
> better than PAGE_SIZE) and this was used here. The bogus default
> of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it
> was for the buffer cache implementation for block devices which no
> longer exist and was too small for them anyway).
> - non-disk character-special files: the default of PAGE_SIZE was used.
> The comment about defaulting to PAGE_SIZE was added in 1.128 and is
> mainly for this case. Now the comment is nonsense since the value is
> fixed, not a default.
> - other file types (fifos, pipes, sockets, ...): these got the default of
> PAGE_SIZE too.
>
> In rev.1.1, st_blksize was set to va_blksize in all cases. So file systems
> were supposed to set va_blksize reasonably in all cases, but this is not
> easy and they did nothing good except for regular files.
agreed, anyway the comment by phk about using ioctl(DIOCGSECTORSIZE) applies.
>
> Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS
> (64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for
> disks. This gave nonsense like 64K buffers for slow tty devices (keyboards)
> and 2K buffers for fast disks. At least for programs that trust st_blksize
> to be reasonable. Fortunately, st_blksize is rarely used...
>
> > the net effect of this change is to decrease the block buffer size used
> > in libc/stdio from 16 kbytes (derived from the underlying ufs partition)
> > to PAGE_SIZE == 4 kbytes (fixed value), and consequently the I/O bandwidth
> > is lowered (this is on a slow Flash).
>
> ... except it is used by stdio. (Another mess here is that stdio mostly
> doesn't use its own BUFSIZ. It trusts st_blksize if fstat() to determine
This is indeed what I saw, meandering between the libc and the vfs part of the
kernel.
In fact, I was essentially wondering if st_blksize was used *elsewhere*, and
bumping the value could break some memory allocation ...
> st_blksize works. Of course, the existence of BUFSIZ is a related
> historical mistake -- no fixed size can work best for all cases. But
> when BUFSIZ is used, it is an even worse default than PAGE_SIZE.)
(as it is even smaller ?)
>
> It's interesting that you can see the difference. Clustering is especially
> good for hiding slowness on slow devices. Maybe you are using a
> configuration that makes clustering ineffective. Mounting the file system
> with -o sync or equivalently, doing a sync after every (too-small) write
> would do it. Otherwise, writes are normally delayed until the next cluster
> boundary.
My use case is small (buffered) writes to a file of between 4 kbytes and
16 kbytes.
For example, writing a 16-kbyte file with an st_blksize of 4k is twice as slow
as with 16k (220 ms compared to 110 ms). The penalty is smaller for an 8-kbyte
file (105 ms vs 66 ms).
>
> > I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE,
> > to revert to the block size previously used), and the kernel and world
> > seem to be running fine.
> >
> > Seeing the XXX comment above, I'm a bit worried about keeping this new
> > st_blksize value.
> >
> > are there any drawbacks with running with this bigger buffer size value ?
>
> Mostly it doesn't matter, since buffering (clustering) hides the
> differences.
(as seen before, mostly)
> Without clustering, 16K is a much better default for disks
> than 4K, though not as good as the non-default va_blksize for regular
> files. Newer disks might prefer 32K or 64k, but then the fs block size
> should also be increased from 16K. Otherwise, increasing the block size
> usually reduces performance, by thrashing caches or increasing latencies.
> With modern cache sizes and disk speeds, you won't see these effects for a
> block size of 64K, so defaulting to 64K would be reasonable for disks. It
> would be silly for keyboards, but with modern memory sizes you would notice
> this even less than when it was that in old versions.
OK, thanks for the answer : I will submit the change to more stress tests and
hope to shake it all before putting it to production.
TfH
>
> Bruce