calling all fs experts

Sun Dec 11 20:24:25 UTC 2011

--- On Sun 11/12/11, Kostik Belousov <kostikbel at gmail.com> wrote:

> 
> If you wanted to get responses from experts only, sorry in
> advance.
>

I am no fs expert but just thought I'd mention some things
based on my playing with the BSD ext2fs ...

> The fs (AKA UFS) uses clustering provided by the block cache. The
> clustering code, mainly located in kern/vfs_cluster.c, coalesces
> sequences of reads or writes that target consecutive blocks into a
> single physical read or write of at most MAXPHYS bytes. The current
> definition of MAXPHYS is 128KB.
>
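
To make the coalescing idea above concrete, here is a minimal sketch in
Python rather than kernel C. It is not the logic in kern/vfs_cluster.c;
the function name `cluster`, its parameters, and the 16KB block size in
the usage line are assumptions made up for illustration. Only the 128KB
MAXPHYS value comes from the quoted text.

```python
MAXPHYS = 128 * 1024  # 128KB, the value given in the quoted text


def cluster(blocks, block_size, max_io=MAXPHYS):
    """Group block numbers, in issue order, into (start, count) runs.

    A run is extended only while the next block directly follows the
    previous one and the run stays within max_io bytes.
    Hypothetical helper, not a FreeBSD API.
    """
    max_blocks = max_io // block_size
    runs = []
    for b in blocks:
        if runs and b == runs[-1][0] + runs[-1][1] and runs[-1][1] < max_blocks:
            runs[-1][1] += 1        # consecutive block: grow the current run
        else:
            runs.append([b, 1])     # gap or size limit reached: start a new run
    return [tuple(r) for r in runs]


# With an assumed 16KB block size, one transfer holds at most 8 blocks.
print(cluster([0, 1, 2, 3, 10, 11], 16 * 1024))
```

With those assumptions, twenty consecutive blocks would be split into
runs of 8, 8 and 4 blocks, since a single transfer cannot exceed MAXPHYS.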

The clustering code is really cool and the idea is that it
gives UFS the advantages of an extent-based fs.
I haven't seen benchmarks for UFS2, but on ext2 it didn't
seem to work as well as it should.

One issue is that ext2 doesn't support fragments and, as
a consequence, ext2 will not use big block sizes. This is a
limitation in the ext2 design that UFS doesn't have, but
Linux's ext2fs still outperforms UFS in async mode (we do
shine in sync mode).

It was never clear exactly why this happens, but it would
appear there is a bottleneck in geom when writing many
contiguous blocks.

> Clustering allows the filesystem to improve the layout of the files
> by calling VOP_REALLOCBLKS() to redo the allocation to make the
> written sequence of blocks sequential if it is not.
> 
> Even if the file is not laid out ideally, or the i/o pattern is
> random, most writes scheduled are asynchronous, and for reads, the
> system tries to schedule read-aheads for some limited number of
> blocks. This allows the lower layers, i.e. geom and disk drivers, to
> optimize the i/o queue to coalesce requests that are consecutive on
> disk, but not on the queue.
> 
> BTW, some time ago I was interested in the effect on the
> fragmentation on UFS, due to some semi-abandoned patch, which could
> make the fragmentation worse. I wrote the tool that calculated the
> percentage of non-consecutive spots in the whole filesystem.
> Apparently, even under the hard load consisting of writing a lot of
> files under the megabytes in size, UFS managed to keep the number of
> spots under 2-3% on sufficiently free volume.
> 
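
The quoted text does not say how the tool defines a "non-consecutive
spot", so the following is only one possible reading, sketched in
Python: count the places where a file's next physical block does not
directly follow the previous one, as a percentage of all block-to-block
transitions. The function name and the sample block lists are made up
for illustration and are not measurements.

```python
def discontinuity_percent(files):
    """Percentage of block-to-block transitions that are not consecutive.

    files: iterable of lists, each holding one file's physical block
    numbers in logical order. Hypothetical metric; the tool mentioned
    in the quoted text may define its percentage differently.
    """
    breaks = 0
    transitions = 0
    for blocks in files:
        for prev, cur in zip(blocks, blocks[1:]):
            transitions += 1
            if cur != prev + 1:
                breaks += 1         # next block is not physically adjacent
    return 100.0 * breaks / transitions if transitions else 0.0


# Two made-up files: the second has one jump (201 -> 500).
sample = [[100, 101, 102, 103], [200, 201, 500, 501]]
print(round(discontinuity_percent(sample), 1))
```

Here one of the six transitions is a jump, so the sketch reports
roughly 16.7% for this tiny sample.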

Yes, the realloc_blk code is very effective at that. In fact
it is so good it actually hides some inefficient operations
in UFS. Bruce had a patch for this that I cc'd to Kirk but
the difference was not big because the realloc_blk code does
its job in memory.

Zheng Liu did the reallocation thing for ext2fs and it gave
better results than preallocation but the results are not
as spectacular as in UFS (the UFS code takes advantage of
fragments there too). I do expect to commit it (kern/159233)
once my mentor reviews and approves it.

cheers,

Pedro.