Writing contiguously to UFS2?

Bruce Evans brde at optusnet.com.au
Sat Sep 22 05:37:28 PDT 2007


On Sat, 22 Sep 2007, Bruce Evans wrote:

> The current (and very old?) allocation policy for extending files is
> to consider allocating the block in a new cg when the current cg has
> more than fs_maxbpg blocks allocated in it (newfs and tunefs parameter
> -e maxbpg: default 1/4 of the number of blocks in a cg = bpg).  Then
> preference is given to the next cg with more than the average number
> of free blocks.  This seems to be buggy.  From ffs_blkpref_ufs1():

Actually, it is almost as good as possible.

Note that ffs_blkpref_*() only gives a preference, so it shouldn't try
too hard.  Also, it is only used if block reallocation is enabled (the
default: sysctl vfs.ffs.doreallocblks=1).  Then it gives delayed
allocation.  Block reallocation generally does a good job of reallocating
blocks contiguously.  I don't know exactly where/how the allocation
is done if block reallocation is not enabled, but certainly, maxbpg
is not used then.
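
For reference, a minimal userland sketch (mine, not from the tree) that
reads that sysctl via sysctlbyname(3) to see whether block reallocation
is enabled:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int val;
	size_t len = sizeof(val);

	/* Query the current value; pass a non-NULL newp to change it. */
	if (sysctlbyname("vfs.ffs.doreallocblks", &val, &len, NULL, 0) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("vfs.ffs.doreallocblks = %d\n", val);
	return (0);
}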

> % 	if (indx % fs->fs_maxbpg == 0 || bap[indx - 1] == 0) {
> % 		if (lbn < NDADDR + NINDIR(fs)) {
> % 			cg = ino_to_cg(fs, ip->i_number);
> % 			return (cgbase(fs, cg) + fs->fs_frag);
> % 		}
>
> I think "indx" here is the index into an array of block pointers in
> the inode or an indirect block.

This is correct.

> So for extending large files it is
> always into an indirect block.  It gets reset to 0 for each new indirect
> block.  This makes its use in (indx % fs->fs_maxbpg == 0) dubious.
> The condition is satisfied whenever:
> - indx == 0, i.e., always at the start of a new indirect block.  Not
>  too bad, but not what we want if fs_maxbpg is much larger than the
>  number of indexes in an indirect block.

Actually, this case is handled quite well later.

> - indx == nonzero multiple of the number of indexes in an indirect block.
>  This condition is never satisfied if fs_maxbpg is larger than
>  the number of indexes in an indirect block.  This is the usual case
>  for ffs2 (only 2K indexes in 16K-blocks, and fairly large cg's).  On
>  an ffs1 fs that I have handy, maxbpg is 2K and the number of indexes
>  is 4K, so this condition is satisfied once.

This case is not handled well.  The bug for it is mainly in newfs.
From newfs.c:

% /*
%  * MAXBLKPG determines the maximum number of data blocks which are
%  * placed in a single cylinder group. The default is one indirect
%  * block worth of data blocks.
%  */
% #define MAXBLKPG(bsize)	((bsize) / sizeof(ufs2_daddr_t))

The comment is correct, but the code is wrong for ffs1: there MAXBLKPG
defaults to only half an indirect block's worth of data blocks.  I just
use the default, so my ffs1 fs has maxbpg 2K instead of 4K.
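
The arithmetic, with the 16K block size and the pointer sizes hard-coded
as assumptions rather than taken from the ufs headers (ufs1_daddr_t is
32 bits, ufs2_daddr_t is 64 bits):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int bsize = 16384;
	int nindir1 = bsize / (int)sizeof(int32_t);	/* ffs1 indirect block entries */
	int nindir2 = bsize / (int)sizeof(int64_t);	/* ffs2 indirect block entries */
	int maxblkpg = bsize / (int)sizeof(int64_t);	/* what the macro computes */

	/* Prints 4096, 2048, 2048: half an indirect block for ffs1. */
	printf("ffs1 NINDIR %d  ffs2 NINDIR %d  MAXBLKPG %d\n",
	    nindir1, nindir2, maxblkpg);
	return (0);
}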

> The (bap[indx - 1] == 0) condition causes a move to a new cg after
> every hole.

Actually, this case is handled well later.

> This may help by leaving space to fill in the hole, but
> it is wrong if the hole will never be filled in or is small.  This
> seems to be just a vestige of code that implemented the old rotdelay
> pessimization.

Actually, it is still needed for using bap[indx - 1] at the end of the function.

> Comments saying that we use fs_maxcontig near here
> are obviously vestiges of the pessimization.
>
> % 		/*
> % 		 * Find a cylinder with greater than average number of
> % 		 * unused data blocks.
> % 		 */
> % 		if (indx == 0 || bap[indx - 1] == 0)
> % 			startcg =
> % 			    ino_to_cg(fs, ip->i_number) + lbn / fs->fs_maxbpg;
>
> At the start of an indirect block, and after a hole, we don't know where
> the previous block was so we use the cg of the inode advanced by the
> estimate (lbn / fs->fs_maxbpg) of how far we have advanced from the cg
> of the inode.  I think this estimate is too primitive to work right even
> a small fraction of the time.  Adjustment factors related to the number
> of maxbpg's per block of indexes and the fullness of the disk seem to
> be required.  Keeping track of the cg of the previous block would be better.

Actually, this estimate works very well.  We _want_ to change to a new
cg after every maxbpg blocks.  The estimate gives the closest cg that
is possible if all the blocks are allocated as contiguously as we want.
If the disk is nearly full we will probably have to go further.  Starting
the search at the closest cg that we want gives a bias towards close
cg's that are not too close.
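
A quick sketch of the estimate, using numbers from the ffs1 layout below
and assuming the inode sits in cg 0: lbn 4108 with maxbpg 2048 gives a
starting preference of cg 0 + 2 = cg 2, which is consistent with the
skip of the second cg seen there.

#include <stdio.h>

int
main(void)
{
	int inocg = 0;		/* assumed: the inode is in cg 0 */
	int maxbpg = 2048;	/* the default for a 16K-block ffs1 fs */
	long lbn = 4108;	/* first lbn behind the double indirect block */

	/* The "primitive" estimate: prints startcg = 2. */
	printf("startcg = %ld\n", inocg + lbn / maxbpg);
	return (0);
}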

> % 		else
> % 			startcg = dtog(fs, bap[indx - 1]) + 1;
>
> Now there is no problem finding the cg of the previous block.  Note that
> we always add 1...
>
> % 		startcg %= fs->fs_ncg;
> % 		avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
> % 		for (cg = startcg; cg < fs->fs_ncg; cg++)
>
> ... so the search gives maximal non-preference to the cg of the previous
> block.  I think things would work much better if we considered the
> current cg, if any, first (current cg = one containing previous block), and
> we actually know that cg.  This would be easy to try -- just don't add 1.
> Also try not adding the bad estimate (lbn / fs->fs_maxbpg), so that the
> search starts at the inode's cg in some cases -- then previous cg's will
> be reconsidered but hopefully the average limit will prevent them being
> used.

Actually, adding 1 is correct in most cases.  Here we think we have just
allocated maxbpg blocks in the current cg, so we _want_ to advance to
the next cg.  The problem is that we don't really know that we have
allocated that many blocks.  We have lots of previous block numbers in
bap[] and could inspect many of them, but we only inspect the previous
one.  The corresponding code in 4.4BSD is better -- it inspects the
one some distance before the previous one.  The corresponding distance
here is maxbpg.  We could inspect the blocks at 1 previous and maxbpg
previous to quickly estimate if we have allocated all of the previous
maxbpg blocks in the same cylinder group.
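
A rough userland sketch of that two-point check, with a toy dtog()
(plain division by an assumed cg size) and invented block numbers; it
only shows the idea and is not a patch to ffs_blkpref_*():

#include <stdio.h>

#define	FRAGS_PER_CG	94000L	/* roughly the cg size, in blkno units, seen below */

/* Toy stand-in for dtog(fs, blkno). */
static long
toy_dtog(long blkno)
{
	return (blkno / FRAGS_PER_CG);
}

/*
 * Advance past the previous block's cg only if the entry maxbpg slots
 * back is in the same cg, i.e. it looks like we really did allocate
 * maxbpg blocks there; otherwise reconsider the current cg.
 */
static long
startcg_guess(const long *bap, int indx, int maxbpg)
{
	long prevcg = toy_dtog(bap[indx - 1]);

	if (indx >= maxbpg && bap[indx - maxbpg] != 0 &&
	    toy_dtog(bap[indx - maxbpg]) == prevcg)
		return (prevcg + 1);
	return (prevcg);
}

int
main(void)
{
	long bap[4096];
	int i;

	/* Pretend the whole map was allocated contiguously in cg 2. */
	for (i = 0; i < 4096; i++)
		bap[i] = 2 * FRAGS_PER_CG + 8L * i;
	printf("startcg = %ld\n", startcg_guess(bap, 4096, 2048));
	return (0);
}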

> Note that in the calculation of avgbfree, division by ncg gives a
> granularity of ncg, so there is an inertia of ncg blocks against moving
> to the next cg.  A too-large ncg is a feature here.

This feature shouldn't make much difference, but we don't want it if we
are certain that we have just allocated maxbpg blocks in a cg.
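
A tiny illustration of the granularity, with invented numbers (ncg =
100, not from the test fs): the integer average only moves once the
total free count changes by a full ncg.

#include <stdio.h>

int
main(void)
{
	int ncg = 100;
	long nbfree;

	/* 5099 and 5049 both give avgbfree 50; only 4999 drops it to 49. */
	for (nbfree = 5099; nbfree >= 4999; nbfree -= 50)
		printf("cs_nbfree %ld -> avgbfree %ld\n", nbfree, nbfree / ncg);
	return (0);
}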

Analysis of block layouts for a 200MB file shows no large problems in
this area, but some small ones.  This is with some problems already
fixed.  200MB is a bit small but gives data small enough to understand
easily.  The analysis is limited to ffs1 since I only have a layout-
printing program for that.  I don't use ffs2 and haven't fixed the
"some" problems for it.  Perhaps they are the ones that matter here.
(For what they are, see below.)

ffs1, no soft updates (all tests on an almost-new fs):

% fs_bsize = 16384
% fs_fsize = 2048
% 4:	lbn 0-11 blkno 1520-1615
% 	lbn [<1>indir]12-4107 blkno 1616-1623
% 	lbn 12-4107 blkno 1624-34391

Everything is perfectly contiguous until here.  Without my fixes, the first
indirect block in the middle tends to be allocated discontiguously.  Here
lbn's have size fs_bsize = 16K, and blkno's have size fs_fsize = 2K; "4:"
is just the inode number; "[<n>indir]" is an nth indirect block.
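
A quick check of the units, with the sizes from the listing hard-coded:
each 16K lbn covers 8 blkno's, so the 12 direct blocks should span 96
blkno's, which matches 1520-1615 above.

#include <stdio.h>

int
main(void)
{
	int bsize = 16384, fsize = 2048;
	int frags_per_block = bsize / fsize;

	printf("%d blkno's per lbn; lbn 0-11 -> %d blkno's (1615 - 1520 + 1 = %d)\n",
	    frags_per_block, 12 * frags_per_block, 1615 - 1520 + 1);
	return (0);
}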

% 	lbn [<2>indir]4108-16781323 blkno 189592-189599

Bug.  cg's have size about 94000 in blkno units.  We have skipped the
entire second cg.

% 	lbn [<1>indir]4108-8203 blkno 189600-189607
% 	lbn 4108-6155 blkno 189608-205991

All contiguous.

% 	lbn 6156-8203 blkno 283640-300023

This is from the newfs bug (default maxbpg = half an indirect block's
worth of data blocks).  Here we advance to the next cg halfway through
the indirect block.  The advance is only about 90000 blkno's so it
correctly doesn't skip a cg.
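
The arithmetic behind the break, with the numbers from the listing
hard-coded: lbn 6156 is entry 2048 of this indirect block, i.e. exactly
maxbpg with the buggy default, so the preference moves to a new cg there.

#include <stdio.h>

int
main(void)
{
	long first_lbn = 4108;	/* first lbn mapped by this indirect block */
	long break_lbn = 6156;	/* where the layout jumps to the next cg */
	int maxbpg = 16384 / 8;	/* the newfs default for a 16K-block fs */

	/* Prints 2048 for both. */
	printf("index at break = %ld, maxbpg = %d\n",
	    break_lbn - first_lbn, maxbpg);
	return (0);
}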

% 	lbn [<1>indir]8204-12299 blkno 377688-377695
% 	lbn 8204-10251 blkno 377696-394079
% 	lbn 10252-12299 blkno 471736-488119
% 	lbn [<1>indir]12300-16395 blkno 565784-565791
% 	lbn 12300-12799 blkno 565792-569791

The pattern continues with no problems except the default maxbpg being
too small.  This does almost what the OP wants -- with a huge disk,
even huge files fit in a few cg's (lots of cg's but few compared with
the total number).  With tunefs -e <maxbpg=bpg>, I think the layout
would be perfectly contiguous except for the skip after the first cg.

My fix is only for the first indirect block, so it doesn't make much
difference for large files.  With the default maxbpg, later indirect
blocks are always allocated in a new cg anyway.  Hopefully the
"primitive" estimate prevents this so that all indirect blocks have
a chance of being allocated contiguously, and other code cooperates
by not moving them.

ffs1, soft updates:

% fs_bsize = 16384
% fs_fsize = 2048
% 5:	lbn 0-11 blkno 34392-34487

For some reason, the file is started later in the first cg.

% 	lbn [<1>indir]12-4107 blkno 34488-34495
% 	lbn 12-4107 blkno 34496-67263

Contiguous.  Without my fix, soft updates seems to move the first indirect
block further away, and thus is noticeably slower.

% 	lbn [<2>indir]4108-16781323 blkno 285592-285599

Soft updates has skipped not just the second cg but the third one too.

% 	lbn [<1>indir]4108-8203 blkno 285600-285607
% 	lbn 4108-6155 blkno 285608-301991
% 	lbn 6156-8203 blkno 377688-394071
% 	lbn [<1>indir]8204-12299 blkno 471736-471743
% 	lbn 8204-10251 blkno 471744-488127
% 	lbn 10252-12299 blkno 565784-582167
% 	lbn [<1>indir]12300-16395 blkno 659832-659839
% 	lbn 12300-12799 blkno 659840-663839

The pattern continues (no more skips).

ffs1, no soft updates, maxbpg = 655360:

% fs_bsize = 16384
% fs_fsize = 2048
% 4:	lbn 0-11 blkno 1520-1615
% 	lbn [<1>indir]12-4107 blkno 1616-1623
% 	lbn 12-4107 blkno 1624-34391
% 	lbn [<2>indir]4108-16781323 blkno 95544-95551
% 	lbn [<1>indir]4108-8203 blkno 95552-95559
% 	lbn 4108-8203 blkno 95560-128327
% 	lbn [<1>indir]8204-12299 blkno 189592-189599
% 	lbn 8204-12299 blkno 189600-222367
% 	lbn [<1>indir]12300-16395 blkno 283640-283647
% 	lbn 12300-12799 blkno 283648-287647

The "primitive" estimate isn't helping -- a new cg is started for every
indirect block.

Bruce

