ext2fs now extremely slow

Tue Sep 28 20:19:21 UTC 2010

On Wed, 29 Sep 2010, Bruce Evans wrote:

> For benchmarks on ext2fs:
>
> Under FreeBSD-~5.2 rerun today:
> untar:     59.17 real
> tar:       19.52 real
>
> Under -current run today:
> untar:    101.16 real
> tar:      172.03 real
>
> So, -current is 8.8 times slower for tar, but only 1.7 times slower for
> untar.
> ...
> dumpe2fs seems to show a bizarre layout:
> % ...
> % Group 3: (Blocks 98304-131071)
> %   Backup superblock at 98304, Group descriptors at 98305-98305
> %   Block bitmap at 98306 (+2), Inode bitmap at 98307 (+3)
> %   Inode table at 98308-98816 (+4)
> %   6882 free blocks, 16288 free inodes, 0 directories
> %   Free blocks: 123207, 123209-123215, 123217-123223, 123225-123231, 123233-123239, 123241-123247, ...
>
> The last line was about 15000 characters long, and seems to have the
> following pattern except for the first free block:
>
>    1 block used (123208)
>    7 blocks free (123209-123215)
>    1 block used (123216)
>    7 blocks free (123217-123223)
>    1 block used ...
>    7 blocks free ...
>
> So it seems that only 1 block in every 8 is used, and there is a seek
> after every block.  This asks for an 8-fold reduction in throughput,
> and it seems to have got that and a bit more for reading although not
> for writing.  Even (or especially) with perfect hardware, it must give
> an 8-fold reduction.  And it is likely to give more, since it defeats
> vfs clustering by making all runs of contiguous blocks have length 1.
>
> Simple sequential allocation should be used unless the allocation policy
> and implementation are very good.

This works a bit better after zapping the 8-fold way:

% Index: ext2_alloc.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/fs/ext2fs/ext2_alloc.c,v
% retrieving revision 1.2
% diff -u -2 -r1.2 ext2_alloc.c
% --- ext2_alloc.c	1 Sep 2010 05:34:17 -0000	1.2
% +++ ext2_alloc.c	28 Sep 2010 19:12:46 -0000
% @@ -1,2 +1,5 @@
% +int bde_blkpref = 0;
% +int bde_alloc8 = 1;
% +
%  /*-
%   *  modified for Lites 1.1
% @@ -542,6 +545,12 @@
%  	   then set the goal to what we thought it should be
%  	*/
% +if (bde_blkpref == 0) {
%  	if(ip->i_next_alloc_block == lbn && ip->i_next_alloc_goal != 0)
%  		return ip->i_next_alloc_goal;
% +} else if (bde_blkpref == 1) {
% +	if(ip->i_next_alloc_block == lbn)
% +		return ip->i_next_alloc_goal;
% +} else
% +	return 0;
% 
%  	/* now check whether we were provided with an array that basically
% @@ -662,4 +671,5 @@
%  	 * block.
%  	 */
% +if (bde_alloc8 == 0) {
%  	if (bpref)
%  		start = dtogd(fs, bpref) / NBBY;
% @@ -679,4 +689,5 @@
%  		}
%  	}
% +}
% 
%  	bno = ext2_mapsearch(fs, bbp, bpref);

This gives an improvement of:

untar:    101.16 real -> 63.46
tar:      172.03 real -> 50.70

Now -current is only 1.1 times slower for untar and 2.6 times slower for
tar.

There must be a problem with bpref for things to have been so bad.  There
is some point to leaving a gap of 7 blocks for expansion, but the gap was
left even between blocks in a single file.

I don't have a userland program for displaying the layout produced by
ext2fs, but I have kernel printfs for it in several foofs_bmaparray()
functions.  Turning this on for ext2fs gives for 3 files:

% ino 231895: size 99982(25), lbn 0, bn 3704960-3704967, indir 913288-913295, runp 0
% ino 231895: size 99982(25), lbn 1, bn 913200-913287, indir 913288-913295, runp 10
% ino 231895: size 99982(25), lbn 12, bn 913296-913399, indir 913288-913295, runp 12

25 is the number of 4K blocks.  These should be allocated contiguously,
except for an indirect block in the middle.  (ffs also gets this wrong,
by allocating the indirect block far away.)  The above and the below
show bn's for the lbn 0's all nearby.  Then in all cases, the bn for
lbn 1 is far away.  For lbn1-lbn<end>, the allocation is perfectly
contiguous, except for the indirect block in the correct place in the
middle.

% ino 231880: size 82877(21), lbn 0, bn 3704848-3704855, indir 912224-912231, runp 0
% ino 231880: size 82877(21), lbn 1, bn 912136-912223, indir 912224-912231, runp 10
% ino 231880: size 82877(21), lbn 12, bn 912232-912303, indir 912224-912231, runp 8
% ino 231881: size 82343(21), lbn 0, bn 3704856-3704863, indir 912392-912399, runp 0
% ino 231881: size 82343(21), lbn 1, bn 912304-912391, indir 912392-912399, runp 10
% ino 231881: size 82343(21), lbn 12, bn 912400-912471, indir 912392-912399, runp 8

Same pattern for all files examined.  The last 2 have sequential ino's and
were probably created sequentially.  Everything is perfectly sequential
except for jumping back and forth between lbn0 and lbn1.  Perhaps bpref
(and/or the 'goal' variable) is working as intended to keep the lbn0's
together, but something fails so the bn's for all other lbn's are allocated
sequentially starting from the beginning of the disk (912K is much smaller
than 3704K).  Cylinder groups can't be working right either.

I haven't tried the bde_blkpref hack in the above.  It should kill bpref
completely so that there is no jump between lbn0 and lbn1, and break
cylinder group based allocation even better.  Setting bde_blkpref to 1
restores the bug that was present in ext2fs in FreeBSD between 1995 and
2010.  This bug gave sequential allocation starting at the beginning of
the disk in almost all cases, so map searches were slow and early groups
filled up before later groups were used at all.

Bruce