ext2 large_file

Bruce Evans bde at zeta.org.au
Mon Oct 31 20:31:45 PST 2005


On Mon, 31 Oct 2005, Ivan Voras wrote:

> On Tue, 1 Nov 2005, Bruce Evans wrote:
>
>> Unless the file system already has or had a large file.  Possible
>> workarounds:
>> (1) Boot Linux and create a large file.  Hopefully e2fsck only sets the
>>    flag so you only have to do this once.
>
> I did this but e2fsck doesn't set the flag. Fortunately, I found out that 
> e2fsprogs includes the "debugfs" utility, with which I manually set the flag.
>
> It works now!

Does e2fsck report the problem?
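
For anyone else who hits this: debugfs's "feature" command sets the
named feature flag(s) in the superblock, so a session along these
lines should set the flag by hand (the device name is only an
example):

% debugfs -w /dev/ad0s2
debugfs:  feature large_file
debugfs:  quit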

> ext2 filesystem access is still a bit slower than with WindowsXP with
> ext2+ext3 IFS driver (~20.5MB/s vs ~25MB/s). The reason I brought up this 
> subject is that I'm experimenting with using ext2 instead of msdosfs for 
> exchanging data between the systems in dual-boot configuration. Because ext2 
> large_file support works now, I think it's much safer and even somewhat 
> faster (less fragmentation! FreeBSD's msdosfs looks like it's
> pessimized for fragmentation!) to use instead.

Strangely enough, I first got interested in ext2fs under FreeBSD because
testing showed that it was faster than ffs in one configuration, and this
turned out to be mostly because of fragmentation:
- ext2fs under FreeBSD has a primitive block allocator that will give lots
   of fragmentation over the long term but is almost optimal in simple
   tests.  It doesn't really understand cylinder groups and just allocates
   the next free block, so in simple tests that create files in one process
   and never delete files, the layout is almost optimal (a sketch of this
   policy follows the list).  In particular, the layout is good after
   copying a large directory tree to a new file system.  You can see
   evidence of this using dumpe2fs -- it shows the first few cylinder
   groups full and the rest unused, where Linux would use all the groups
   fairly evenly.
- ffs at the time had a not very good block allocator that optimized for
   fragmentation of directories (optimized for this == pessimized for
   performance), so it gave very poor performance for large directory trees
   with small files.  My test was with the Linux src tree.  The FreeBSD
   ports tree would be pessimized more.  This has been fixed.  Now the
   problems in ffs's block allocator are more local.
- my test drive at the time (1997?) didn't have much caching, and this
   interacted badly with ffs's block allocator.  Even for sequentially
   created files, ffs likes to seek backwards to fill in fragments with
   small files, and the drive's cache size or caching algorithm apparently
   didn't like these backwards seeks although they are small.  ffs still
   does this, but drives' caches are now large enough for another physical
   access to usually not be needed to get back to the small files.  ffs's
   other known remaining allocation problems involve not allocating indirect
   blocks sequentially; this problem, or something related, is especially
   large for soft updates -- soft updates takes advantage of its delayed
   block allocation to put indirect blocks further away.  This used to
   cause a 10% performance penalty for a freshly laid out copy of /usr/src,
   but now with bigger drives and cache it is less noticeable.
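
As a concrete illustration of the first point, here is a minimal
sketch (hypothetical code, not the actual ext2fs allocator) of a
next-free-block policy driven by a single rotor and a block bitmap:

#include <stdint.h>

static uint32_t rotor;		/* next place to look for a free block */

/*
 * Return a free block number and mark it used, or (uint32_t)-1 if the
 * bitmap is full.  There is no notion of cylinder groups; the search
 * just continues from wherever the previous one ended, so
 * single-writer workloads that never delete get nearly sequential
 * layouts, while mixed create/delete workloads fragment badly.
 */
static uint32_t
alloc_next_free(uint8_t *bitmap, uint32_t nblocks)
{
	uint32_t i, b;

	for (i = 0; i < nblocks; i++) {
		b = (rotor + i) % nblocks;
		if ((bitmap[b / 8] & (1 << (b % 8))) == 0) {
			bitmap[b / 8] |= 1 << (b % 8);
			rotor = b + 1;	/* continue from here next time */
			return (b);
		}
	}
	return ((uint32_t)-1);
}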

I use the following to break the optimization for fragmentation in
msdosfs:

% Index: msdosfs_fat.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_fat.c,v
% retrieving revision 1.35
% diff -u -2 -r1.35 msdosfs_fat.c
% --- msdosfs_fat.c	29 Dec 2003 11:59:05 -0000	1.35
% +++ msdosfs_fat.c	26 Apr 2004 05:03:55 -0000
% @@ -68,4 +68,6 @@
%  #include <fs/msdosfs/fat.h>
% 
% +static int fat_allocpolicy = 1;
% +
%  /*
%   * Fat cache stats.
% @@ -759,4 +761,14 @@
%  	if (got)
%  		*got = count;
% +
% +	/*
% +	 * For internal use, cluster pmp->pm_nxtfree is not necessarily free
% +	 * but is the next place to look for a free cluster.  Perhaps this
% +	 * is the correct thing to pass to the next mount too.
% +	 */
% +	pmp->pm_nxtfree = start + count;
% +	if (pmp->pm_nxtfree > pmp->pm_maxcluster)
% +		pmp->pm_nxtfree = CLUST_FIRST;
% +
%  	return (0);
%  }
% @@ -796,9 +808,30 @@
%  		len = 0;
% 
% -	/*
% -	 * Start at a (pseudo) random place to maximize cluster runs
% -	 * under multiple writers.
% -	 */
% -	newst = random() % (pmp->pm_maxcluster + 1);
% +	switch (fat_allocpolicy) {
% +	case 0:
% +		newst = start;
% +		break;
% +	case 1:
% +		newst = pmp->pm_nxtfree;
% +		break;
% +	case 5:
% +		newst = (start == 0 ? pmp->pm_nxtfree : start);
% +		break;
% +	case 2:
% +		/* FALLTHROUGH */
% +	case 3:
% +		if (start != 0) {
% +			newst = fat_allocpolicy == 2 ? start : pmp->pm_nxtfree;
% +			break;
% +		}
% +		/* FALLTHROUGH */
% +	default:
% +		/*
% +		 * Start at a (pseudo) random place to maximize cluster runs
% +		 * under multiple writers.
% +		 */
% +		newst = random() % (pmp->pm_maxcluster + 1);
% +	}
% +
%  	foundl = 0;
%

Only fat_allocpolicy == 1 and its case in the switch statement are
needed here.  The other cases are for testing how bad alternative
simple allocators are.  Policy 1 gives the same primitive sequential
allocation as in Linux -- this works well for copying but not so well
when there is lots of file system activity with multiple concurrent
processes.  According to postmark, it is still much better than random
allocation with multiple processes (but more like 2 to 4 times better
than 10 times).  The fix for advancing pmp->pm_nxtfree might not be
needed.  IIRC, it is mostly part of a fix for passing pm_nxtfree
across mounts.
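
The policy is only a static variable in the patch, so changing it
means a rebuild.  If run-time switching were wanted, a read/write
sysctl could be bolted on; a minimal sketch (the oid name and its
placement under vfs are my own invention):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

static int fat_allocpolicy = 1;
SYSCTL_INT(_vfs, OID_AUTO, fat_allocpolicy, CTLFLAG_RW,
    &fat_allocpolicy, 0, "msdosfs cluster allocation policy (0-5)");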

With these and some other optimizations for msdosfs, and optimizations
and pessimizations for ext2fs, I get access times for a fresh copy
of 75% of /usr/src (all that will fit in VMIO on a system with 1GB --
source always fully cached):

% bde-current writing to IBM IC35L060AVV207-0 h: 24483060 57512700
% tar = tar
% srcs = "contrib crypto lib sys" in /i/src
% ---
% 
% ffs-16384-02048-1:
% tarcp /f srcs:                 50.93 real         0.22 user         6.68 sys
% tar cf /dev/zero srcs:         13.63 real         0.17 user         2.35 sys
% ffs-16384-02048-2:
% tarcp /f srcs:                 45.15 real         0.27 user         6.71 sys
% tar cf /dev/zero srcs:         14.99 real         0.20 user         2.33 sys
% ffs-16384-02048-as-1:
% tarcp /f srcs:                 21.91 real         0.38 user         4.54 sys
% tar cf /dev/zero srcs:         13.82 real         0.21 user         2.30 sys
% ffs-16384-02048-as-2:
% tarcp /f srcs:                 21.08 real         0.34 user         4.64 sys
% tar cf /dev/zero srcs:         15.24 real         0.15 user         2.41 sys
% ffs-16384-02048-su-1:
% tarcp /f srcs:                 42.25 real         0.37 user         4.87 sys
% tar cf /dev/zero srcs:         14.13 real         0.15 user         2.37 sys
% ffs-16384-02048-su-2:
% tarcp /f srcs:                 47.76 real         0.34 user         4.93 sys
% tar cf /dev/zero srcs:         16.25 real         0.16 user         2.38 sys
% 
% ext2fs-1024-1024:
% tarcp /f srcs:                108.68 real         0.30 user         7.99 sys
% tar cf /dev/zero srcs:         41.15 real         0.17 user         5.63 sys
% ext2fs-1024-1024-as:
% tarcp /f srcs:                 81.10 real         0.29 user         7.03 sys
% tar cf /dev/zero srcs:         41.57 real         0.19 user         5.62 sys
% ext2fs-4096-4096:
% tarcp /f srcs:                107.48 real         0.32 user         6.75 sys
% tar cf /dev/zero srcs:         27.37 real         0.08 user         3.00 sys
% ext2fs-4096-4096-as:
% tarcp /f srcs:                 61.87 real         0.34 user         5.72 sys
% tar cf /dev/zero srcs:         27.33 real         0.16 user         2.93 sys
% 
% msdosfs-4096-4096:
% tarcp /f srcs:                 41.53 real         0.48 user         8.69 sys
% tar cf /dev/zero srcs:         16.94 real         0.18 user         4.40 sys

Here the first 2 numbers attached to the fs name are the block and fragment
size; "as" means async mount and "su" means soft updates; the final number
for ffs is for ffs1 vs ffs2.
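
(tarcp is a local script; think of it as piping one tar into another,
something like "tar cf - $srcs | (cd $dst && tar xpf -)", so the first
line of each pair measures writing a fresh copy of the tree and the
second measures reading it back through tar to /dev/zero.)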

This shows the following points:
- soft updates (in this test) is now not much faster than ordinary
   (-nosync -noasync) mounts and is much slower than async mounts.  It
   used to be only 1.5 times slower than async mounts.  This test was
   run when bufdaemon was buggier than it is now and showed bufdaemon
   behaving badly under pressure, with only soft updates creating enough
   pressure to cause problems.
- soft updates is still about 5% slower for readback.  My kernel has changes
   to allocate indirect blocks sequentially, but only for ffs1 and I'm not
   sure if the fixes work for soft updates.
- msdosfs is competitive with non-async ffs provided it uses clustering and
   VMIO as in my version.  However, it cheats to get this -- its most
   important metadata, namely its FAT, is updated using delayed writes
   unless you mount with -sync.  -sync is thus needed to get near the
   same robustness as the default for ffs (see the mount example after
   this list).
- ext2fs is about twice as slow as the other 2 (worse for non-async writes).
   For async writes, this is partly because -async is not fully implemented.
   It is mostly because the block size is very small, and although this
   in principle only costs extra CPU for clustering, FreeBSD is optimized
   for ffs's default block size and does pessimal things with ext2fs's
   smaller sizes.
- Both msdosfs and ffs are as efficient as can be hoped for read-back:
   13 to 16 seconds for reading 340MB of small files is about 21-26MB/sec.
   This is on a drive with a max transfer rate of 45 or 55 MB/sec and
   not very fast (normal ATA 7200 rpm) seeks.  On active (fragmented) file
   systems you have to be lucky to get half of that.  On my active /usr,
   reading the same files takes 49 seconds.  This is on a drive with a
   max transfer rate of 36MB/sec.
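
The -sync mount mentioned above would be something like (device and
mount point are only examples):

% mount -t msdosfs -o sync /dev/ad0s1 /mnt/dos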

> I propose this patch to the mount_ext2fs manual page:

Someone else will have to look after this.  You might have to file a PR
so that it doesn't get lost.

Bruce

