quick fix for slow directory shrinking in ffs

Bruce Evans brde at optusnet.com.au
Tue May 17 10:26:43 UTC 2016


On Tue, 17 May 2016, Konstantin Belousov wrote:

> On Tue, May 17, 2016 at 07:54:27AM +1000, Bruce Evans wrote:
>> ffs does very slow shrinking of directories after removing some files
>> leaves unused blocks at the end, by always doing synchronous truncation.
>> ...
>> X Index: ufs_lookup.c
>> X ===================================================================
>> X --- ufs_lookup.c	(revision 299263)
>> X +++ ufs_lookup.c	(working copy)
>> X @@ -1131,9 +1131,9 @@
>> X  		if (tvp != NULL)
>> X  			VOP_UNLOCK(tvp, 0);
>> X  		error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff,
>> X -		    IO_NORMAL | IO_SYNC, cr);
>> X +		    IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr);
>> X  		if (error != 0)
>> X -			vprint("ufs_direnter: failted to truncate", dvp);
>> X +			vprint("ufs_direnter: failed to truncate", dvp);
>> X  #ifdef UFS_DIRHASH
>> X  		if (error == 0 && dp->i_dirhash != NULL)
>> X  			ufsdirhash_dirtrunc(dp, dp->i_endoff);
>
> The IO_SYNC flag, for non-journaled SU and any kind of non-SU mounts,
> only affects the new blocks allocation mode, and write-out mode for
> the last fragment. The truncation itself (for -J) is performed in the
> context of the truncating thread. The cg blocks, after the bits are
> set to free, are marked for delayed write (with the background write
> hack). The inode block is written according to the mount mode, ignoring
> IO_SYNC.

I don't see why you think that.  ffs_truncate() clearly honors IO_SYNC,
and testing shows that ffs with soft updates does precisely 7 extra
sync writes for directory compaction (where some of the 7 are probably
to sync previous activity).

I think it would be wrong to ignore IO_SYNC and use the mount mode for
inodes.  Async mounts still have that bug IIRC (I fixed it locally long
ago).  IO_SYNC is set if the file is open with O_SYNC, and the mount
mode must not override this.  I think ffs has no way of telling that this
particular IO_SYNC is not associated with O_SYNC.

> That is, for always fully populated directory files, I do not see how
> anything is changed by the patch.

This problem affects all 512-byte directory-block boundaries, which are
rarely block or even fragment boundaries.

Test program:

X mp=$(df . | grep -v Filesystem | sed 's/ .*//')
X echo $mp
X while :;
X do
X 	echo -n "start:   $(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X 
X 	touch $(jot 41 0)	# just over 512 bytes
X 	echo -n "touch:   $(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X 
X 	#
X 	# Async mounts are still broken in -current -- these rm's (but nothing
X 	# else here) cause sync writes (but just 1 for the 2 rm's).
X 	#
X 	# rm 39 40		# just under, but no truncation yet
X 	echo -n "rm:      $(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X 
X 	#
X 	# Another bug in async mounts makes the truncate for the compaction
X 	# triggered by this touch do an async write (with the fix that
X 	# stops it doing a sync write).
X 	#
X 	touch 39		# still under; this creation does the truncation
X 	echo -n "touch 39:$(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X 
X 	sleep 10
X 	echo
X done

I hope this uses a portable enough way to find the mount point.  This must
be run in an empty directory (or you have to adjust the sizes).

Results:
- async mount with fix: 1 sync write per iteration.  A bogus one triggered
   by the rm.  I only fixed this locally.  Remove the rm line so that the
   size stays slightly above 1024 bytes and there are 0 sync writes.
   There is also 1 async write triggered by the truncate.  This is another
   bug in async mounts which I have fixed locally.  All writes for async
   mounts should be delayed unless IO_SYNC forces them to be sync.

- soft updates: 7 sync writes per iteration all triggered by the final
   touch (which triggers the compaction).  Remove the rm line and there
   are again 0 sync writes. Sometimes there are 2-5 async writes between
   the loop iterations.  These might be for the loop too, since there
   are more of them than for async mounts.  (I left daemons running
   while testing this on the root file system.  Test on a completely
   idle fs to be sure.)

- no soft updates and no async mount: first touch does 3 sync writes, rm
   does 2 sync, last touch does 4 sync; 0 async writes.

The IO_SYNC for soft updates apparently turns all the previous writes for
the loop into sync ones.  It has to order them and wait for them and there
is no better way to wait than a sync write.  The ordering makes an
unnecessary sync write even more expensive for soft updates than for
other cases.

Some relevant code in ffs_truncate:

Y 	/*
Y 	 * Shorten the size of the file. If the file is not being
Y 	 * truncated to a block boundary, the contents of the
Y 	 * partial block following the end of the file must be
Y 	 * zero'ed in case it ever becomes accessible again because
Y 	 * of subsequent file growth. Directories however are not
Y 	 * zero'ed as they should grow back initialized to empty.
Y 	 */
Y 	offset = blkoff(fs, length);
Y 	if (offset == 0) {
Y 		ip->i_size = length;
Y 		DIP_SET(ip, i_size, length);
Y 	} else {
Y 		lbn = lblkno(fs, length);
Y 		flags |= BA_CLRBUF;
Y 		error = UFS_BALLOC(vp, length - 1, 1, cred, flags, &bp);
Y 		if (error) {
Y 			return (error);
Y 		}
Y 		/*
Y 		 * When we are doing soft updates and the UFS_BALLOC
Y 		 * above fills in a direct block hole with a full sized
Y 		 * block that will be truncated down to a fragment below,
Y 		 * we must flush out the block dependency with an FSYNC
Y 		 * so that we do not get a soft updates inconsistency
Y 		 * when we create the fragment below.
Y 		 */
Y 		if (DOINGSOFTDEP(vp) && lbn < NDADDR &&
Y 		    fragroundup(fs, blkoff(fs, length)) < fs->fs_bsize &&
Y 		    (error = ffs_syncvnode(vp, MNT_WAIT)) != 0)
Y 			return (error);
Y 		ip->i_size = length;
Y 		DIP_SET(ip, i_size, length);
Y 		size = blksize(fs, ip, lbn);
Y 		if (vp->v_type != VDIR)
Y 			bzero((char *)bp->b_data + offset,
Y 			    (u_int)(size - offset));
Y 		/* Kirk's code has reallocbuf(bp, size, 1) here */
Y 		allocbuf(bp, size);
Y 		if (bp->b_bufsize == fs->fs_bsize)
Y 			bp->b_flags |= B_CLUSTEROK;
Y 		if (flags & IO_SYNC)
Y 			bwrite(bp);
Y 		else
Y 			bawrite(bp);
Y 	}

I think we usually arrive here and honor the IO_SYNC flag.  This is correct.
Otherwise, we always do an async write, but that is wrong for async mounts.
Here is my old fix for this:

Z diff -u2 ffs_inode.c~ ffs_inode.c
Z --- ffs_inode.c~	Wed Apr  7 21:22:26 2004
Z +++ ffs_inode.c	Sat Mar 23 01:23:16 2013
Z @@ -345,4 +431,6 @@
Z  		if (flags & IO_SYNC)
Z  			bwrite(bp);
Z +		else if (DOINGASYNC(ovp))
Z +			bdwrite(bp);
Z  		else
Z  			bawrite(bp);

This fix must be sprinkled in most places where there is a bwrite()/
bawrite() decision.

Bruce


More information about the freebsd-fs mailing list