quick fix for slow directory shrinking in ffs

Tue May 17 11:17:25 UTC 2016

On Tue, May 17, 2016 at 08:26:26PM +1000, Bruce Evans wrote:
> On Tue, 17 May 2016, Konstantin Belousov wrote:
> 
> > On Tue, May 17, 2016 at 07:54:27AM +1000, Bruce Evans wrote:
> >> ffs does very slow shrinking of directories after removing some files
> >> leaves unused blocks at the end, by always doing synchronous truncation.
> >> ...
> >> X Index: ufs_lookup.c
> >> X ===================================================================
> >> X --- ufs_lookup.c	(revision 299263)
> >> X +++ ufs_lookup.c	(working copy)
> >> X @@ -1131,9 +1131,9 @@
> >> X  		if (tvp != NULL)
> >> X  			VOP_UNLOCK(tvp, 0);
> >> X  		error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff,
> >> X -		    IO_NORMAL | IO_SYNC, cr);
> >> X +		    IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr);
> >> X  		if (error != 0)
> >> X -			vprint("ufs_direnter: failted to truncate", dvp);
> >> X +			vprint("ufs_direnter: failed to truncate", dvp);
> >> X  #ifdef UFS_DIRHASH
> >> X  		if (error == 0 && dp->i_dirhash != NULL)
> >> X  			ufsdirhash_dirtrunc(dp, dp->i_endoff);
> >
> > The IO_SYNC flag, for non-journaled SU and any kind of non-SU mounts,
> > only affects the new blocks allocation mode, and write-out mode for
> > the last fragment. The truncation itself (for -J) is performed in the
> > context of the truncating thread. The cg blocks, after the bits are
> > set to free, are marked for delayed write (with the background write
> > hack). The inode block is written according to the mount mode, ignoring
> > IO_SYNC.
> 
> I don't see why you think that.  ffs_truncate() clearly honors IO_SYNC,
> and testing shows that ffs with soft updates does precisely 7 extra
> sync writes for directory compaction (where some of the 7 are probably
> to sync previous activity).
ffs_truncate() completely syncs the vnode for non-J truncations.
I enumerated bits which are written according to the flags, and it
seems to be aligned with what you wrote below.

> 
> I think it would be wrong to ignore IO_SYNC and use the mount mode for
> inodes.  Async mounts still have that bug IIRC (I fixed locally long
> ago).  IO_SYNC is set if the file is is open with O_SYNC and mount mode
> must not override this.  I think ffs has no way of telling that this
> particular IO_SYNC is not associated with O_SYNC.
> 
> > That is, for always fully populated directory files, I do not see how
> > anything is changed by the patch.
> 
> This problem affects all 512-boundaries, which are rarely block or
> even fragment boundaries.
Yes, the write-outs of the blocks or fragments at the new end of the
file are not needed if the buffer is clear and not newly allocated.
But they are performed unconditionally.

> 
> The IO_SYNC for soft updates apparently turns all the previous writes for
> the loop into sync ones.  It has to order them and wait for them and there
> is no better way to wait than a sync write.  The ordering makes an
> unnecessary sync write even more expensive for soft updates than for
> other cases.
> 
> Some relevant code in ffs_truncate:
> 
> Y 	/*
> Y 	 * Shorten the size of the file. If the file is not being
> Y 	 * truncated to a block boundary, the contents of the
> Y 	 * partial block following the end of the file must be
> Y 	 * zero'ed in case it ever becomes accessible again because
> Y 	 * of subsequent file growth. Directories however are not
> Y 	 * zero'ed as they should grow back initialized to empty.
> Y 	 */
> Y 	offset = blkoff(fs, length);
> Y 	if (offset == 0) {
> Y 		ip->i_size = length;
> Y 		DIP_SET(ip, i_size, length);
> Y 	} else {
> Y 		lbn = lblkno(fs, length);
> Y 		flags |= BA_CLRBUF;
> Y 		error = UFS_BALLOC(vp, length - 1, 1, cred, flags, &bp);
> Y 		if (error) {
> Y 			return (error);
> Y 		}
> Y 		/*
> Y 		 * When we are doing soft updates and the UFS_BALLOC
> Y 		 * above fills in a direct block hole with a full sized
> Y 		 * block that will be truncated down to a fragment below,
> Y 		 * we must flush out the block dependency with an FSYNC
> Y 		 * so that we do not get a soft updates inconsistency
> Y 		 * when we create the fragment below.
> Y 		 */
> Y 		if (DOINGSOFTDEP(vp) && lbn < NDADDR &&
> Y 		    fragroundup(fs, blkoff(fs, length)) < fs->fs_bsize &&
> Y 		    (error = ffs_syncvnode(vp, MNT_WAIT)) != 0)
> Y 			return (error);
> Y 		ip->i_size = length;
> Y 		DIP_SET(ip, i_size, length);
> Y 		size = blksize(fs, ip, lbn);
> Y 		if (vp->v_type != VDIR)
> Y 			bzero((char *)bp->b_data + offset,
> Y 			    (u_int)(size - offset));
> Y 		/* Kirk's code has reallocbuf(bp, size, 1) here */
> Y 		allocbuf(bp, size);
> Y 		if (bp->b_bufsize == fs->fs_bsize)
> Y 			bp->b_flags |= B_CLUSTEROK;
> Y 		if (flags & IO_SYNC)
> Y 			bwrite(bp);
> Y 		else
> Y 			bawrite(bp);
> Y 	}
> 
> I think we usually arrive here and honor the IO_SYNC flag.  This is correct.
> Otherwise, we always do an async write, but that is wrong for async mounts.
> Here is my old fix for this:
> 
> Z diff -u2 ffs_inode.c~ ffs_inode.c
> Z --- ffs_inode.c~	Wed Apr  7 21:22:26 2004
> Z +++ ffs_inode.c	Sat Mar 23 01:23:16 2013
> Z @@ -345,4 +431,6 @@
> Z  		if (flags & IO_SYNC)
> Z  			bwrite(bp);
> Z +		else if (DOINGASYNC(ovp))
> Z +			bdwrite(bp);
> Z  		else
> Z  			bawrite(bp);
> 
> This fix must be sprinkled in most places where there is a bwrite()/
> bawrite() decision.
No, I do not think that it would be correct for SU mounts.  It is essential
for the correct operation of e.g. ffs_indirtrunc() that writes for SU
case are synchronous, since no dependencies on the indirect block updates
are recorded. The fact that syncvnode() is done before is similarly
important, because no existing dependencies are cleared.

On the other hand, I agree with the note that the final ffs_update()
must honour IO_SYNC requests.

Anyway, my point was that your patch does not change the hardest source
of sync writes, only the write of the final block.  I will commit the
following.

diff --git a/sys/ufs/ffs/ffs_inode.c b/sys/ufs/ffs/ffs_inode.c
index 0202820..50b456b 100644
--- a/sys/ufs/ffs/ffs_inode.c
+++ b/sys/ufs/ffs/ffs_inode.c
@@ -610,7 +610,7 @@ extclean:
 		softdep_journal_freeblocks(ip, cred, length, IO_EXT);
 	else
 		softdep_setup_freeblocks(ip, length, IO_EXT);
-	return (ffs_update(vp, !DOINGASYNC(vp)));
+	return (ffs_update(vp, (flags & IO_SYNC) != 0 || !DOINGASYNC(vp)));
 }
 
 /*
diff --git a/sys/ufs/ufs/ufs_lookup.c b/sys/ufs/ufs/ufs_lookup.c
index 43b4e5c..53536ff 100644
--- a/sys/ufs/ufs/ufs_lookup.c
+++ b/sys/ufs/ufs/ufs_lookup.c
@@ -1131,7 +1131,7 @@ ufs_direnter(dvp, tvp, dirp, cnp, newdirbp, isrename)
 		if (tvp != NULL)
 			VOP_UNLOCK(tvp, 0);
 		error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff,
-		    IO_NORMAL | IO_SYNC, cr);
+		    IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr);
 		if (error != 0)
 			vprint("ufs_direnter: failed to truncate", dvp);
 #ifdef UFS_DIRHASH