quick fix for slow directory shrinking in ffs

Tue May 17 22:39:20 UTC 2016

On Wed, 18 May 2016, Konstantin Belousov wrote:

> On Wed, May 18, 2016 at 05:40:07AM +1000, Bruce Evans wrote:
>> Also, ftruncate() seems to be broken.  POSIX doesn't seem to require it
>> to honor O_SYNC, but POLA requires this.  But there is no VOP_TRUNCATE();
>> truncation is done using VOP_SETATTR() and there is no way to pass down
>> the O_SYNC flag to it; in practice, ffs just does UFS_TRUNCATE() without
>> IO_SYNC.
>>
>> This makes a difference mainly for async mounts with my fixes to honor
>> IO_SYNC in ffs_update().  With async mounts, consistency of the file
>> system is not guaranteed but O_SYNC for a file should at least cause
>> all of the file data and most of its metdata to be written.  Not syncing
>> for ftruncate() unnecessarily loses metadata writes.  With !async mounts,
>> consistency of the file system is partly guaranteed and lost metadata
>> writes for ftruncate() shouldn't affect this -- they should just lose
>> the ftruncate() atomically.
>>
>> vfs could do an fsync() after VOP_SETATTR() for the O_SYNC case.  This
>> reduces the race window.
>
> vattr already has the va_vaflags field.  It is trivial to add flag there
> requesting O_SYNC behaviour.  Of course, other updates could also
> honour VA_SYNC, but this is for later.  Like this:
>
> diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c
> index 0a3a88a..1e42a3d 100644
> --- a/sys/kern/vfs_vnops.c
> +++ b/sys/kern/vfs_vnops.c
> @@ -1314,6 +1314,8 @@ vn_truncate(struct file *fp, off_t length, struct ucred *active_cred,
> 	if (error == 0) {
> 		VATTR_NULL(&vattr);
> 		vattr.va_size = length;
> +		if ((fp->f_flag & O_FSYNC) != 0)
> +			vattr.va_vaflags |= VA_SYNC;
> 		error = VOP_SETATTR(vp, &vattr, fp->f_cred);
> 	}
> out:
> diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
> index e82f6ee..41ec7f7 100644
> --- a/sys/sys/vnode.h
> +++ b/sys/sys/vnode.h
> @@ -286,6 +286,7 @@ struct vattr {
>  */
> #define	VA_UTIMES_NULL	0x01		/* utimes argument was NULL */
> #define	VA_EXCLUSIVE	0x02		/* exclusive create request */
> +#define	VA_SYNC		0x04		/* O_SYNC truncation */
>
> /*
>  * Flags for ioflag. (high 16 bits used to ask for read-ahead and
> diff --git a/sys/ufs/ufs/ufs_vnops.c b/sys/ufs/ufs/ufs_vnops.c
> index c0729f8..83df347 100644
> --- a/sys/ufs/ufs/ufs_vnops.c
> +++ b/sys/ufs/ufs/ufs_vnops.c
> @@ -625,7 +625,8 @@ ufs_setattr(ap)
> 			 */
> 			return (0);
> 		}
> -		if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL,
> +		if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL |
> +		    ((vap->va_vaflags & VA_SYNC) != 0 ? IO_SYNC : 0),
> 		    cred)) != 0)
> 			return (error);
> 	}

Looks good.

O_SYNC is actually spelled O_FSYNC in FreeBSD.  You spelled it correctly
in the above, but about 2 places in the kernel use the POSIX spelling.
It is confusing enough to also have the spellings FFSYNC and IO_SYNC
for this flag in different layers.  FFSYNC is for fcntl and must equal
O_FSYNC since the layers are not clearly separated.  IO_SYNC is for
a clearly separated layer and is supposed to be translated to, but it
has the same value as O_FSYNC so O_FSYNC might work accidentally when
not translated.

This should probably also be done for truncations with O_TRUNC at open
time.  There are a couple of these in vfs_syscalls.c.  O_TRUNC is used
much more than ftruncate() so the extra overhead from this would be
more noticeable.  I think the implementation is not very good.  If
open() with O_TRUNC or truncate with O_FSYNC or fsync() fails, then
the file contents might be garbage.  So it would be better to do
large truncations mostly async and only sync at the end.  *fs_truncate()
could operate like that, but I think it takes the IO_SYNC flag as a
directive to do the whole operation synchronously. A non-sync truncation
followed by fsync() is likely to work better for ffs and just work for
all fs's.

Bruce