File remove problem

Sun Dec 2 20:03:43 PST 2007

On Sun, 2 Dec 2007, Don Lewis wrote:

> On  2 Dec, Bruce Evans wrote:
>
>> So it should be safe to remove all the r/o checks in ufs_inactive() after
>> fixing the other bugs.  ffs_truncate alread panics if fs_ronly, but only
>> in some cases.  In particular, it doesn't panic for truncations that don't
>> change the file size.  Such truncations aren't quite null, since standards
>> require [f]truncate(2) to mark the ctime and mtime for update.
>> ffs_truncate() sets the marks, which is correct for null truncations from
>> userland but not ones from syncer internals.  Setting of the marks when
>> fs_ronly is set should cause panics later (my patch has a vprint() for it).
>
> I think the MNT_RDONLY check in ufs_itimes_locked() should be also be
> changed to look at fs_ronly and panic if any marks are set.  This will
> require some changes to add some early MNT_RDONLY checks.

Yes, already done (except vprint() instead of panic).

> In particular, ffs_read() and ffs_extread() need to check MNT_RDONLY in
> addition to MNT_NOATIME (as is already done in vfs_mark_atime()).  This
> also looks like it should be a reasonable optimization for read-only
> file systems that should eliminate unnecessary work at the lower levels
> of the code.

But I let these happen and discard IN_ATIME marks if fs_ronly.  I
thought that the optimization went the other way -- unconditionally
setting the marks was very efficient, and discarding them in ufs_itimes()
was efficient too.  I think this is still true now with larger locking
overheads, and the marks should be discarded later in the MNT_NOATIME
case too.  It is expected that the marks are set much more often than
they are looked at by ufs_itimes(), since most calls to ufs_itimes()
are in close() and read() is much more common than close().  ufs_itimes()
is also called in stat() but I think that is less common than close()
(except for some tree walks).  WIth non-delayed marking, ufs_itimes()
would still have to check fs_ronly, and the only gain would be that
it could then skip checking the marks except as an invariants check.
But it can gain like that even with delayed setting -- just ignore any
old marks while fs_ronly (except as an invariants check), but clear them
at mount or unmount time so that there shouldn't be any.

> The early IN_ACCESS flag setting in ufs_setattr(), before the MNT_RDONLY
> check, appears to be protected by the MNT_RDONLY check in
> vfs_mark_atime().

Thanks, I had forgotten about that.  In vfs_mark_atime(), there is much
more efficiency to be gained by not setting marks that will be discarded,
since it takes a VOP to set them and many file systems don't support
this setting.  However, it is hard for vfs_mark_atime() to know when the
mark will be discarded without calling the fs:

- it already doesn't know which fs's support it
- it should be checking fs_ronly for ffs
- it seems to be missing locking for MNT_NOATIME and MNT_RDONLY

fs-level locking for MNT_NOATIME and MNT_RDONLY and fs_ronly is dubious
too.  Upper layers set the MNT flags before giving VOP_MOUNT() a chance
to adjust the marks.  This is automatically safe in one direction only
(e.g., setting MNT_NOATIME or MNT_RDONLY is fail-safe since it stops
changes), and always bad for strict invariants.

I now use the following fixes:

% Index: ufs_inode.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_inode.c,v
% retrieving revision 1.69
% diff -u -2 -r1.69 ufs_inode.c
% --- ufs_inode.c	22 Jun 2007 13:22:37 -0000	1.69
% +++ ufs_inode.c	2 Dec 2007 13:51:12 -0000
% @@ -90,7 +90,5 @@
%  	ufs_gjournal_close(vp);
%  #endif
% -	if ((ip->i_effnlink == 0 && DOINGSOFTDEP(vp)) ||
% -	    (ip->i_nlink <= 0 &&
% -	     (vp->v_mount->mnt_flag & MNT_RDONLY) == 0)) {
% +	if (ip->i_effnlink <= 0) {
%  	loop:
%  		if (vn_start_secondary_write(vp, &mp, V_NOWAIT) != 0) {

Back out 1.64 == restore 1.40.

Always check i_effnlink so that there is no difference for the soft updates
case.

Use consistent style `<= 0' for things that should be >= 0.

% @@ -120,7 +118,7 @@
%  		}
%  	}
% -	if (ip->i_effnlink == 0 && DOINGSOFTDEP(vp))
% +	if (ip->i_effnlink <= 0 && DOINGSOFTDEP(vp))
%  		softdep_releasefile(ip);
% -	if (ip->i_nlink <= 0 && (vp->v_mount->mnt_flag & MNT_RDONLY) == 0) {
% +	if (ip->i_nlink <= 0) {
%  #ifdef QUOTA
%  		if (!getinoquota(ip))

Back out 1.64 == restore 1.40 (now it's duplicated).

Use consistent style `<= 0' for things that should be >= 0.

% @@ -147,17 +145,7 @@
%  		UFS_VFREE(vp, ip->i_number, mode);
%  	}
% -	if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
% -		if ((ip->i_flag & (IN_CHANGE | IN_UPDATE | IN_MODIFIED)) == 0 &&
% -		    mp == NULL &&
% -		    vn_start_secondary_write(vp, &mp, V_NOWAIT)) {
% -			mp = NULL;
% -			ip->i_flag &= ~IN_ACCESS;
% -		} else {
% -			if (mp == NULL)
% -				(void) vn_start_secondary_write(vp, &mp,
% -								V_WAIT);
% -			UFS_UPDATE(vp, 0);
% -		}
% -	}
% +	if (mp == NULL)
% +		(void) vn_start_secondary_write(vp, &mp, V_WAIT);
% +	UFS_UPDATE(vp, 0);
%  out:
%  	/*

Unrelated change: don't do extra work to break IN_ACCESS while busy
snapshotting.  This is now handled better by transferring IN_ACCESS
to IN_LAZYACCESS elsewhere.  Discarding IN_ACCESS here just breaks
this transfer.  Here I think it was only a hack to handle one call to
UFS_UPDATE(), but there are calls to UFS_UPDATE() all over and the
others caused bugs which were eventually fixed in a better way.

% Index: ufs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_vnops.c,v
% retrieving revision 1.293
% diff -u -2 -r1.293 ufs_vnops.c
% --- ufs_vnops.c	8 Nov 2007 17:21:51 -0000	1.293
% +++ ufs_vnops.c	2 Dec 2007 04:56:58 -0000
% @@ -89,4 +89,5 @@
%  #endif
% 
% +#include <ufs/ffs/fs.h>
%  #include <ufs/ffs/ffs_extern.h>
% 
% @@ -137,8 +138,38 @@
% 
%  	ip = VTOI(vp);
% +	/*
% +	 * MNT_RDONLY can barely be trusted here.  Full r/o mode is indicated
% +	 * by fs_ronly, and the MNT_RDONLY setting [should] differ from the
% +	 * fs_ronly setting only during transition from r/w mode to r/o mode.
% +	 * We set IN_ACCESS even in full r/o mode, so we must discard it
% +	 * unconditionally here.  During the transition, we must convert
% +	 * IN_CHANGE | IN_UPDATE to IN_MODIFIED, since some callers neglect
% +	 * to set IN_MODIFIED.  We also set the timestamps indicated by
% +	 * IN_CHANGE | IN_UPDATE normally during the transition, since the
% +	 * update marks may have been set correctly before the transition and
% +	 * not yet converted into timestamps.  Callers that set IN_CHANGE |
% +	 * IN_UPDATE during the transition are buggy, since userland writes
% +	 * are supposed to be denied (by MNT_RDONLY checks) during the
% +	 * transition, while kernel writes should should only be for syncs
% +	 * and syncs should not touch timestamps except to convert old
% +	 * update marks to timestamps.  Callers that set any update mark or
% +	 * modification flag except IN_ACCESS while in full r/o mode are
% +	 * broken; we will panic for them later.
% +	 */
%  	if ((vp->v_mount->mnt_flag & MNT_RDONLY) != 0)
% -		goto out;
% +		ip->i_flag &= ~IN_ACCESS;
%  	if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0)
%  		return;
% +	if (ip->i_fs->fs_ronly) {	/* XXX locking? */
% +		vprint("ufs_itimes_locked: r/o mod", vp);
% +		/*
% +		 * Should panic here, or return and let ffs_update() panic.
% +		 * The fs_ronly check in ffs_update() is now almost redundant
% +		 * and should not succeed, so it should be replaced by a
% +		 * panic.  It detects more invariants failures than we detect
% +		 * here.
% +		 */
% +		goto out;
% +	}
% 
%  	if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp))

This essentially backs out 1.280 to restore 1.182.  This shouldn't be
needed (except for invariants checking), but is currently needed to
work around soft updates and other things setting IN_CHANGE when they
should set IN_MODIFIED (and IN_CHANGE only sometimes).

With these patches, everything seems to work perfectly in -current
except for a bug in kqueue with the unlinked open file case which is
the main thing handled by the fix in ufs_inode.c:

     cp /etc/passwd /f/a
     tail -f /f/a &
     rm /f/a
     umount -f /f/a      # or mount -u -o ro /f/a after fixing the bugs

This leaves tail -f waiting in kqueue.

Everything seems to work perfectly in 5.2 without these patches.  kqueue
returns when its open file is forcibly closed in preparation for
forcibly removing it for the forced umount.

Bruce