Strange behaviour with sappend flag set on ZFS

Fri Sep 24 03:16:54 UTC 2010

On Fri, 24 Sep 2010, Markus Gebert wrote:

> On 23.09.2010, at 19:38, Michael Naef wrote:
>
>> If I open (2) a file with O_APPEND the fd position pointer is set
>> to the start of the file. When I then write to an fd on a ufs FS in
>> FB6.4, it is automatically moved to the end and the bytes are
>> appended. No problems appear with write to the sappend-flages file.
>>
>> However if I do the same on FB8.1/ZFS, the write fails as "not
>> permitted".
>
> To me, it seems that the zfs_write() VOP incorrectly uses FAPPEND instead of IO_APPEND when checking the ioflag argument to see if the current write is an appending one, here:
>
> ----
> 	/*
> 	 * If immutable or not appending then return EPERM
> 	 */
> 	pflags = zp->z_phys->zp_flags;
> 	if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
> 	    ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
> 	    (uio->uio_loffset < zp->z_phys->zp_size))) {
> 		ZFS_EXIT(zfsvfs);
> 		return (EPERM);
> 	}
> ----
>
> The function comment only mentions IO_APPEND and even later on in zfs_write() IO_APPEND is used to check wether the offset should be changed to EOF, so FAPPEND really seems to wrong in the above permission check.

Oops, the file[descriptor] append flags aren't all the same like I said in
a previous reply.  FAPPEND is identical with O_APPEND and is 8, but IO_APPEND
is 2.  Here zfs is testing IO_NODELOCKED = 8.

The use of IO_NODELOCKED is confusing and slightly broken.  It is only
a flag for vn_rdwr() and a couple of other in-kernel i/o functions.
These start with the vnode locked and set IO_NODELOCKED to tell vn_rdwr()
not to lock again.  Userland writes call vn_rdwr() without the vnode
locked and keep IO_NODELOCKED clear to tell vn_rdwr() to lock the
vnode.  vn_rdwr() doesn't set it so it is garbage after that -- it remains
consistent with the vnode lock state for in-kernel writes, but for userland
writes it is inconsistent with the vnode lock state below the level of
vn_rdwr().  VOP_WRITE() is called from places other than vn_rdwr(), and
most or all of these places also fail to set IO_NODELOCKED.  They also
mostly or always know that they are for in-kernel writes, so they also
don't check IO_NODELOCKED.

Thus in the above, the FAPPEND bit is usually clear so the test doesn't
work (the cases where FAPPEND is set where it should be clear, and
where ZFS_APPENDONLY is also set, are rare since they only occur for
in-kernel writes of things like core dumps, and you have other problems
if you have many append-only files being written to for things like
core dumps).  My previous mail was incorrect about this point, and
understated the ways in which the comment doesn't match the code --
apart from FAPPEND not being the append-flag, the comment combined
with the reported behaviour and mu confusion about the equality of the
flags helped me misread the IO_APPEND test backwards.

There is similar confusion and possibly bugs for IO_UNIT.  Now all
callers seem to set it, and most VOP_WRITE() functions can't work right
without it.  E.g., they all need to back out of short writes to satisfy
the implementation of sys_generic.c:write(); ffs_write() only backs
out ionly if IO_UNIT is set; it is always set for userland writes so
there is no problem with write(), but why not remove it as a flag and
always do atomic writes to simplify things?  Kernel writers are even
less prepared than userland writers to recover from short writes.

> CURRENT and STABLE-8 seem to be affected to. The following patch seems to fix it (at least Michi's test case works fine with it):
>
> ----
> diff -ru ../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c ./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
> --- ../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c   2010-05-19 08:49:52.000000000 +0200
> +++ ./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c        2010-09-23 23:24:43.549846948 +0200
> @@ -709,7 +709,7 @@
>         */
>        pflags = zp->z_phys->zp_flags;
>        if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
> -           ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
> +           ((pflags & ZFS_APPENDONLY) && !(ioflag & IO_APPEND) &&
>            (uio->uio_loffset < zp->z_phys->zp_size))) {
>                ZFS_EXIT(zfsvfs);
>                return (EPERM);

Now I think the not-just-a-spelling error was the only major bug, and this
fixes it.

There is another minor bug/inconsistency with ffs: in the non-IO_APPEND
case, ffs requires the offset to == the file size, while the above
allows it to be >= the file size.

Bruce