Strange behaviour with sappend flag set on ZFS

Markus Gebert markus.gebert at hostpoint.ch
Fri Sep 24 15:46:17 UTC 2010


On 24.09.2010, at 05:16, Bruce Evans wrote:

> On Fri, 24 Sep 2010, Markus Gebert wrote:
> 
>> On 23.09.2010, at 19:38, Michael Naef wrote:
>> 
>>> If I open (2) a file with O_APPEND the fd position pointer is set
>>> to the start of the file. When I then write to an fd on a ufs FS in
>>> FB6.4, it is automatically moved to the end and the bytes are
>>> appended. No problems appear with write to the sappend-flages file.
>>> 
>>> However if I do the same on FB8.1/ZFS, the write fails as "not
>>> permitted".
>> 
>> To me, it seems that the zfs_write() VOP incorrectly uses FAPPEND instead of IO_APPEND when checking the ioflag argument to see if the current write is an appending one, here:
>> 
>> ----
>> 	/*
>> 	 * If immutable or not appending then return EPERM
>> 	 */
>> 	pflags = zp->z_phys->zp_flags;
>> 	if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
>> 	    ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
>> 	    (uio->uio_loffset < zp->z_phys->zp_size))) {
>> 		ZFS_EXIT(zfsvfs);
>> 		return (EPERM);
>> 	}
>> ----
>> 
>> The function comment only mentions IO_APPEND and even later on in zfs_write() IO_APPEND is used to check wether the offset should be changed to EOF, so FAPPEND really seems to wrong in the above permission check.
> 
> Oops, the file[descriptor] append flags aren't all the same like I said in
> a previous reply.  FAPPEND is identical with O_APPEND and is 8, but IO_APPEND
> is 2.  Here zfs is testing IO_NODELOCKED = 8.

Yes, I checked that before posting and forgot to mention it, sorry for that, could have saved you some time :-). Then again if it had been a "spelling-only bug", my patch wouldn't have changed any behaviour, but instead it did make Michi's test case work properly.


> The use of IO_NODELOCKED is confusing and slightly broken.  It is only
> a flag for vn_rdwr() and a couple of other in-kernel i/o functions.
> These start with the vnode locked and set IO_NODELOCKED to tell vn_rdwr()
> not to lock again.  Userland writes call vn_rdwr() without the vnode
> locked and keep IO_NODELOCKED clear to tell vn_rdwr() to lock the
> vnode.  vn_rdwr() doesn't set it so it is garbage after that -- it remains
> consistent with the vnode lock state for in-kernel writes, but for userland
> writes it is inconsistent with the vnode lock state below the level of
> vn_rdwr().  VOP_WRITE() is called from places other than vn_rdwr(), and
> most or all of these places also fail to set IO_NODELOCKED.  They also
> mostly or always know that they are for in-kernel writes, so they also
> don't check IO_NODELOCKED.
> 
> Thus in the above, the FAPPEND bit is usually clear so the test doesn't
> work (the cases where FAPPEND is set where it should be clear, and
> where ZFS_APPENDONLY is also set, are rare since they only occur for
> in-kernel writes of things like core dumps, and you have other problems
> if you have many append-only files being written to for things like
> core dumps).  My previous mail was incorrect about this point, and
> understated the ways in which the comment doesn't match the code --
> apart from FAPPEND not being the append-flag, the comment combined
> with the reported behaviour and mu confusion about the equality of the
> flags helped me misread the IO_APPEND test backwards.

Makes sense. This explains why the test case failed so realiably although zfs_write() checked the wrong bit.


> There is similar confusion and possibly bugs for IO_UNIT.  Now all
> callers seem to set it, and most VOP_WRITE() functions can't work right
> without it.  E.g., they all need to back out of short writes to satisfy
> the implementation of sys_generic.c:write(); ffs_write() only backs
> out ionly if IO_UNIT is set; it is always set for userland writes so
> there is no problem with write(), but why not remove it as a flag and
> always do atomic writes to simplify things?  Kernel writers are even
> less prepared than userland writers to recover from short writes.
> 
>> CURRENT and STABLE-8 seem to be affected to. The following patch seems to fix it (at least Michi's test case works fine with it):
>> 
>> ----
>> diff -ru ../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c ./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
>> --- ../src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c   2010-05-19 08:49:52.000000000 +0200
>> +++ ./sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c        2010-09-23 23:24:43.549846948 +0200
>> @@ -709,7 +709,7 @@
>>        */
>>       pflags = zp->z_phys->zp_flags;
>>       if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
>> -           ((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
>> +           ((pflags & ZFS_APPENDONLY) && !(ioflag & IO_APPEND) &&
>>           (uio->uio_loffset < zp->z_phys->zp_size))) {
>>               ZFS_EXIT(zfsvfs);
>>               return (EPERM);
> 
> Now I think the not-just-a-spelling error was the only major bug, and this
> fixes it.

Ok. Anyone going to commit the patch?


> There is another minor bug/inconsistency with ffs: in the non-IO_APPEND
> case, ffs requires the offset to == the file size, while the above
> allows it to be >= the file size.


I assume you're talking about the "file has appending-only flag set but IO_APPEND is not set" case (I didn't look at the ufs code).

The lseek(2) man page says:

----
     The lseek() system call allows the file offset to be set beyond the end
     of the existing end-of-file of the file.  If data is later written at
     this point, subsequent reads of the data in the gap return bytes of zeros
     (until data is actually written into the gap).
----

Don't know VFS well enough to tell, but why would appending something with a gap/hole (of zeroes) before it not be considered appending (in the append-only file flag sense)? Unless I missed something (like the offset being modified somewhere else in the kernel before the VOP call), depending on how one answers this question, either UFS or ZFS is right, it seems.


Markus




More information about the freebsd-fs mailing list