kern/68690: write(2) returns wrong vlalue when EFAULT

Wed Jul 7 01:30:24 PDT 2004

The following reply was made to PR kern/68690; it has been noted by GNATS.

From: Bruce Evans <bde at zeta.org.au>
To: KOIE Hidetaka <koie at suri.co.jp>
Cc: freebsd-gnats-submit at FreeBSD.org, freebsd-bugs at FreeBSD.org
Subject: Re: kern/68690: write(2) returns wrong vlalue when EFAULT 
Date: Wed, 7 Jul 2004 18:23:57 +1000 (EST)

 On Mon, 5 Jul 2004, KOIE Hidetaka wrote:

 > >Description:
 > Invoking write(fd, buf, size), if buf has both validand invalid segment,
 > write return -1, but it's file pointer has been advanced.
 > The caller mistakes the pointer stays.

 This is a longstanding bug.  It is more of a problem for non-seekable
 devices since physically written can't be undone, so the best possible
 error handling is to return a short write as specified in all versions
 of POSIX.

 Do you actually see the file pointer advanced?  This may be file system
 dependent.  ffs is supposed to back out of the write, and it does so
 for me.  Output:

 %%%
 pos=0
 write(20480)->-1 (should be 12288)
 write: Bad address
 pos=0 (should be 12288)
 %%%

 This is correct.  We tried to write 20K but the buffer size is only 12K
 and the was an i/o error after 12K.  ffs_write() handled the error by
 truncating the file to its original size and rerstoring the original
 file system, and since the write was at the end of the file this has
 the same result as if the write had failed with EFAULT after 0K.  I
 think there is also no way for other processes to see the write while
 it is in progress, since even if they have mmapped the file they can't
 see beyond its original length.

 To show an actual bug for ffs, modifiy the example slightly so that
 it overwrites existing data.  Then truncating the file has no effect,
 and it is just wrong for write() to return -1 after clobbering some
 data.

 > >Fix:

 (1) Here is ffs_write()'s buggy error handling:

 % 	for (error = 0; uio->uio_resid > 0;) {
 % 		...
 % 		if (uio->uio_offset + xfersize > ip->i_size)
 % 			vnode_pager_setsize(vp, uio->uio_offset + xfersize);

 Not error handling, but general setup.  Hopefully extending the size at the
 vm level doesn't allow other other processes to see up to the new EOF yet.

 %
 %                 /*
 % 		 * We must perform a read-before-write if the transfer size
 % 		 * does not cover the entire buffer.
 %                  */
 % 		if (fs->fs_bsize > xfersize)
 % 			flags |= BA_CLRBUF;
 % 		else
 % 			flags &= ~BA_CLRBUF;

 Security-related code.  Other processes can see buffer contents via
 mmap(), at least after completion of extension of the file at the vfs
 level, so we must clear the buffer contents now.  This also avoids
 potential leakage of unwritten buffer contents to the same process
 after buggy EFAULT handling.

 % 		...
 % 		/*
 % 		 * If the buffer is not valid we have to clear out any
 % 		 * garbage data from the pages instantiated for the buffer.
 % 		 * If we do not, a failed uiomove() during a write can leave
 % 		 * the prior contents of the pages exposed to a userland
 % 		 * mmap().  XXX deal with uiomove() errors a better way.
 % 		 */
 % 		if ((bp->b_flags & B_CACHE) == 0 && fs->fs_bsize <= xfersize)
 % 			vfs_bio_clrbuf(bp);

 More buffer clearing, for another case.  It's not so clear that this is
 done before other processes can see the buffer.

 % 	...
 % 	/*
 % 	 * If we successfully wrote any data, and we are not the superuser
 % 	 * we clear the setuid and setgid bits as a precaution against
 % 	 * tampering.
 % 	 */
 % 	if (resid > uio->uio_resid && ap->a_cred &&
 % 	    suser_cred(ap->a_cred, PRISON_ROOT)) {
 % 		ip->i_mode &= ~(ISUID | ISGID);
 % 		DIP(ip, i_mode) = ip->i_mode;
 % 	}

 Misplaced code.  We are going to back out of the write in some cases, so
 we don't know if we successefully wrote any data here.

 % 	...
 % 	if (error) {
 % 		if (ioflag & IO_UNIT) {
 % 			(void)UFS_TRUNCATE(vp, osize,
 % 			    IO_NORMAL | (ioflag & IO_SYNC),
 % 			    ap->a_cred, uio->uio_td);
 % 			uio->uio_offset -= resid - uio->uio_resid;
 % 			uio->uio_resid = resid;
 % 		}
 % 	} ...

 Here is where we back out of the write in some cases.
 (a) We only back out in the IO_UNIT case, but that is the usual case and
     is always the case for write(2).  (There is another bug suite
     involving IO_UNIT.  Most references to IO_UNIT in the kernel except
     those related to userland i/o don't actually give atomic writes
     since writes are made non-atomic using vfs_rdwr_inchunks() anyway.
     OTOH, there seems to be no reason for ffs to skip the backout in
     the !IO_UNIT case, so there is no need for the IO_UNIT flag to exist.)
 (b) We cannot back out if anything was written before the original end of
     the file.  Truncation just increases the mess in that case, since it
     only reduces the file to its original size, but the file position
     (in uio->uio_offset) is reduced by the full amount.

 Fixes for ffs_write(): ignore IO_UNIT, and don't rewind the file pointer
 below osize.  This gives short for writes that wrote something before
 osize and then failed.

 Similarly for other file systems.

 Here is write(2)'s buggy error handling (in dofilewrite()):

 % 	if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) {
 % 		if (auio.uio_resid != cnt && (error == ERESTART ||
 % 		    error == EINTR || error == EWOULDBLOCK))
 % 			error = 0;

 (auio.uio_resid != cnt) means that the write was short.  It should be
 returned to userland as a short write by setting error to 0 unconditionally.
 LOwer layers mostly get this write and return a short write if and only
 if they couldn't back out of the write, but they still return a nonzero
 error code to indicate that there was a problem.  The userland interface
 (write(2) in this case) just doesn't support returning both an error code
 and a byte count, so one of them must be discarded, and it must be the
 error code since it is useful and POSIX standard to return the byte count.

 Since ffs attempts to backs out of failing writes, and the backup is
 successful in most cases, this bug mainly affects devices.

 Similarly for read(2) and writev(2), etc.

 Bruce