NFS client/buffer cache deadlock

Bruce Evans bde at zeta.org.au
Wed Apr 20 21:56:16 PDT 2005


On Wed, 20 Apr 2005, Marc Olzheim wrote:

> On Wed, Apr 20, 2005 at 01:16:05PM -0400, Garrett Wollman wrote:
>> <<On Wed, 20 Apr 2005 16:38:42 +0200, Marc Olzheim <marcolz at stack.nl> said:
>>
>>> Btw.: I'm not sure write(),writev() and pwrite() are allowed to do short
>>> writes on regular files... ?
>>
>> I believe it is the intent of the Standard to prohibit this (a
>> paragraph in the rationale says that short writes can only happen if
>> O_NONBLOCK is set, but this is clearly wrong because the normative
>> text says end-of-medium also results in a short write) but there does
>> not appear to be any language which requires atomic behavior for
>> descriptors other than pipes and FIFOs.
>>
>> As a quality-of-implementation matter, for writes to regular files not
>> to be atomic would be considered surprising.
>>
>> -GAWollman
>
> Could someone from standards comment here ? I believe Garrett is
> right...
> (thread is on -hackers and -current)

I can't see anywhere that POSIX actually requires this.
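Since POSIX does not require atomic writes to regular files, portable code has to be prepared for short writes regardless. A minimal sketch of the usual workaround (the helper name `write_fully` is mine, not from any standard API):

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: loop until all bytes are written or a real
 * error occurs, since write(2) on a regular file is not guaranteed
 * to transfer the full count.
 */
ssize_t
write_fully(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t left = len;
	ssize_t n;

	while (left > 0) {
		n = write(fd, p, left);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted; retry */
			/* Report partial progress if any was made. */
			return (left == len ? -1 : (ssize_t)(len - left));
		}
		p += n;
		left -= n;
	}
	return ((ssize_t)len);
}
```

Note that even this loop cannot undo a partial transfer; it only hides short writes from the caller when retrying eventually succeeds.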

ffs only backs out short writes in some (most) cases.  I have the
following notes about this:

% Index: ffs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vnops.c,v
% retrieving revision 1.130
% diff -u -2 -r1.130 ffs_vnops.c
% --- ffs_vnops.c	21 May 2004 12:05:48 -0000	1.130
% +++ ffs_vnops.c	29 May 2004 13:16:07 -0000
% @@ -719,4 +730,8 @@
%  	 * we clear the setuid and setgid bits as a precaution against
%  	 * tampering.
% +	 * XXX too late, the tamperer may have opened the file while we
% +	 * were writing the data (or before).

This comment is probably wrong, since locking should prevent concurrent
access.

% +	 * XXX too early, if (error && ioflag & IO_UNIT) then we will
% +	 * unwrite the data.

(1)

%  	 */
%  	if (resid > uio->uio_resid && ap->a_cred && 
% @@ -725,7 +740,12 @@
%  		DIP(ip, i_mode) = ip->i_mode;
%  	}
% +	/* XXX this seems to be misplaced.  Which resid? */
%  	if (resid > uio->uio_resid)
%  		VN_KNOTE(vp, NOTE_WRITE | (extended ? NOTE_EXTEND : 0));
%  	if (error) {
% +		/*
% +		 * XXX should truncate to the last successfully written
% +		 * data if the uiomove() failed.
% +		 */

(2)

%  		if (ioflag & IO_UNIT) {
%  			(void)UFS_TRUNCATE(vp, osize,

In FreeBSD-1, I fixed the bug mentioned in comments (1) and (2) by
backing out short writes in the error == EFAULT case as well as in
the IO_UNIT case (most writes are with IO_UNIT so short ones get backed
out without this).

Now I think that IO_UNIT shouldn't exist and all writes to regular
files should be as atomic as possible, and in particular, short writes
should always be backed out in the above.
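A userland analogue of the IO_UNIT backout (a sketch only; the real backout happens inside ffs_write() with UFS_TRUNCATE, and the helper name `write_atomically` is hypothetical) looks like this: remember the old size, and if the write fails or comes up short, truncate back so the failed write leaves no trace.

```c
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch of IO_UNIT-style backout in userland: assumes fd is
 * positioned at end of file, so truncating to the old size undoes
 * a partial append.
 */
int
write_atomically(int fd, const void *buf, size_t len)
{
	struct stat st;
	ssize_t n;

	if (fstat(fd, &st) != 0)
		return (-1);
	n = write(fd, buf, len);
	if (n == (ssize_t)len)
		return (0);
	/* Short write or error: back out to the old size. */
	(void)ftruncate(fd, st.st_size);
	(void)lseek(fd, st.st_size, SEEK_SET);
	return (-1);
}
```

Unlike the kernel, userland cannot hold the vnode lock across these steps, so this is only illustrative of the policy, not a safe replacement for it.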

IO_UNIT is set in almost all cases.  It is set in vn_write(), so it
is normally set for regular files.  It is set for all (?) writes
generated by the kernel itself.  E.g., it is set for core dumps,
although that use is bogus since core dumps aren't atomic and even
the atomicity that they once had has rotted (*).  I thought that
IO_UNIT was used in all important cases, but I recently noticed that
nfs doesn't seem to use it.  Perhaps that is the bug that initiated
this thread.

Individual file systems barely use IO_UNIT.  ffs's only use of it
is to give low-quality behaviour in the above.

Handling of short writes (and reads) is broken for non-regular files
too.  One of my notes about this is:

% Index: sys_generic.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v
% retrieving revision 1.131
% diff -u -2 -r1.131 sys_generic.c
% --- sys_generic.c	5 Apr 2004 21:03:35 -0000	1.131
% +++ sys_generic.c	6 Apr 2004 10:17:12 -0000
% @@ -420,4 +415,5 @@
%  		bwillwrite();
%  	if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) {
% +		/* XXX short write botch - should && be ||? */
%  		if (auio.uio_resid != cnt && (error == ERESTART ||
%  		    error == EINTR || error == EWOULDBLOCK))

The condition for a short write is (auio.uio_resid != cnt).  When there is
a short write, the amount that was written must be returned.  This
requires ignoring `error', but `error' is only ignored if it is ERESTART,
EINTR or EWOULDBLOCK.  One case that is broken is writing to a tty that
gets hung up after writing something.  Then a short (non-null) write with
error = EIO is returned.  This case is unimportant, since writing the
data at the level of tty.c is no guarantee that it has gone further than
the driver's buffers (possibly several layers of these), let alone that
it has been seen by its intended recipient.  Short writes for tape drives
are probably the most important.
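The return-value rule being argued for here can be sketched as a small pure function (the name `write_result` and its shape are mine, not the kernel's; the kernel code instead suppresses the error only for ERESTART/EINTR/EWOULDBLOCK): if any bytes were transferred, report the short count and discard the error; the caller will see the error on its next attempt.

```c
#include <errno.h>
#include <sys/types.h>

/*
 * Hypothetical sketch of short-write return semantics: requested is
 * the byte count asked for, resid_left is what uio_resid held after
 * the transfer, error is the driver/fs error code (0 if none).
 */
ssize_t
write_result(size_t requested, size_t resid_left, int error)
{
	size_t done = requested - resid_left;

	if (done > 0)
		return ((ssize_t)done);	/* short write: ignore error */
	if (error != 0) {
		errno = error;
		return (-1);
	}
	return (0);
}
```

Under this rule the hung-up tty case returns the short count instead of -1/EIO, matching what callers of write(2) on regular files already expect.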

This bug would be more obvious if it happened for regular files.  Then
callers could easily see that the file was extended despite write()
bogusly returning -1.  In fact, my simple test for this failure
(written in 1996) still acts strangely for nfs under my 1-year-old
version of -current:

%%%
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE	0x10000

int
main(void)
{
	void *buf;
	int n;

	/*
	 * Ask to write twice as much as was allocated, so the copy
	 * from user space can fault partway through the transfer.
	 */
	buf = malloc(SIZE);
	n = write(1, buf, 2 * SIZE);
	perror("");
	fprintf(stderr, "write %d\n", n);
	fprintf(stderr, "offset %qd\n", lseek(1, 0ll, SEEK_CUR));
	return (0);
}
%%%

This doesn't show any bugs directly.  write() returns -1 and the offset
is correct (65536).  However, ls shows a file size of 73728 (8192 too
long), and trying to read the extra bytes returns 0 and magically fixes
up the file size to 65536.

(*) On non-atomicity of core dumps:  In FreeBSD-1, the vnode was locked
throughout a core dump, so writing cores was atomic even if there were
several sections, but if one of the writes failed then the previous
successful ones were not backed out, so the core was garbage after an
error, and there was no indication of the error to userland.  Now the
vnode is intentionally left unlocked and even naturally large blocks
(sections for ELF) are intentionally written non-atomically using
vn_rdwr_inchunks(), so that the whole file system doesn't tend
to hang while writing cores.  Error handling after a write failure
is no better, and applications can see cores that are garbage because
they are incomplete.

Bruce

