NFS client/buffer cache deadlock

Brian Fundakowski Feldman green at freebsd.org
Wed Apr 20 10:30:33 PDT 2005


On Wed, Apr 20, 2005 at 07:12:20PM +0200, Jilles Tjoelker wrote:
> On Wed, Apr 20, 2005 at 11:52:33AM -0400, Brian Fundakowski Feldman wrote:
> > On Wed, Apr 20, 2005 at 05:35:28PM +0200, Marc Olzheim wrote:
> > > On Wed, Apr 20, 2005 at 11:20:38AM -0400, Brian Fundakowski Feldman wrote:
> > > > > Btw.: I'm not sure write(), writev() and pwrite() are allowed to do
> > > > > short writes on regular files...?
> 
> > > > Our manpage is incorrect; POSIX states that they are (see earlier
> > > > e-mail).  There really is no alternative -- we simply can't build
> > > > an NFS transaction larger than our buffer cache can accommodate.
> > > > Note that short writes won't happen for normal buffer sizes, only
> > > > excessively large ones.  I really don't believe that writev() is meant
> > > > to be used so that you can write gigantic data structures in a single
> > > > transaction...
> 
> It is ok to return partial success if the first chunk of a large write
> succeeded and a later chunk failed persistently, but not if it cannot be
> performed as a single NFS transaction.

What is your rationale for this?

> > > Ah, I was reading the SUSv2 page:
> 
> > > http://www.opengroup.org/onlinepubs/009695399/functions/write.html
> 
> > > instead of the POSIX version.
> 
> > > But from neither of those can I deduce that it can return
> > > with result < nbyte without that being a permanent condition.
> > > What phrase makes you conclude that it can?
> 
> > This specific issue is not clear-cut; the best thing to do lies somewhere
> > within the range of these scenarios:
> 
> > "If a write() requests that more bytes be written than there is room
> > for (for example, [XSI] [Option Start] the process' file size limit
> > or [Option End] the physical end of a medium), only as many bytes as
> > there is room for shall be written. For example, suppose there is
> > space for 20 bytes more in a file before reaching a limit. A write of
> > 512 bytes will return 20. The next write of a non-zero number of bytes
> > would give a failure return (except as noted below)."
> 
> This only applies to permanent conditions.
> 
> > "When attempting to write to a file descriptor (other than a pipe or
> > FIFO) that supports non-blocking writes and cannot accept the data
> > immediately:
> 
> >     * If the O_NONBLOCK flag is clear, write() shall block the calling
> >     thread until the data can be accepted.
> 
> >     * If the O_NONBLOCK flag is set, write() shall not block the
> >     thread. If some data can be written without blocking the thread,
> >     write() shall write what it can and return the number of bytes
> >     written. Otherwise, it shall return -1 and set errno to [EAGAIN]."
> 
> I think regular files do not support non-blocking writes, even if they
> are on NFS; in any case, O_NONBLOCK is disabled by default.

POSIX does not specify O_NONBLOCK semantics for regular files.  This
means we can do whatever is most useful.

> > "[ENOBUFS] Insufficient resources were available in the system to
> > perform the operation."
> 
> > I think the first is more useful behavior than the last.  Supporting it
> > should be exactly the same as supporting what happens if the actual
> > filesystem fills up.  In this case, the filesystem is being requested to
> > write more "than there is room for."
> 
> The filesystem filling up is a totally different case as attempting the
> rest of the write is futile in that case.

No, it isn't.  The filesystem may be not-full again soon, possibly
even what the program might consider "immediately".

> In a lot of code, a short write() is treated as a (fairly) persistent
> error.

I mentioned this several e-mails ago.  Plenty of software is also not
going to understand ENOBUFS.
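
For software that does want to cope with either behavior, a caller-side
loop is the usual approach.  The following is only a minimal sketch:
write_all() is a hypothetical helper name, and the retry limit and the
10 ms back-off for ENOBUFS are assumptions made for illustration, not
anything the kernel or POSIX prescribes.

```c
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/*
 * write_all() is a hypothetical helper, not an existing libc function.
 * It returns 0 once all len bytes have been written, or -1 with errno
 * set if write() fails in a way this sketch does not retry.
 */
static int
write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	int retries = 0;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n > 0) {
			/* Full or short write: skip what was accepted. */
			p += n;
			len -= (size_t)n;
			retries = 0;
			continue;
		}
		if (n == -1 && errno == EINTR)
			continue;	/* interrupted before any data moved */
		if (n == -1 && errno == ENOBUFS && retries++ < 5) {
			/* Assumed policy: treat ENOBUFS as transient. */
			struct timespec ts = { 0, 10 * 1000 * 1000 };

			nanosleep(&ts, NULL);
			continue;
		}
		if (n == 0)
			errno = EIO;	/* assumption: no progress is an error */
		return (-1);
	}
	return (0);
}
```

A caller would use write_all(fd, data, size) in place of a bare write()
and check only for -1.  Note that a loop like this gives up the
single-transaction property of one large write(), so it does not settle
whether the kernel should return short in the first place.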

-- 
Brian Fundakowski Feldman                           \'[ FreeBSD ]''''''''''\
  <> green at FreeBSD.org                               \  The Power to Serve! \
 Opinions expressed are my own.                       \,,,,,,,,,,,,,,,,,,,,,,\


More information about the freebsd-hackers mailing list