gvirstor & UFS

Wed Mar 28 01:38:50 UTC 2007

On Wed, 28 Mar 2007, Ivan Voras wrote:

> I'm having trouble recovering from "ENOSPC" situation in gvirstor - when
> there's enough space on the virtual device, but not enough physical
> space. No matter what I return to the upper layers (UFS), including EIO,
> it seems to keep on retrying, spitting enormous amounts of messages to
> the kernel log in g_vfs_done (and during this the console is stuck).
> This is the same problems reported by testers some time ago.
>
> Maybe the solution is as simple as sticking a check for ENOSPC somewhere
> in UFS code to make it a special case, but I doubt it's that simple. The
> required behaviour is to, if this condition is reached, drop the current
> IO request (returning appropriate errno to the application) and stop
> retrying (or at least ignore further error messages) until the condition
> is cleared.

The following old patch may help.  vfs retries too hard after write
errors.  Retrying after EIO is bad enough (since most parts of the
kernel still expect the old treatment of not retrying), but retrying
after a non-recoverable error is just a bug.

%%%
Index: vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.436
diff -u -2 -r1.436 vfs_bio.c

--- vfs_bio.c	17 Jun 2004 17:16:49 -0000	1.436
+++ vfs_bio.c	17 Apr 2005 05:00:21 -0000
@@ -1222,19 +1312,16 @@
  	s = splbio();

-	if (bp->b_iocmd == BIO_WRITE &&
-	    (bp->b_ioflags & BIO_ERROR) &&
-	    !(bp->b_flags & B_INVAL)) {
+	if (bp->b_iocmd == BIO_WRITE && (bp->b_ioflags & BIO_ERROR) &&
+	    (bp->b_flags & B_INVAL) == 0 &&
+	    (bp->b_error == EIO || bp->b_error == 0)) {
  		/*
  		 * Failed write, redirty.  Must clear BIO_ERROR to prevent
-		 * pages from being scrapped.  If B_INVAL is set then
-		 * this case is not run and the next case is run to 
-		 * destroy the buffer.  B_INVAL can occur if the buffer
-		 * is outside the range supported by the underlying device.
+		 * pages from being scrapped.
  		 */
  		bp->b_ioflags &= ~BIO_ERROR;
  		bdirty(bp);
  	} else if ((bp->b_flags & (B_NOCACHE | B_INVAL)) ||
-	    (bp->b_ioflags & BIO_ERROR) ||
-	    bp->b_iocmd == BIO_DELETE || (bp->b_bufsize <= 0)) {
+	    bp->b_ioflags & BIO_ERROR ||
+	    bp->b_iocmd == BIO_DELETE || bp->b_bufsize <= 0) {
  		/*
  		 * Either a failed I/O or we were asked to free or not
%%%

The main point here is to only redirty if the error is EIO.  Other changes:
- also redirty if the error is 0 (but BIO_ERROR is set).  This case
   shouldn't happen, but if it does then there is no way to determine the
   error type, so play safe and retry.
- fix some style bugs.
- remove the comments about B_INVAL handling.  This is probably wrong.
   However, at least with the check of b_error, the special handling of
   B_INVAL is probably unnecessary -- the code that set B_INVAL can just
   set b_error to something like EINVAL to avoid the redirtying, and this
   might already happen.  Lower layers could also avoid the redirtying
   by setting B_INVAL, but I think they mostly aren't at a level that
   can know when to do this.  I think B_INVAL is set mainly by nfs, and
   nfs can know when to do this better than most places because all its
   layers are combined.

This patch at least used to help in at least one case, where an error
is returned for writes beyond EOF in the case of non-virtual disks.
Old versions of FreeBSD return the bogus errno EINVAL for reads and
writes strictly beyond the end of the disk, so the above helps.  The
current version of FreeBSD returns the even more bogus errno EIO for
reads and writes strictly beyond the end of the disk, so the above
won't help (for non-virtual disks).

Old and current versions of FreeBSD return the bogus non-error of 0
for both reads writes exactly at the end of the disk.  write(2) cannot
return 0, but does due to this error.  dd treats this non-error as EOF
and doesn't retry.

The correct behaviour is to return 0 (EOF/no error) for reads at or
beyond the end of the disk, and ENOSPC for writes at or beyond the end
of the disk, the same as happens for regular files (non-extendible
ones for the case of writes).

For virtual disks, there is some chance of getting the correct error
(ENOSPC) for writes beyond the end and for this error to be passed
up to the vfs layer, so the above would help.

I once debugged error handling near this for vnode-backed md disks:
o a bug in ffs (operating on a file system on the md disk) resulted
   in garbage block numbers (for the backing vnode)
o bounds checking in geom was apparently broken (probably due to an
   overflow bug in geom or md), so the I/O request got as far as ffs
o ffs (operating on the backing vnode) detected the garbage and
   returned EFBIG
o without the above patch, writing of blocks with garbage block numbers
   was retried endlessly.

The ffs bug is fixed now, and if the bounds checking in geom is fixed
too then the I/O request won't get as far as ffs -- g_io_check() will
return the bogus errno of EIO, and the above patch won't help.

Bruce