add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Poul-Henning Kamp phk at
Sat Nov 25 11:37:35 UTC 2017

In message <DC23D104-F5F3-4844-8638-4644DC9DD411 at>, Scott Long writes:

> Why is overloading EIO so bad?  brelse() will call bdirty() when a BIO_WRITE
> command has failed with EIO.  Calling bdirty() has the effect of retrying the I/O.
> This disregards the fact that disk drivers only return EIO when they’ve decided
> that the I/O cannot be retried.  It has no termination condition for the retries, and
> will endlessly retry I/O in vain; I’ve seen this quite frequently.

The really annoying thing about this particular class of errors
is that, if we propagated them up to the filesystems, things could
very often be relocated to different blocks, and we would avoid the
unnecessary filesystem corruption.

The real fundamental deficiency is that we do not have a way to say "give up
if this bio cannot be completed in X time", which is what people actually want.

That is surprisingly hard to provide.  There are far too many
corner cases for me to enumerate them all, but let me just give one:

Imagine you issue a deadlined write to a RAID5 thing.  Three component
writes happen smoothly, but the last two miss the deadline, with
no way to predict how long it will take before they complete
or fail.

* Does the bio write transaction fail ?

* Does the bio write transaction time out ?

* Do you attempt to complete the write to the RAID5 ?

* Where do you store a copy of the data if you do ?

* What happens next time a read happens on this bio's extent ?

Then for an encore, imagine it was a read bio: Three DMAs go smoothly,
two are outstanding and you don't know if/when they will complete/fail.

* If you fail or time out the bio, how do you "taint" the space
  being read into while the two remaining DMAs are still outstanding?

* What if that space is mapped into userland ?

* What if that space is being executed ?

* What if one of the two outstanding DMAs later returns garbage ?

My conclusion back when I did GEOM was that the only way to
do something like this sanely is to have a special GEOM do it
for you, which always allocates a temp-space:

	allocate temp buffer
	if (write)
		copy write data to temp buffer
	issue bio downwards on temp buffer
	if (timeout)
		park temp buffer until biodone
	if (read)
		copy temp buffer to read space
	return (ok/error)
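
The steps above can be sketched as a small userland simulation in C.
This is only an illustration under stated assumptions, not GEOM code:
every name in it (deadline_io, reap_parked, lower_io_fn, and so on) is
hypothetical, and the lower layer is modelled as a plain function that
returns false when it misses the deadline.  In a real kernel the late
completion would arrive asynchronously; here reap_parked() stands in
for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

enum xfer_dir { XFER_READ, XFER_WRITE };
enum xfer_status { XFER_OK, XFER_TIMEOUT, XFER_NOMEM };

/* A temp buffer the lower layer may still touch after a timeout. */
struct parked_buf {
	void *buf;
	struct parked_buf *next;
};

static struct parked_buf *parked_list;

/*
 * Hypothetical lower layer: performs the transfer on buf and returns
 * true if it finished before the deadline, false if it did not.
 */
typedef bool lower_io_fn(enum xfer_dir dir, void *buf, size_t len);

static enum xfer_status
deadline_io(enum xfer_dir dir, void *caller_buf, size_t len,
    lower_io_fn *lower)
{
	void *tmp;

	assert(caller_buf != NULL && len > 0);

	/* allocate temp buffer */
	tmp = malloc(len);
	if (tmp == NULL)
		return (XFER_NOMEM);

	/* if (write) copy write data to temp buffer */
	if (dir == XFER_WRITE)
		memcpy(tmp, caller_buf, len);

	/* issue the request downwards on the temp buffer only */
	if (!lower(dir, tmp, len)) {
		/*
		 * Timeout: park the temp buffer instead of freeing it.
		 * If the list node cannot be allocated, the buffer is
		 * deliberately leaked rather than freed while in use.
		 */
		struct parked_buf *p = malloc(sizeof(*p));
		if (p != NULL) {
			p->buf = tmp;
			p->next = parked_list;
			parked_list = p;
		}
		return (XFER_TIMEOUT);
	}

	/* if (read) copy temp buffer to read space */
	if (dir == XFER_READ)
		memcpy(caller_buf, tmp, len);
	free(tmp);
	return (XFER_OK);
}

/* Stand-in for the late biodone: release every parked buffer. */
static size_t
reap_parked(void)
{
	size_t n = 0;

	while (parked_list != NULL) {
		struct parked_buf *p = parked_list;
		parked_list = p->next;
		free(p->buf);
		free(p);
		n++;
	}
	return (n);
}

/* Two toy lower layers for experimenting with the sketch. */
static bool
lower_fast(enum xfer_dir dir, void *buf, size_t len)
{
	if (dir == XFER_READ)
		memset(buf, 'D', len);	/* pretend the disk returned data */
	return (true);
}

static bool
lower_slow(enum xfer_dir dir, void *buf, size_t len)
{
	(void)dir; (void)buf; (void)len;
	return (false);			/* always misses the deadline */
}
```

The point of the extra copy is that the caller's memory is never the
target of an outstanding transfer: calling deadline_io() with
lower_slow leaves the caller's buffer untouched and returns
XFER_TIMEOUT, so the "taint" questions above only ever apply to the
parked temp buffer.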

Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

More information about the freebsd-fs mailing list