UFS not handling errors correctly

Sun Sep 9 15:11:46 PDT 2007

> Soft updates isn't journalling, so you can't "roll back" an error.  It 
> works by maintaining knowledge of the on-disk state of data and ensuring  
> that it only writes to disk in a suitable order so that the on-disk state 
> is supposed to remain consistent.

I am aware of this; I was speaking generally, with the least
"committal" solution being to just panic. The point I was trying to
make was that as long as errors are traditional and simple, such as
being unable to read a particular sector or a write to a sector
failing, aborting all operations should not lead to corruption, since
that is exactly what the filesystem has been designed to prevent.
Aborting essentially panics the machine from the perspective of the
on-disk filesystem, even if the system itself does not panic, such as
when the filesystem is unmounted instead.

> Unfortunately there are many ways in which this can fail, mostly involving 
> external factors violating the assumptions upon which soft updates relies.  
> For example, the data written on disk may not correspond to the data 
> dispatched by soft updates, due to things like write caching in the 
> hardware, write reordering, data corruption, unpredictable disk behaviour 
> during power loss, hardware failure, etc.

I am aware of this too (and paranoid about it).

> Similarly, background fsck assumes that the only filesystem errors it will 
> encounter are those permitted by the soft updates model, which are 
> "benign", i.e. non-fatal and correctable at runtime.  When the state of 
> your disk departs from the realm of these assumptions, bg fsck may not be 
> able to repair the damage.

My thinking was that in simple cases (say you put UFS on a GEOM
provider that simulates failure, or the disk has a transient write
failure on some particular sector), unmounting the filesystem (or
remounting it read-only) would lead to a filesystem with only expected
(and designed-for) inconsistencies, assuming of course that there are
no other issues going on, such as random corruption on the drive or in
the I/O path.
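
To make the simple case concrete, here is a toy sketch in Python. It
is not UFS or GEOM code; FlakyDisk, create_file and the sector layout
are all hypothetical. It assumes a single ordering rule (the inode
must reach disk before the directory entry that points to it) and
that the filesystem stops issuing writes on the first error. Under
those assumptions a failed write can leave an unreferenced inode,
which is the kind of inconsistency fsck is expected to handle, but
never a directory entry pointing at an uninitialized inode.

```python
class FlakyDisk:
    """Toy block device: writes to sectors in bad_sectors always fail."""

    def __init__(self, bad_sectors=()):
        self.sectors = {}                    # sector number -> data that reached "disk"
        self.bad_sectors = set(bad_sectors)
        self.read_only = False

    def write(self, sector, data):
        if self.read_only:
            raise IOError("device is read-only")
        if sector in self.bad_sectors:
            raise IOError(f"write to sector {sector} failed")
        self.sectors[sector] = data


def create_file(disk, inode_sector, dirent_sector, name):
    """Write the inode first, then the directory entry that depends on it.

    On the first I/O error, stop issuing writes and flip the toy device to
    read-only, mimicking "remount read-only on error".
    """
    try:
        disk.write(inode_sector, ("inode", name))
        disk.write(dirent_sector, ("dirent", name, inode_sector))
        return True
    except IOError:
        disk.read_only = True
        return False


def dangling_entries(disk):
    """Directory entries whose target sector holds no initialized inode."""
    return [d for d in disk.sectors.values()
            if d[0] == "dirent"
            and disk.sectors.get(d[2], (None,))[0] != "inode"]


def orphaned_inodes(disk):
    """Sectors holding an inode that no directory entry references."""
    referenced = {d[2] for d in disk.sectors.values() if d[0] == "dirent"}
    return [s for s, d in disk.sectors.items()
            if d[0] == "inode" and s not in referenced]
```

With FlakyDisk(bad_sectors={11}), calling create_file(disk, 10, 11,
"a") fails on the directory entry write, leaving sector 10 as an
orphaned inode and no dangling entries.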

In any case, I was not really looking to get into a debate. I only
commented because I read the original post as describing a potential
bug in UFS, rather than a lack of understanding that fsck cannot fix
arbitrary errors. As with most such bug reports coming from a
real-life situation, one can never prove that there was no random
corruption along the I/O path or elsewhere.

I know from personal experience, and understand from previous ML
traffic that it is a known issue, that the I/O failure handling in UFS
is not rock solid in terms of system stability. Taking that a bit
further to actual corruption did not seem like a huge leap (e.g.,
perhaps continuing with a dependent write even though the previous
write failed, which does not seem unthinkable to me, though I am not
familiar with the code).
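
A toy illustration of the kind of bug I have in mind, in Python. This
is purely hypothetical and not taken from the UFS code; the function
names and the two-sector layout are made up for the example. The
"buggy" variant ignores the result of the first write and issues the
dependent write anyway, while the "checked" variant stops at the
first failure.

```python
def write(disk, bad_sectors, sector, data):
    """Return True if the write reached the toy disk, False on I/O error."""
    if sector in bad_sectors:
        return False
    disk[sector] = data
    return True


def create_file_buggy(disk, bad_sectors, inode_sector, dirent_sector):
    # Hypothetical bug: the result of the inode write is ignored, so the
    # dependent directory-entry write is issued even if the inode never
    # reached the disk.
    write(disk, bad_sectors, inode_sector, ("inode",))
    write(disk, bad_sectors, dirent_sector, ("dirent", inode_sector))


def create_file_checked(disk, bad_sectors, inode_sector, dirent_sector):
    # The dependent write is issued only if the write it depends on succeeded.
    if not write(disk, bad_sectors, inode_sector, ("inode",)):
        return False
    return write(disk, bad_sectors, dirent_sector, ("dirent", inode_sector))


def dangling(disk):
    """Sectors holding a directory entry whose target is not an inode."""
    return [s for s, d in disk.items()
            if d[0] == "dirent" and disk.get(d[1], (None,))[0] != "inode"]
```

If the inode write to sector 10 fails, the buggy variant leaves a
directory entry in sector 11 pointing at nothing, which is outside
the set of inconsistencies the ordering is meant to allow. The
checked variant leaves the toy disk empty.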

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
