volume management

Tue Apr 10 17:42:35 UTC 2007

On 04/10/07 12:26, Rick C. Petty wrote:
> On Tue, Apr 10, 2007 at 06:21:29PM +0200, Pawel Jakub Dawidek wrote:
>> The choice you have currently is to panic and lose the last few seconds of
>> your data, but keep the file system in a consistent state, or to return
> 
> How can you guarantee the FS is consistent at that point?  Are you looking
> through the list of blocks to be written?  Granted, with soft updates this
> is less risky, because presumably the metadata blocks haven't been written
> until the data blocks are.
> 
>> ENOSPC, which nobody is going to handle and which may in the end corrupt
>> your file system to a state that fsck won't be able to fix.
> 
> Is a file system thread waiting on the block to be written, or is the
> caller lost forever because the block sits in a write cache?  I thought the
> UFS soft updates code was blocking on the write, even though the userland
> caller had a successful return.  If so, the FS should handle the error and
> avoid inconsistencies.
> 
> I certainly see this type of behavior in gvinum when a disk is lost and a
> write to a slice cannot finish successfully.  I'm very glad the box doesn't
> panic as often because I can sometimes go in and bring the drive back up.
> 
>> This is not about a simple write operation to the disk. Those operations
>> are delayed anyway; your userland process will see that the write operation
>> succeeded. This is about kernel and file system consistency.
> 
> I'm aware of that, but what's the call stack leading up to the GEOM
> failure?  I was under the impression that UFS was blocked waiting for a
> write operation, which is all done in the kernel anyway.

I think the issue is that UFS doesn't expect to see ENOSPC from the
storage, since it believes it's on a provider that should be big enough.
Is the right thing to do to teach UFS to recognize ENOSPC and pass it on
to userland?
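
To make the userland side concrete, here is a minimal sketch in Python of
what a careful caller would have to do if ENOSPC were passed up. It assumes
the error may surface late, at fsync time, because the write itself is
delayed. The helper name write_durably and its injectable fsync parameter
are hypothetical, chosen for illustration only; nothing here reflects how
UFS or GEOM actually behave.

```python
import errno
import os


def write_durably(fd, data, fsync=os.fsync):
    """Write data and force it to stable storage.

    Returns "ok", or "enospc" if the storage reports it is out of space.
    Any other error propagates to the caller.
    """
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may accept fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        # The writes above may only have reached a cache; a delayed
        # out-of-space failure could first be reported here.
        # (fsync is injectable purely so the sketch can be exercised.)
        fsync(fd)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return "enospc"
        raise
    return "ok"
```

Most programs never check the return of fsync (or never call it), which is
the "nobody is going to handle" problem from the userland point of view.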

>> It would be
>> great to just fix everything in the kernel to handle errors properly,
>> but good luck with that.
> 
> That's a worthy goal and something we should be pursuing.  After all,
> FreeBSD used to be noted for its stability.  I wouldn't call panics a sign
> of stability.  You're better off invalidating all the geom consumers and
> leaving the rest of the system up so an admin can try to recover critical
> data, or so the remaining geom providers can continue to function.

There's been talk in the past about making the mount read-only instead 
of a panic in some situations, but I know nothing more than that.
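
As a toy illustration of that idea only, the sketch below models a mount
that flips to read-only on the first failed write instead of taking the
whole system down. ToyMount and its backend_errno parameter are
hypothetical and exist only for this example; it is not FreeBSD code, and
it sidesteps the hard part raised earlier in the thread, namely whether
the on-disk state is still consistent at the moment of the failure.

```python
import errno


class ToyMount:
    """Hypothetical model of a mount that degrades instead of panicking."""

    def __init__(self):
        self.read_only = False
        self.blocks = {}

    def write(self, blkno, data, backend_errno=0):
        # Once degraded, refuse every further write, as a read-only
        # mount would.
        if self.read_only:
            raise OSError(errno.EROFS, "read-only file system")
        if backend_errno:
            # The storage layer failed: stop accepting writes but stay up.
            self.read_only = True
            raise OSError(backend_errno, "write failed; mount is now read-only")
        self.blocks[blkno] = data

    def read(self, blkno):
        # Reads keep working so an admin can still recover data.
        return self.blocks[blkno]
```

In this model the first failing write returns the storage error, later
writes fail with EROFS, and blocks written earlier stay readable.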

Eric