skipping fsck with soft-updates enabled

Scott Oertel freebsd at scottevil.com
Thu Jan 11 18:25:18 UTC 2007


Oliver Fromme wrote:
> Scott Oertel wrote:
>  > I'll probably do some testing on the effects of delaying fsck for long 
>  > periods of time using soft-updates. Personally, I haven't found anyone 
>  > stating any hard facts that would lead me to believe that running on a 
>  > dirty filesystem for an extended period of time won't cause further 
>  > inconsistencies.
>
> s/further//
>
> If soft-updates is running correctly on the drive, then
> there are _no_ inconsistencies on the file systems after
> a crash.  And there's no reason why any inconsistencies
> should appear later on.
>   
I don't believe you can guarantee that there are "_no_ inconsistencies" 
100% of the time.
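(As a sanity check before relying on that property, it's at least easy 
to confirm that soft-updates is actually enabled on a given filesystem; 
the device name below is just an example:)

    # "soft-updates" shows up in the mount options when it is active
    mount -t ufs

    # tunefs -p prints the current superblock settings, including
    # whether soft updates are enabled
    tunefs -p /dev/ad0s1a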
> The only thing that's "dirty" about the file systems is
> unused space that's still marked as used, which means
> that it is not available to new allocations when writing
> data.  That's not harmful (unless you run out of disk
> space, of course).
>
> The only thing that fsck will do in that situation -- no
> matter whether regular fsck or background fsck -- is to
> find those unused areas and mark them as free.  It does
> not matter whether that's done immediately after the
> reboot, or a few hours later, or a month later, or even
> never at all.  The only drawback is that some disk space
> is unavailable for new allocations until fsck cleans it
> up.
>
> All of the above is theory, and it _should_ work exactly
> like that.  In practice, every non-trivial piece of code
> contains bugs.  In practice, disk drives don't always
> behave as the driver expects:  because of misconfiguration
> (e.g. enabling write-cache on drives without support for
> tagged command queueing), or because of bugs in the
> firmware, misunderstanding of the specs, or even intentional
> deviations from the spec by the vendor (which isn't all
> that unusual, unfortunately).
>   
"Theory", this is what makes me a little nervous about putting into 
daily practice on production machines.
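(On the misconfiguration point: if I want to rule out the write-cache 
issue Oliver describes while testing, my understanding is that the ATA 
write cache can be disabled with a loader tunable, at some cost in 
write performance; the SCSI example device name is hypothetical:)

    # /boot/loader.conf -- turn off the ATA write cache at boot
    hw.ata.wc="0"

    # on SCSI/CAM disks, mode page 8 (the caching page) shows whether
    # the write cache (WCE) is enabled
    camcontrol modepage da0 -m 8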
> Furthermore, if the crash is caused by hardware failure
> (e.g. power outage, pulling the plug, kicking the hard
> drive, disk head crash etc.), then _no_ piece of software
> can guarantee anything about the state of the filesystem.
> A full, regular fsck (non-background) is required in such
> cases, and even then there is no guarantee that you don't
> have corrupted files.  The problem is that the code doesn't
> seem to be able to detect such cases reliably.  Another
> cause of trouble is when the background fsck is interrupted
> in the middle by another crash.  In my experience that's
> almost always guaranteed to cause serious corruption.
>
>   
^^ This is why you can't guarantee 100% consistency, correct?
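(That's how I read it too. For the boxes I'm worried about, the 
conservative route seems to be skipping background fsck entirely so the 
full foreground check always runs at boot; something like the 
following, with the device name being an example:)

    # /etc/rc.conf -- always run the full foreground fsck at boot
    background_fsck="NO"

    # or force a full non-interactive check by hand, with the
    # filesystem unmounted (e.g. from single-user mode)
    fsck -f -y /dev/ad0s1a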

>  > Which is what I was hoping to get out of this post; maybe someone will 
>  > read it down the line and provide some real facts about why it is or is 
>  > not dangerous to delay fsck's for an extended period of time.
>
> Well, above I provided some real facts and explained
> some potential risks.  But it's up to you to decide
> whether it could be dangerous in your situation or not.
>
> Personally, it's my impression that pjd's new gjournal
> code -- even though it's still considered experimental --
> seems to be more reliable than background fsck.  It
> costs a bit of I/O performance, but if you put
> the journal on a dedicated disk, it's not that bad.
>
> Best regards
>    Oliver
>
>   
Thank you for the long and in-depth review of my post; the information 
you provided is very useful. I've been hearing all sorts of good things 
about gjournal, so I'm going to put it into a testing environment and 
stress test it a bit. If it proves reliable I might deploy it on the 
machines that have been failing frequently, but probably not for a while.
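(For the record, the rough test setup I have in mind, pieced together 
from the gjournal examples; device names are placeholders, and given 
its experimental status this is strictly for scratch data:)

    # load the journaling GEOM class, or set geom_journal_load="YES"
    # in /boot/loader.conf to load it at boot
    kldload geom_journal

    # label the data provider, putting the journal on a dedicated
    # disk as Oliver suggests (the second provider holds the journal)
    gjournal label da0s1d da1s1d

    # create a UFS filesystem with journaling support on top of it
    newfs -J /dev/da0s1d.journal

    # with gjournal guaranteeing consistency, the filesystem can be
    # mounted async for performance
    mount -o async /dev/da0s1d.journal /mnt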


--Scott


