fsck reports errors on clean filesystem (mounted rw)

Sun Sep 12 21:18:50 UTC 2010

> From owner-freebsd-questions at freebsd.org  Fri Sep 10 17:51:18 2010
> From: cronfy <cronfy at gmail.com>
> Date: Sat, 11 Sep 2010 02:27:46 +0400
> To: freebsd-questions <freebsd-questions at freebsd.org>
> Subject: fsck reports errors on clean filesystem (mounted rw)
>
> Hello.
>
> I ran fsck on my filesystems while the system was running (partitions
> were mounted rw with moderate FS usage). fsck reported there were errors
> (INCORRECT BLOCK COUNT and others). I decided to reboot to single-user
> mode and check all filesystems. But in single-user mode fsck did not
> find any errors.
>
>  1. Can I be sure my filesystem is consistent?

yes.

>  2. If fsck reports nonexistent errors (and will probably try to fix
> them if asked), isn't it even dangerous to run fsck on a running system?

They're not nonexistent, and they're _not_ "errors".  They are
*EXPECTED* inconsistencies in the _disk-based_ copies of the file-system
meta-data, because the 'current' (memory-resident) data is *not* written
to disk at the instant the meta-data changes.

It is a 'non-issue', because the O/S 'knows what it's doing'.
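
To make the memory-versus-disk point concrete, here is a toy sketch in
Python.  It is not how UFS or fsck actually work; the class ToyFS, the
function check(), and the write ordering (block list written at once,
block count held in memory until a sync) are all assumptions made up for
illustration.  It only shows why a checker reading the disk copy of a
live filesystem can see an "INCORRECT BLOCK COUNT" that the running
system does not have.

```python
# Toy model only: a "filesystem" whose metadata lives in memory and is
# flushed to "disk" later, roughly the way a write-back cache behaves.
# All names here are hypothetical and exist only for this illustration.

class ToyFS:
    def __init__(self):
        # On-disk copy of one file's metadata.
        self.disk = {"block_count": 0, "blocks": []}
        # Memory-resident copy that the running system actually uses.
        self.mem = {"block_count": 0, "blocks": []}
        self.next_block = 0

    def append_block(self):
        """Grow the file by one block.

        Assumption for the sketch: the block list reaches disk at once,
        but the updated block count stays in memory until sync().
        """
        blk = self.next_block
        self.next_block += 1
        self.mem["blocks"].append(blk)
        self.mem["block_count"] += 1
        self.disk["blocks"].append(blk)
        # disk["block_count"] is deliberately NOT updated here.

    def sync(self):
        """Flush the memory-resident metadata to disk."""
        self.disk = {
            "block_count": self.mem["block_count"],
            "blocks": list(self.mem["blocks"]),
        }


def check(copy):
    """Return the list of problems found in one copy of the metadata."""
    problems = []
    if copy["block_count"] != len(copy["blocks"]):
        problems.append("INCORRECT BLOCK COUNT")
    return problems


if __name__ == "__main__":
    fs = ToyFS()
    fs.append_block()
    fs.append_block()
    print("disk copy while running:", check(fs.disk))
    print("memory copy while running:", check(fs.mem))
    fs.sync()
    print("disk copy after sync:", check(fs.disk))
```

In this sketch the disk copy fails the check only between a change and
the next sync, which mirrors the original poster's observation: complaints
while mounted rw, none after a clean reboot to single-user mode.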

There are exactly _four_ possible causes of file-system inconsistencies.
  1) You can have an unexpected loss of power, where the CPU stops working
     before it has time to write the above-mentioned 'memory-resident' data
     to disk.  There are sub-classes of this event, to distinguish between
     a utility company outage, somebody accidentally 'pulling the plug', be
     it literally or via the power on/off switch, and somebody hitting the
     'reset' button.  They all have the same effect: the processor can't
     get the 'current' data in memory out to the disk.
  2) You can have a catastrophic O/S failure -- a system 'crash' -- where
     the O/S has discovered an internal inconsistency.  _IT_ doesn't trust
     its own data enough to keep running, and takes the 'lesser of two
     evils' route of *not* writing "known to be suspect" data over the
     out-of-date data on the disk.
  3) 'Bit rot' on the physical media itself, where what gets read back is
     *not* what was written there earlier.  Modern disk drives detect this
     inside the controller and use embedded ECC info to give the 'right'
     data back, while alerting that the problem exists.
  4) "Hardware failures" of any of a variety of sorts -- flaky power supply,
     bad RAM, failing controller chips, etc.

Cause 1) can be virtually eliminated by 'good practices' and the use of a
UPS with controlled automatic shutdown.
Cause 3) you can 'stay ahead of' by monitoring system logs for 'corrected'
errors on magnetic media.
Causes 2) and 4) you can't do much about.
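
As a rough sketch of the log monitoring mentioned for cause 3), the
Python below filters log lines for the word "corrected".  The function
name corrected_error_lines() and the sample lines are hypothetical; real
kernel and drive messages differ by driver and hardware, so the pattern
is only a starting assumption and would need adjusting to what your
system actually logs.

```python
import re


def corrected_error_lines(lines, pattern=r"\bcorrected\b"):
    """Return the log lines that mention a corrected error.

    The default pattern is an assumption, not a real message format.
    The word boundaries keep it from matching "uncorrected".
    """
    rx = re.compile(pattern, re.IGNORECASE)
    return [ln for ln in lines if rx.search(ln)]


if __name__ == "__main__":
    # Made-up sample lines, NOT real FreeBSD log output.
    sample = [
        "HYPOTHETICAL disk0: soft read error corrected",
        "HYPOTHETICAL disk0: uncorrected read error",
        "HYPOTHETICAL login: user logged in",
    ]
    for line in corrected_error_lines(sample):
        print(line)
```

A steadily growing number of matches for one drive is the sort of trend
worth acting on before the drive stops being able to correct its errors.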

With the exception of cause 3), -everything- leads to a system crash, which
results in the 'preserved' data being inconsistent.  The 'good news' is
that you *know* it happened, and can run the fixit software (fsck) before
letting users back on.