UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY

Sat Sep 27 06:44:19 UTC 2008

On Fri, Sep 26, 2008 at 10:35:57PM -0700, Derek Kuliński wrote:
> Hello Jeremy,
> 
> Friday, September 26, 2008, 10:14:13 PM, you wrote:
> 
> >> Actually what's the advantage of having fsck run in background if it
> >> isn't capable of fixing things?
> >> Isn't it more dangerous to have it like that? i.e. the administrator
> >> might not notice the problem; also the filesystem could break even
> >> further...
> 
> > This question should really be directed at a set of different folks,
> > e.g. actual developers of said stuff (UFS2 and soft updates in
> > specific), because it's opening up a can of worms.
> 
> > I believe it has to do with the fact that there is much faith given to
> > UFS2 soft updates -- the ability to background fsck allows the user to
> > boot their system and have it up and working (able to log in, etc.) in a
> > much shorter amount of time[1].  It makes the assumption that "everything
> > will work just fine", which is faulty.
> 
> As far as I know (at least ideally, when write caching is disabled)

Re: write caching: wheelies and burn-outs in empty parking lots
detected.

Let's be realistic.  We're talking about ATA and SATA hard disks, hooked
up to on-board controllers -- these are the majority of users.  Those
with ATA/SATA RAID controllers (not on-board RAID either; most/all of
those do not let you disable drive write caching) *might* have a RAID
BIOS menu item for disabling said feature.

FreeBSD atacontrol does not let you toggle such features (although "cap"
will show you whether the feature is available and whether it's enabled).

Users using SCSI will most definitely have the ability to disable
said feature (either via SCSI BIOS or via camcontrol).  But the majority
of users are not using SCSI disks, because the majority of users are not
going to spend hundreds of dollars on a controller followed by hundreds
of dollars for a small (~74GB) disk.

Regardless of all of this, end-users should, in no way, shape, or form,
be expected to go to great lengths to disable their disk's write cache.
They will not, I can assure you.  Thus, we must assume: write caching
on a disk will be enabled, period.  If a filesystem is engineered with
that fact ignored, then the filesystem is either 1) worthless, or 2)
serves a very niche purpose and should not be the default filesystem.

Do we agree?

> the data should always be consistent, and all fsck is supposed to be
> doing is to free unreferenced blocks that were allocated.

fsck does a heck of a lot more than that, and there's no guarantee
that's all fsck is going to do on a UFS2+SU filesystem.  I'm under the
impression it does a lot more than just looking for unref'd blocks.

> Wouldn't it be possible for background fsck to do that while the
> filesystem is mounted, and if there's some unrepairable error that
> somehow happens (while in theory it should be impossible), just
> periodically scream on the emergency log level?

The system is already up and the filesystems mounted.  If the error in
question is of such severity that it would impact a user's ability to
reliably use the filesystem, how do you expect constant screaming on
the console to help?  A user won't know what it means; there is
already evidence of this happening (re: mysterious ATA DMA errors which
still cannot be figured out[6]).

IMHO, a dirty filesystem should not be mounted until it's been fully
analysed/scanned by fsck.  So again, people are putting faith into
UFS2+SU despite actual evidence proving that it doesn't handle all
scenarios.

> > It also gives the impression of a journalled filesystem, which UFS2 soft
> > updates are not.  gjournal(8) on the other hand, is, and doesn't require
> > fsck at all[2].
> 
> > I also think this further adds fuel to the "so why are we enabling soft
> > updates by default and using UFS2 as a filesystem again?" fire.  I'm
> > sure someone will respond to this with "So use ZFS and shut up".  *sigh*
> 
> I think the reason for using Soft Updates by default is that it was
> a pretty hard thing to implement, and (at least in theory) it is
> supposed to be as reliable as journaling.

The problem here is that when it was created, it was sort of an
"experiment".  Now, when someone installs FreeBSD, UFS2 is the default
filesystem used, and SU are enabled on every filesystem except the root
fs.  Thus, we have now put ourselves into a situation where said
feature ***must*** be reliable in all cases.

You're also forgetting a huge focus of SU -- snapshots[1].  However, there
are more than enough facts on the table at this point to conclude that
snapshots are causing more problems[7] than previously expected.  And
there's further evidence filesystem snapshots shouldn't even be used in
this way[8].

> Also, if I remember correctly, PJD said that gjournal is performing
> much better with small files, while softupdates is faster with big
> ones.

Okay, so now we want to talk about benchmarks.  The benchmarks you're
talking about are in two places[2][3].

The benchmarks pjd@ provided were very basic/simple, which I feel is
good, because the tests were realistic (common tasks people will do).
The benchmarks mckusick@ provided for UFS2+SU were based on SCSI
disks, which is... interesting to say the least.

Bruce Evans responded with some more data[4].

I particularly enjoy this quote in his benchmark: "I never found the
exact cause of the slower readback ...", followed by (plausible)
speculations as to why that is.

I'm sorry that I sound like such a hard-ass on this matter, but there is
a glaring fact that people seem to be overlooking intentionally:

Filesystems have to be reliable; data integrity is focus #1, and cannot
be sacrificed.  Users and administrators *expect* a filesystem to be
reliable.  No one is going to keep using a filesystem if it has
disadvantages which can result in data loss or "waste of administrative
time" (which I believe is what's occurring here).

Users *will* switch to another operating system that has filesystems
which were not engineered/invented with these features in mind.  Or,
they can switch to another filesystem assuming the OS offers one which
performs equally well and is guaranteed to be reliable --
and that's assuming the user wants to spend the time to reformat and
reinstall just to get that.

In the case of "bit rot" (e.g. drive cache going bad silently, bad
cables, or other forms of low-level data corruption), a filesystem is
likely not to be able to cope with this (but see below).

A common rebuttal here would be: "so use UFS2 without soft updates".
Excellent advice!  I might consider it myself!  But the problem is that
we cannot expect users to do that.  Why?  Because the defaults chosen
during sysinstall are to use SU for all filesystems except root.  If SU
is not reliable (or is "reliable in most cases" -- same thing if you ask
me), then it should not be enabled by default.  I think we (FreeBSD)
might have been a bit hasty in deciding to choose that as a default.

Next: a system locking up (or a kernel panic) should result in a dirty
filesystem.  That filesystem should be *fully recoverable* from that
kind of error, with no risk of data loss (but see below).

(There is the obvious case where a file is written to the disk, and the
disk has not completed writing the data from its internal cache to the
disk itself (re: write caching); if power is lost, the disk may not have
finished writing the cache to disk.  In this case, the file is going to
be incomplete -- there is absolutely nothing that can be done about this
with any filesystem, including ZFS (to my knowledge).  This situation
is acceptable; nature of the beast.)
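
To make the write-cache case concrete, here is a toy Python model.  It
is purely illustrative: the CachedDisk class and its methods are made up
for this example, and a real drive obviously does not keep its cache in
a Python dict.  The only point is that the drive acknowledges a write
once the data is in its cache, so a power cut before the flush loses
data the OS already believes is on disk.

```python
class CachedDisk:
    """Toy model of a drive with a volatile write-back cache (hypothetical)."""

    def __init__(self):
        self.platter = {}  # block number -> data that reached stable storage
        self.cache = {}    # block number -> data acknowledged but not yet flushed

    def write(self, block, data):
        # The drive reports success as soon as the data is in its cache.
        self.cache[block] = data
        return True

    def flush(self):
        # Commit everything in the cache to stable storage.
        self.platter.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # The cache is volatile: whatever was not flushed is gone.
        self.cache.clear()

    def read(self, block):
        # Serve from the cache first, then from stable storage.
        if block in self.cache:
            return self.cache[block]
        return self.platter.get(block)


disk = CachedDisk()
disk.write(0, b"metadata")
disk.flush()                 # block 0 is now safe
disk.write(1, b"file data")  # acknowledged, but only in the cache
disk.power_loss()            # power is cut before the drive flushes

print(disk.read(0))  # block 0 survived
print(disk.read(1))  # block 1 was acknowledged and is still lost
```

No filesystem sitting on top of write() can tell the two writes apart,
which is why I call this case "nature of the beast".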

The filesystem should be fully analysed and any errors repaired (either
with user interaction or automatically -- I'm sure it depends on the
kind of error) **before** the filesystem is mounted.

This is where SU gets in the way.  The filesystem is mounted and the
system is brought up + online 60 seconds before the fsck starts.  The
assumption made is that the errors in question will be fully recoverable
by an automatic fsck, which as this thread proves, is not always the
case.
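
The difference in ordering can be sketched in a few lines of Python.
This is a toy model under loud assumptions: the Filesystem class, the
auto_repairable flag, and both boot functions are hypothetical names
invented for the example, and nothing here reflects how the rc scripts
or fsck are actually implemented.  It only shows which state you end up
in when the damage is something an automatic fsck cannot fix.

```python
from dataclasses import dataclass


@dataclass
class Filesystem:
    """Hypothetical stand-in for a filesystem at boot time."""
    name: str
    dirty: bool
    auto_repairable: bool  # assumption: can an unattended fsck fix the damage?
    mounted: bool = False


def boot_foreground(fs):
    """Check first; mount only once the filesystem is known to be clean."""
    events = []
    if fs.dirty:
        events.append("fsck")
        if not fs.auto_repairable:
            events.append("halt: run fsck manually")
            return events  # never mounted in a damaged state
        fs.dirty = False
    fs.mounted = True
    events.append("mount")
    return events


def boot_background(fs):
    """Mount first; check later while the system is already in use."""
    events = []
    fs.mounted = True
    events.append("mount")
    if fs.dirty:
        events.append("background fsck")
        if fs.auto_repairable:
            fs.dirty = False
        else:
            events.append("UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY")
    return events


fg = Filesystem("/usr", dirty=True, auto_repairable=False)
bg = Filesystem("/usr", dirty=True, auto_repairable=False)
fg_events = boot_foreground(fg)
bg_events = boot_background(bg)
print(fg_events, fg.mounted)
print(bg_events, bg.mounted, bg.dirty)
```

In the foreground case the damaged filesystem is never mounted; in the
background case it is mounted, in use, and still dirty by the time the
problem is reported.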

ZFS is the first filesystem, to my knowledge, which 1) is reliable,
2) detects filesystem problems in real-time or during scrubbing,
3) repairs problems in real-time (assuming raidz1 or raidz2 is used),
and 4) does not need fsck.  This makes ZFS powerful.
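
The detect-and-repair idea can be illustrated with a small Python toy.
This is not ZFS code and makes no claim about its on-disk format: the
MirroredBlock class is hypothetical, SHA-256 is just a convenient
checksum for the example, and a two-way mirror stands in for raidz
parity to keep it short.  The point is only that a checksum kept apart
from the data lets a read notice a bad copy and rewrite it from a good
one.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum used by the toy model (SHA-256 chosen for convenience)."""
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """Hypothetical block stored as two copies plus one expected checksum."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)

    def read(self):
        """Return (data, number_of_copies_repaired); raise if no copy is good."""
        good = None
        for copy in self.copies:
            if checksum(bytes(copy)) == self.expected:
                good = bytes(copy)
                break
        if good is None:
            raise IOError("all copies failed their checksum; unrecoverable")

        repaired = 0
        for i, copy in enumerate(self.copies):
            if checksum(bytes(copy)) != self.expected:
                self.copies[i] = bytearray(good)  # heal the bad copy in place
                repaired += 1
        return good, repaired


block = MirroredBlock(b"important data")
block.copies[0][0] ^= 0xFF       # simulate silent corruption of one copy
data, repaired = block.read()    # the read detects and repairs it
print(data, repaired)
```

Without the second copy the corruption would still be detected, but the
read would fail instead of healing, which is why the redundancy matters
for point 3.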

"So use ZFS!"  A good piece of advice -- however, I've already had
reports from users that they will not consider ZFS for FreeBSD at this
time.  Why?  Because ZFS on FreeBSD can panic the system easily due to
kmem exhaustion.  Proper tuning can alleviate this problem, but users do
not want to have to "tune" their system to get stability (and I feel
this is a very legitimate argument).

Additionally, FreeBSD doesn't offer ZFS as a filesystem during
installation.  PC-BSD does, AFAIK.  So on FreeBSD, you have to go
through a bunch of rigmarole[5] to get it to work (and doing this
after-the-fact is a real pain in the rear -- believe me, I did it this
weekend.)

So until both of these ZFS-oriented issues can be dealt with, some
users aren't considering it.

This is the reality of the situation.  I don't think what users and
administrators want is unreasonable; they may be rough demands, but
that's how things are in this day and age.

Have I provided enough evidence?  :-)

[1]: http://www.usenix.org/publications/library/proceedings/bsdcon02/mckusick/mckusick_html/index.html
[2]: http://lists.freebsd.org/pipermail/freebsd-current/2006-June/064043.html
[3]: http://www.usenix.org/publications/library/proceedings/usenix2000/general/full_papers/seltzer/seltzer_html/index.html
[4]: http://lists.freebsd.org/pipermail/freebsd-current/2006-June/064166.html
[5]: http://wiki.freebsd.org/JeremyChadwick/FreeBSD_7.x_on_a_ZFS_pool
[6]: http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
[7]: http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
[8]: http://lists.freebsd.org/pipermail/freebsd-stable/2007-January/032070.html

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |