kern/93942: panic: ufs_dirbad: bad dir

Wed Mar 1 12:20:14 PST 2006

The following reply was made to PR kern/93942; it has been noted by GNATS.

From: "David Rhodus" <drhodus at machdep.com>
To: Yarema <yds at coolrat.org>
Cc: FreeBSD-gnats-submit at freebsd.org, FreeBSD-current at freebsd.org, 
	"Kris Kennaway" <kris at obsecurity.org>, 
	"Dennis Koegel" <amf at hobbit.neveragain.de>, 
	"Doug White" <dwhite at gumbysoft.com>, "Martin Machacek" <m at m3a.net>, 
	"David O'Brien" <obrien at freebsd.org>, 
	"Scott Long" <scottl at samsco.org>, 
	"Pawel Jakub Dawidek" <pjd at freebsd.org>
Subject: Re: kern/93942: panic: ufs_dirbad: bad dir
Date: Wed, 1 Mar 2006 15:10:38 -0500

 On 2/28/06, Yarema <yds at coolrat.org> wrote:
 >
 >
 > --On February 28, 2006 2:53:43 PM -0500 Kris Kennaway <kris at obsecurity.org>
 > wrote:
 >
 > > On Tue, Feb 28, 2006 at 10:35:36AM -0500, Yarema wrote:
 > >>
 > >> > Number:         93942
 > >> > Category:       kern
 > >> > Synopsis:       panic: ufs_dirbad: bad dir
 > >> > Confidential:   no
 > >> > Severity:       critical
 > >> > Priority:       high
 > >> > Responsible:    freebsd-bugs
 > >> > State:          open
 > >> > Quarter:
 > >> > Keywords:
 > >> > Date-Required:
 > >> > Class:          sw-bug
 > >> > Submitter-Id:   current-users
 > >> > Arrival-Date:   Tue Feb 28 15:40:06 GMT 2006
 > >> > Closed-Date:
 > >> > Last-Modified:
 > >> > Originator:     Yarema <yds at CoolRat.org>
 > >> > Release:        FreeBSD 6.1-PRERELEASE i386
 > >> > Organization:
 > >> > Environment:
 > >> System: FreeBSD 6.1-PRERELEASE #0: Mon Feb 27 04:52:11 EST 2006 i386
 > >>
 > >> > Description:
 > >>
 > >> This is at least the third file system which got hosed for me by the
 > >> ufs_dirbad bug on three different hard drives since 5.3 STABLE.
 > >> I suspect this is related to the following PRs:
 > >> http://www.FreeBSD.org/cgi/query-pr.cgi?pr=49079
 > >> http://www.FreeBSD.org/cgi/query-pr.cgi?pr=51001
 > >>
 > >> In every case a process would lock up making the whole system
 > >> unresponsive.  A reboot, fsck -y in single user mode and another
 > >> reboot would produce the following during the mount of the corrupt
 > >> fs in rw mode:
 > >>
 > >> bad dir ino 2 at  offset 16384: mangled entry
 > >> panic: ufs_dirbad: bad dir
 > >> cpuid = 0
 > >>
 > >> Another reboot, fsck -y in single user mode and reboot produces the
 > >> same results repeatedly.  Previously I had recovered by mounting the
 > >> corrupt fs in ro mode, backup, newfs, restore.
 > >>
 > >> Recently I noticed Matthew Dillon commit the following to the
 > >> DragonFly src repository:
 > >>
 > >> http://leaf.DragonFlyBSD.org/mailarchive/commits/2006-02/msg00057.html
 > >>
 > >> dillon      2006/02/21 10:46:56 PST
 > >>
 > >> DragonFly src repository
 > >>
 > >>   Modified files:
 > >>     sys/kern             vfs_cluster.c
 > >>   Log:
 > >>   bioops.io_start() was being called in a situation where the buffer
 > >>   could be brelse()'d afterwords instead of I/O being initiated.  When
 > >>   this occurs, the buffer may contain softupdates-modified data which is
 > >>   never reverted, resulting in serious filesystem corruption.  When
 > >>   io_start is called on a buffer, I/O MUST be initiated and terminated
 > >>   with a biodone() or the buffer's data may not be properly reverted.
 > >>
 > >>   Solve the problem by moving the io_start() call a little further on in
 > >>   the code, after the potential brelse().
 > >>
 > >>   There is a possibility that this bug is responsible for the 'dirbad'
 > >>   panics often reported in DragonFly and FreeBSD circles.
 > >>
 > >>   Revision  Changes    Path
 > >>   1.16      +7 -6      src/sys/kern/vfs_cluster.c
 > >>
 > >> http://www.DragonFlyBSD.org/cvsweb/src/sys/kern/vfs_cluster.c.diff?r1=1.15&r2=1.16&f=u
 > >>
 > >> Below is the equivalent patch to the FreeBSD RELENG_6 branch of
 > >> src/sys/kern/vfs_cluster.c
 > >>
 > >> Hope this helps track down the problem.
 > >
 > > Does it work for you? :)
 > >
 > > Kris
 >
 > No way for me to know yet.  From what I gathered, mostly from this thread:
 > <http://docs.FreeBSD.org/cgi/getmsg.cgi?fetch=331058+0+archive/2006/freebsd-current/20060108.freebsd-current>
 >
 > As per Matt Dillon
 > <http://docs.FreeBSD.org/cgi/getmsg.cgi?fetch=217892+0+/usr/local/www/db/text/2006/freebsd-current/20060226.freebsd-current>,
 > the corruption occurs much earlier than any consequences can be felt.
 > The patch may prevent the corruption from occurring in the first place.
 > But the patch does nothing for me now that I have a huge /home slice
 > which cannot even be mounted as read-only in single user mode without
 > triggering a page fault kernel panic in the mount process no matter
 > how many times I run fsck -f on it.
 >
 > FWIW the page fault in the mount process is a different sort of kernel
 > panic than what is described in this kern/93942 PR above.  The page fault
 > occurs while attempting to mount read-only.  Attempting to mount read-write
 > causes the panic: ufs_dirbad: bad dir
 >
 > One more note, hitting the power button when the machine is locked up
 > before the reboot and mount attempt which causes the panic produces the
 > following output every time the button is pressed:
 >
 > kernel: acpi: suspend request ignored (not ready yet)
 >
 > Seems like there are two separate problems:
 > 1) the root cause of the bad dir corruption.
 > 2) fsck -f doesn't fix it no matter how many times you run it.
 >
 > Any pointers on how to recover my /home slice will be greatly appreciated.
 >
 > --
 > Yarema

 I have been working with the bad dir problem for several months and I
 have not had corruption which fsck would not correct.

 -DR