7.2 filesystem corruption

Fri May 21 08:04:16 UTC 2010

Hello all,

I'm not sure where to go with this post; I've previously tried -fs and 
-scsi while trying to track down some panics in the softdep code.  Perhaps 
the more general audience here can shove me in the right direction.

I have a box (Dell PE 2970) running FreeBSD 7.2/amd64.  6 GB of ECC RAM, 
and a Dell-branded LSI RAID controller (mpt driver).  It's a mail server 
with the active mail server running in a jail and a test version of same 
running in another jail (qmail/vpopmail/courier on old, 
postfix/pfadmin/dovecot on new).

It passed a few weeks of heavy stress testing where I was putting much 
more load on it using an imap/pop/smtp test suite before going into 
production with only one panic (which happened during a fairly intense 
mstone run) - I figured I was somewhat on the bleeding edge with 7.x 
64-bit at that time, so I was not overly concerned since I've run into 
softdep panics before.  Since then however, there have been a few panics 
in "ufsdirhash_lookup".

When this happens, the box reboots, does a background fsck and does not 
complain about anything.  I decided background fsck was probably not a 
good idea, so I disabled it and manually fsck'd on all subsequent panics. 
The pattern is similar to this example:

** /dev/mfid0s1g
** Last Mounted on /spool
** Phase 1 - Check Blocks and Sizes
UNKNOWN FILE TYPE I=147718184
UNEXPECTED SOFT UPDATE INCONSISTENCY
CLEAR? yes

PARTIALLY ALLOCATED INODE I=147718185
UNEXPECTED SOFT UPDATE INCONSISTENCY

And in phase 2, lots of this:

UNALLOCATED I=152688468 OWNER=root MODE=0 SIZE=0 MTIME=Dec 31 19:00 1969 
NAME=/jails/mailbak.blah.net/home/vpopmail/domains/blah.net/A/spec/Maildir/new/1233549930.73014.blah.bway.net

UNEXPECTED SOFT UPDATE INCONSISTENCY
REMOVE? yes

And in Phase 4, lots of this:

** Phase 4 - Check Reference Counts
UNREF FILE I=147623979  OWNER=88 MODE=100600
SIZE=0 MTIME=Feb  7 00:19 2010
CLEAR? yes

In the manual runs, I tend to run through about 3 or 4 times, since even 
though the filesystem gets marked "clean", another run finds more errors. 
Once I get two clean runs in a row, I let the box boot.
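
For anyone wanting to script the "repeat until two clean runs in a row" routine described above, here is a minimal sketch in Python. It assumes that a run which changed anything prints the "FILE SYSTEM WAS MODIFIED" banner and that the filesystem is unmounted (single-user mode). The function names, the `max_runs` cap, and the use of `fsck -y` are my own illustrative choices, not something taken from the box in question.

```python
import subprocess

# Assumption: fsck prints this banner whenever it changed the filesystem.
MODIFIED_MARKER = "FILE SYSTEM WAS MODIFIED"


def run_fsck(device):
    """Run a foreground 'fsck -y' on an unmounted device and return its output."""
    result = subprocess.run(
        ["fsck", "-y", device], capture_output=True, text=True
    )
    return result.stdout + result.stderr


def fsck_until_stable(device, runner=run_fsck, needed_clean=2, max_runs=10):
    """Repeat fsck until `needed_clean` consecutive runs make no changes.

    Returns the number of runs it took, or None if the filesystem never
    settled within `max_runs` attempts (hypothetical safety cap).
    """
    clean_streak = 0
    for attempt in range(1, max_runs + 1):
        output = runner(device)
        if MODIFIED_MARKER in output:
            clean_streak = 0      # fsck fixed something, start counting again
        else:
            clean_streak += 1     # nothing was modified on this pass
        if clean_streak >= needed_clean:
            return attempt
    return None


# Example call (not executed here), using the device from the log above:
#   fsck_until_stable("/dev/mfid0s1g")
```

The `runner` argument exists only so the loop can be exercised without touching a real disk.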

Regardless of how "clean" the fs is, I have consistently seen messages 
like this in my serial console log:

g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5
g_vfs_done():mfid0s1g[READ(offset=2456998070156636160, length=16384)]error = 5
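
To put a number on how far off that offset is, here is a rough sanity check in Python. The offset is copied from the g_vfs_done() message above. The 16 TiB ceiling is purely an assumption (a generous upper bound, since the actual size of the array is not stated here), and `describe_offset` is a hypothetical helper name.

```python
# Offset copied from the g_vfs_done() message above.
REPORTED_OFFSET = 2456998070156636160

# Assumption: 16 TiB is a generous upper bound for the size of this volume.
ASSUMED_MAX_DEVICE_BYTES = 16 * 2**40


def describe_offset(offset, device_bytes=ASSUMED_MAX_DEVICE_BYTES):
    """Summarise a byte offset and flag it if it lies past the device end."""
    return {
        "offset": offset,
        "hex": hex(offset),
        "pib": offset / 2**50,                  # offset expressed in PiB
        "beyond_device": offset >= device_bytes,
    }


info = describe_offset(REPORTED_OFFSET)
print(f"offset {info['offset']} = {info['hex']}")
print(f"about {info['pib']:.0f} PiB into the device")
print("beyond assumed device size:", info["beyond_device"])
```

An offset that lands past any plausible device size suggests the kernel was handed a garbage block number (for example from a damaged inode or indirect block) rather than a real media error at that location, though that reading is a guess.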

On the last run, I also turned off soft updates for good measure.

Now I occasionally get these errors:

g_vfs_done():mfid0s1g[READ(offset=5335388948596480000, length=16384)]error = 5
bad block 838May 18 00:29:14 8bigmail kernel: 3pid 24481 (rm), 0uid 0 
inumber 1571657736 on /spoo6l: bad block
76548920427, ino 151657736

In addition, there are some files that now have bizarre flags set, such as 
"schg", "sappnd", "opaque", etc.  Some can be changed, others give a "bad 
file descriptor" error.
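
A quick way to inventory which files carry those flags is to walk the tree and decode `st_flags`. The sketch below is a minimal Python version. It assumes the usual chflags(1) short names for the constants in Python's `stat` module, and the function names and the example path are illustrative only.

```python
import os
import stat

# Flag bits paired with the short names used by chflags(1) / "ls -lo".
FLAG_NAMES = [
    (stat.SF_IMMUTABLE, "schg"),
    (stat.SF_APPEND, "sappnd"),
    (stat.SF_NOUNLINK, "sunlnk"),
    (stat.UF_IMMUTABLE, "uchg"),
    (stat.UF_APPEND, "uappnd"),
    (stat.UF_OPAQUE, "opaque"),
    (stat.UF_NODUMP, "nodump"),
]


def decode_flags(flags):
    """Return the short names of every known flag bit set in `flags`."""
    return [name for bit, name in FLAG_NAMES if flags & bit]


def flagged_files(root):
    """Yield (path, names) for each file under `root` that has flags set.

    Files that cannot be stat'ed are reported too, since on a damaged
    filesystem those failures are interesting in themselves.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError as exc:
                yield path, [f"lstat error: {exc.strerror}"]
                continue
            # st_flags only exists on BSD-like systems; default to 0 elsewhere.
            names = decode_flags(getattr(st, "st_flags", 0))
            if names:
                yield path, names


# Example call (not executed here); the path is only an illustration:
#   for path, names in flagged_files("/spool"):
#       print(path, ",".join(names))
```

Comparing the resulting list against what the mail software would ever set on its own (normally nothing) gives a rough measure of how many inodes have been scribbled on.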

I fear the fs is getting more scrambled.

I started to think that I'm probably dealing with two things - some bug in 
64-bit UFS2, plus a perpetually dirty filesystem that causes the box to 
panic, which causes more corruption, and so on.

I do have the option of trying to schedule a huge maintenance window and 
dumping the fs, newfs'ing it, and then restoring it, but it's a tough sell 
and for various reasons I can't put a ton of time into this (anyone who 
knows me, hit me up offlist for a fun story).  I'm also quite concerned 
that fsck is finding and fixing things, but the fs is still obviously not 
quite "right".  In short, how can I ensure this won't happen a week after 
a dump/restore?

So that's the story; here are my questions:

-Is there any interest in tracking down what the nature of the initial 
panic/corruption is?  I know I'm a release behind, but digging through the 
PR database, nothing stuck out as far as softdep, mpt, or dirhash bugs 
that looked similar to what I'm seeing that got fixed in 7.3.

-Where is the most likely place to look for a problem here?  The mpt 
driver?  The megacli utility and the bios utility both claim the array is 
in great shape.  The only fs that ever shows the errors with "g_vfs_done" 
and the nonsensical offsets is the partition where the jails reside.  Or 
is it a ufsdirhash thing?  I saw some interesting bug reports, but nothing 
that quite matched.  UFS2/SU itself?

-If I do dump/restore (or pull from backups), should I stick to 7.2 or go 
to 7.3 while I'm working on the box?  Or gamble on 8.0 (where, oddly 
enough, I've seen far fewer odd things of late)?

For reference, here are a few other queries regarding this issue:

http://marc.info/?l=freebsd-stable&m=125901173424554&w=2
http://old.nabble.com/7.2-p4:-panic:-ufsdirhash_lookup:-bad-offset-in-hash-array-td27715632.html

I still have some core dumps sitting here as well.

Any input would be appreciated - I do have more info available, but this 
message is already about twice as long as I'd like it to be.  Hit me up 
with any questions.

Thanks,

Charles