7.2 - ufs2 corruption

Mon Jul 5 21:23:05 UTC 2010

Howdy,

I've posted previously about this, but I'm going to give it one more shot 
before I start reformatting and/or upgrading things.

I have a largish filesystem (1.3TB) that holds a few jails, the main one 
being a mail server.  The box runs 7.2/amd64 on a Dell 2970 with the mfi 
RAID card, 6GB RAM, and UFS2 (soft updates were enabled; I disabled them 
for testing, to no effect).

The symptoms are as follows:

Various applications will log messages about "bad file descriptors" (imap, 
rsync backup script, quota counter):

du:
./cur/1271801961.M21831P98582V0000005BI08E85975_0.foo.net,S=2824:2,S:
Bad file descriptor

The kernel also starts logging messages like this to the console:

g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5

Note that the offsets look a bit... suspicious, especially those negative 
ones.
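
A quick way to sanity-check those offsets is to compare them against the 
size of the filesystem.  The following is only a rough sketch in Python: 
the 1.3TB figure is approximated as 1.3 * 10^12 bytes (an assumption, 
substitute the real partition size), and the names suspicious_offsets and 
ASSUMED_FS_BYTES are made up for illustration.

```python
import re

# Matches console lines of the form quoted above, e.g.
# g_vfs_done():mfid0s1e[READ(offset=123, length=16384)]error = 5
LINE_RE = re.compile(
    r"g_vfs_done\(\):(?P<dev>\S+?)\[(?P<op>READ|WRITE)"
    r"\(offset=(?P<offset>-?\d+), length=(?P<length>\d+)\)\]"
    r"\s*error = (?P<errno>\d+)"
)

# Assumption: "1.3TB" taken as 1.3 * 10**12 bytes; use the real size if known.
ASSUMED_FS_BYTES = 13 * 10**11


def suspicious_offsets(log_text, fs_bytes=ASSUMED_FS_BYTES):
    """Return unique (device, offset) pairs that fall outside [0, fs_bytes)."""
    bad = set()
    for match in LINE_RE.finditer(log_text):
        offset = int(match.group("offset"))
        length = int(match.group("length"))
        # A request is impossible if it starts before byte 0
        # or ends past the end of the filesystem.
        if offset < 0 or offset + length > fs_bytes:
            bad.add((match.group("dev"), offset))
    return sorted(bad, key=lambda pair: pair[1])
```

Fed the console lines above, this would flag both distinct offsets: one is 
negative, and the other is many orders of magnitude past the end of a 
1.3TB filesystem.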

Usually, within a day or two of those "g_vfs_done()" messages showing up, 
the box will panic shortly after the daily run.  Things are hosed up 
badly enough that it is unable to save a dump.  The panic always looks 
like this:

panic: ufs_dirbad: /spool: bad dir ino 151699770 at offset 163920: mangled 
entry
cpuid = 0
Uptime: 70d22h56m48s
Physical memory: 6130 MB
Dumping 811 MB: 796 780 764 748 732 716 700 684 668 652 636 620 604 588 
572 556 540 524 508 492 476 460 444 428 412 396 380 364 348 332 316 300 
284
** DUMP FAILED (ERROR 16) **

panic: ufs_dirbad: /spool: bad dir ino 150073505 at offset 150: mangled 
entry
cpuid = 2
Uptime: 13d22h30m21s
Physical memory: 6130 MB
Dumping 816 MB: 801 785 769 753 737 721 705 689
** DUMP FAILED (ERROR 16) **
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

The fs, specifically "/spool" (which is where the errors always 
originate), will be pretty trashed and require a manual fsck.  The first 
pass finds and fixes errors, but does not mark the fs clean.  It can take 
anywhere from 2 to 4 passes to get a clean fs.

The box then runs fine for a few weeks or a few months until the 
"g_vfs_done" errors start popping up, then it's a repeat.

Are there any *known* issues with either the fs or possibly the mfi driver 
in 7.2?

My plan was to do something like this:

- shut down services and copy all of /spool off to the backups server
- newfs /spool
- copy everything back
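
Since newfs destroys the only other copy, it seems worth checking the 
backup against the original before that step.  Below is a minimal sketch 
of one way to do that in Python, assuming both trees are reachable as 
local paths; tree_digests and compare_trees are hypothetical helper names, 
not part of any existing tool.

```python
import hashlib
import os


def tree_digests(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest.

    Files that cannot be read (for example, ones that fail with
    "Bad file descriptor") are recorded with a digest of None.
    """
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # only regular files are compared
            rel = os.path.relpath(full, root)
            sha = hashlib.sha256()
            try:
                with open(full, "rb") as handle:
                    for chunk in iter(lambda: handle.read(1 << 20), b""):
                        sha.update(chunk)
                digests[rel] = sha.hexdigest()
            except OSError:
                digests[rel] = None
    return digests


def compare_trees(src, dst):
    """Return (missing, differing): files absent from dst, and files whose
    contents differ or could not be read on the source side."""
    src_map, dst_map = tree_digests(src), tree_digests(dst)
    missing = sorted(p for p in src_map if p not in dst_map)
    differing = sorted(
        p for p in src_map
        if p in dst_map and (src_map[p] is None or src_map[p] != dst_map[p])
    )
    return missing, differing
```

Something like compare_trees("/spool", "/path/to/backup/spool") (the 
second path is a placeholder) should return two empty lists before newfs 
is run.  On a busy mail spool this only makes sense once services are 
shut down, so that the source is no longer changing.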

Then if it continues, repeat the above with a 7.3 upgrade before running 
newfs.

If it still continues, then just go nuts and see what 8.0 or 8.1 does. 
But I'd really like to avoid that.

Any tips?

Thanks,

Charles