7.2 - ufs2 corruption

Charles Sprickman spork at bway.net
Wed Jul 7 03:56:08 UTC 2010


On Tue, 6 Jul 2010, Kostik Belousov wrote:

> On Mon, Jul 05, 2010 at 05:37:29PM -0400, Charles Sprickman wrote:
>> On Tue, 6 Jul 2010, Kostik Belousov wrote:
>>
>>> On Mon, Jul 05, 2010 at 05:23:03PM -0400, Charles Sprickman wrote:
>>>> Howdy,
>>>>
>>>> I've posted previously about this, but I'm going to give it one more shot
>>>> before I start reformatting and/or upgrading things.
>>>>
>>>> I have a largish filesystem (1.3TB) that holds a few jails, the main one
>>>> being a mail server.  Running 7.2/amd64 on a Dell 2970 with the mfi
>>>> raid card, 6GB RAM, UFS2 (SU was enabled, I disabled it for testing to
>>>> no effect)
>>>>
>>>> The symptoms are as follows:
>>>>
>>>> Various applications will log messages about "bad file descriptors" (imap,
>>>> rsync backup script, quota counter):
>>>>
>>>> du:
>>>> ./cur/1271801961.M21831P98582V0000005BI08E85975_0.foo.net,S=2824:2,S:
>>>> Bad file descriptor
>>>>
>>>> The kernel also starts logging messages like this to the console:
>>>>
>>>> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
>>>> g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
>>>> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
>>>> g_vfs_done():mfid0s1e[READ(offset=-7347040593908226048, length=16384)]error = 5
>>>> g_vfs_done():mfid0s1e[READ(offset=2456998070156636160, length=16384)]error = 5
>>>>
>>>> Note that the offsets look a bit... suspicious, especially those negative
>>>> ones.
>>>>
>>>> Usually within a day or two of those "g_vfs_done()" messages showing up
>>>> the box will panic shortly after the daily run.  Things are hosed up
>>>> enough that it is unable to save a dump.  The panic always looks like
>>>> this:
>>>>
>>>> panic: ufs_dirbad: /spool: bad dir ino 151699770 at offset 163920: mangled
>>>> entry
>>>> cpuid = 0
>>>> Uptime: 70d22h56m48s
>>>> Physical memory: 6130 MB
>>>> Dumping 811 MB: 796 780 764 748 732 716 700 684 668 652 636 620 604 588
>>>> 572 556 540 524 508 492 476 460 444 428 412 396 380 364 348 332 316 300
>>>> 284
>>>> ** DUMP FAILED (ERROR 16) **
>>>>
>>>> panic: ufs_dirbad: /spool: bad dir ino 150073505 at offset 150: mangled
>>>> entry
>>>> cpuid = 2
>>>> Uptime: 13d22h30m21s
>>>> Physical memory: 6130 MB
>>>> Dumping 816 MB: 801 785 769 753 737 721 705 689
>>>> ** DUMP FAILED (ERROR 16) **
>>>> Automatic reboot in 15 seconds - press a key on the console to abort
>>>> Rebooting...
>>>>
>>>> The fs, specifically "/spool" (which is where the errors always
>>>> originate), will be pretty trashed and require a manual fsck.  The first
>>>> pass finds/fixes errors, but does not mark the fs clean.  It can take
>>>> anywhere from 2-4 passes to get a clean fs.
>>>>
>>>> The box then runs fine for a few weeks or a few months until the
>>>> "g_vfs_done" errors start popping up, then it's a repeat.
>>>>
>>>> Are there any *known* issues with either the fs or possibly the mfi driver
>>>> in 7.2?
>>>>
>>>> My plan was to do something like this:
>>>>
>>>> -shut down services and copy all of /spool off to the backups server
>>>> -newfs /spool
>>>> -copy everything back
>>>>
>>>> Then if it continues, repeat the above with a 7.3 upgrade before running
>>>> newfs.
>>>>
>>>> If it still continues, then just go nuts and see what 8.0 or 8.1 does.
>>>> But I'd really like to avoid that.
>>>>
>>>> Any tips?
>>>
>>> Show "df -i" output for the affected filesystem.
>>
>> Here you go:
>>
>> [spork at bigmail ~]$ df -i /spool
>> Filesystem     1K-blocks     Used      Avail Capacity   iused     ifree %iused Mounted on
>> /dev/mfid0s1g 1359086872 70105344 1180254580       6% 4691134 171006784     3% /spool
>
> I really expected to see the count of inodes on the fs to be bigger
> than 2G. It is not, but it is greater than 1G.
>
> Just to make sure: you do not get any messages from mfi(4) about disk
> errors ?
>
> You could try to format the partition with fewer inodes, see the -i switch
> for newfs. Make it less than 1G, and try your load again.
>
> The bug with handling volumes with >2G inodes was fixed on RELENG_7
> after 7.3 was released. Your symptoms are very similar to what happens
> when the bug is hit.
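
To put numbers on the inode question, here is a rough Python sketch using only the figures from the "df -i" output quoted above. Treating "1G" and "2G" as 2**30 and 2**31 is my assumption, and the density estimate ignores cylinder-group rounding, so it only approximates what newfs would actually create. The 65536 bytes-per-inode value is purely illustrative, not a recommendation.

```python
# Figures copied from the "df -i /spool" output quoted above.
IUSED = 4_691_134          # "iused" column
IFREE = 171_006_784        # "ifree" column
BLOCKS_1K = 1_359_086_872  # "1K-blocks" column

# Assumption: "1G" and "2G" inodes mean 2**30 and 2**31.
ONE_G = 2**30
TWO_G = 2**31


def total_inodes(iused: int, ifree: int) -> int:
    """Total inodes on the filesystem as reported by df -i."""
    return iused + ifree


def bytes_per_inode(blocks_1k: int, inodes: int) -> float:
    """Approximate current inode density (bytes of space per inode)."""
    return blocks_1k * 1024 / inodes


def estimated_inodes(blocks_1k: int, density: int) -> int:
    """Rough inode count for a given newfs -i density (bytes per inode).

    Ignores cylinder-group rounding, so treat the result as an estimate.
    """
    return blocks_1k * 1024 // density


if __name__ == "__main__":
    total = total_inodes(IUSED, IFREE)
    print("total inodes:     ", total)
    print("exceeds 1G (2**30):", total > ONE_G)
    print("exceeds 2G (2**31):", total > TWO_G)
    print("bytes per inode:   %.0f" % bytes_per_inode(BLOCKS_1K, total))
    # Illustrative density only; pick a value that suits the mail spool.
    print("inodes at -i 65536:", estimated_inodes(BLOCKS_1K, 65536))
```

Running it prints the total alongside both thresholds, which makes it easy to see how far the current layout is from either mark before deciding on a new density.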

I have a bit more info... all the memory in this box is in fact ECC, which 
from what I gather is supposed to deal with errors with extra chips that 
store parity info, correct?

Also I ran the megacli consistency check and it came back clean - so I 
think that sort of rules out "bit rot".

At this point I don't see any harm in upgrading, but I'm not sure whether 
I should be looking to 7.3 or 7-STABLE - any pointers?

Perhaps this is helpful, or not...  It crashed again tonight, but it was 
able to dump out a core.

Here's the backtrace - I can't make too much sense of it, but it again 
involves ffs/ufs/vnodes:

# kgdb kernel.debug /var/crash/vmcore.3
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you 
are welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
         = 12
panic: page fault
cpuid = 2
Uptime: 15d19h6m52s
Physical memory: 6130 MB
Dumping 796 MB: 781 765 749 733 717 701 685 669 653 637 621 605 589 573 
557 541 525 509 493 477 461 445 429 413 397 381 365 349 333 317 301 285 
269 253 237 221 205 189 173 157 141 125 109 93 77 61 45 29 13

Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from 
/boot/kernel/nullfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/nullfs.ko
Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from 
/boot/kernel/fdescfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/fdescfs.ko
#0  doadump () at pcpu.h:195
195             __asm __volatile("movq %%gs:0,%0" : "=r" (td));

(kgdb) bt
#0  doadump () at pcpu.h:195
#1  0x0000000000000004 in ?? ()
#2  0xffffffff8034c799 in boot (howto=260)
     at /usr/src/sys/kern/kern_shutdown.c:418
#3  0xffffffff8034cba2 in panic (fmt=0x104 <Address 0x104 out of bounds>)
     at /usr/src/sys/kern/kern_shutdown.c:574
#4  0xffffffff80574823 in trap_fatal (frame=0xffffff009811f000, 
eva=Variable "eva" is not available.)
     at /usr/src/sys/amd64/amd64/trap.c:757
#5  0xffffffff80574bf5 in trap_pfault (frame=0xffffffff2943b500, 
usermode=0)
     at /usr/src/sys/amd64/amd64/trap.c:673
#6  0xffffffff80575534 in trap (frame=0xffffffff2943b500)
     at /usr/src/sys/amd64/amd64/trap.c:444
#7  0xffffffff8055969e in calltrap ()
     at /usr/src/sys/amd64/amd64/exception.S:209
#8  0xffffffff8050382e in ffs_realloccg (ip=0xffffff0186d69508, lbprev=0,
     bprev=6288224785898156086, bpref=601582184, osize=0, nsize=4096,
     flags=33619968, cred=0xffffff00b234d400, bpp=0xffffffff2943b800)
     at /usr/src/sys/ufs/ffs/ffs_alloc.c:1349
#9  0xffffffff80506e8e in ffs_balloc_ufs2 (vp=0xffffff00852957e0, 
startoffset=Variable "startoffset" is not available.
)  at /usr/src/sys/ufs/ffs/ffs_balloc.c:692
#10 0xffffffff805223e5 in ffs_write (ap=0xffffffff2943ba10)
     at /usr/src/sys/ufs/ffs/ffs_vnops.c:724
#11 0xffffffff805a0645 in VOP_WRITE_APV (vop=0xffffffff80793d20,
     a=0xffffffff2943ba10) at vnode_if.c:691
#12 0xffffffff803dd731 in vn_write (fp=0xffffff00548d1a80,
     uio=0xffffffff2943bb00, active_cred=Variable "active_cred" is not 
available.) at vnode_if.h:373
#13 0xffffffff80388768 in dofilewrite (td=0xffffff009811f000, fd=6,
     fp=0xffffff00548d1a80, auio=dwarf2_read_address: Corrupted DWARF 
expression.) at file.h:257
#14 0xffffffff80388a6e in kern_writev (td=0xffffff009811f000, fd=6,
     auio=0xffffffff2943bb00) at /usr/src/sys/kern/sys_generic.c:402
#15 0xffffffff80388aec in write (td=0x1000, uap=0x12d4b9f50)
     at /usr/src/sys/kern/sys_generic.c:318
#16 0xffffffff80596a66 in ia32_syscall (frame=0xffffffff2943bc80)
     at /usr/src/sys/amd64/ia32/ia32_syscall.c:182
#17 0xffffffff80559ad0 in Xint0x80_syscall () at ia32_exception.S:65
#18 0x0000000028241928 in ?? ()
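
For what it's worth, the bprev value in frame #8 and the offsets in the earlier g_vfs_done() messages can be sanity-checked against the size of the filesystem. The Python sketch below does that. It uses the /spool size from df as a stand-in for the partition size (an assumption, since the log lines name mfid0s1e), and the 2048-byte fragment size is also an assumption on my part, not something I have confirmed with dumpfs.

```python
# Size of /spool from the df output ("1K-blocks" column), in bytes.
# Assumption: used here as a stand-in for the underlying partition size.
FS_BYTES = 1_359_086_872 * 1024

# Assumption: fragment size of 2048 bytes (not verified with dumpfs).
FRAG_SIZE = 2048

# Values copied from the console messages and the backtrace.
READ_OFFSETS = [2456998070156636160, -7347040593908226048]
READ_LENGTH = 16384
BPREV = 6288224785898156086


def as_unsigned64(value: int) -> int:
    """Reinterpret a signed 64-bit value as unsigned."""
    return value & (2**64 - 1)


def within_device(offset: int, length: int, device_bytes: int) -> bool:
    """True if [offset, offset + length) lies inside the device."""
    return offset >= 0 and offset + length <= device_bytes


def plausible_fragment(fragno: int, device_bytes: int, frag_size: int) -> bool:
    """True if a fragment number could exist on a device of this size."""
    return 0 <= fragno < device_bytes // frag_size


if __name__ == "__main__":
    for off in READ_OFFSETS:
        print("offset %d (0x%016x): within device = %s"
              % (off, as_unsigned64(off),
                 within_device(off, READ_LENGTH, FS_BYTES)))
    print("bprev %d (0x%016x): plausible fragment = %s"
          % (BPREV, BPREV, plausible_fragment(BPREV, FS_BYTES, FRAG_SIZE)))
```

The hex form is printed next to each value because a garbage block pointer sometimes looks like recognizable data (text, another structure) once it is viewed that way.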

Thanks all,

Charles

