UFS Filesystem issues, and the loss of my hair...

Hearn, Trevor trevor.hearn at Vanderbilt.Edu
Thu Aug 6 14:21:12 UTC 2009


First off, let me state that I love FreeBSD. I've used it for years, and have not had any major problems with it... Until now.

As you can tell, I work for a major university. I set up a large storage array to hold data for a project they have here. No great shakes, just some standard files and such. The fun started when I began loading users onto the system and they started using it... Isn't that always the case? Now I get ufs_dirbad errors, and the system hard locks. This isn't the worst thing that could happen, but with partitions the size that I am using, the fsck takes FOREVER. Somewhere on the order of 1.5 hours. During that time, I bring the individual shares/partitions back online one by one, but the users suffer. I've asked about this before, in a different forum, but got no usable information that I could see. So, here goes...

The system is as follows: a Dell 2950 1U server with a QLogic Fibre Channel card. It is connected to two Promise array chassis, 610 series, each with 16 drives. Each chassis is running RAID 6, which gives me about 12.73 TB of storage per chassis. From there, the logical drives are sliced up into smaller partitions. The largest partition is 3.6 TB; the smallest is 100 GB.
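
As a rough sanity check on the 12.73 per-chassis figure, here is a minimal sketch of the RAID 6 arithmetic. The drive size is not stated above, so the 1 TB (10**12 bytes) value is purely an assumption; with it, 14 data drives come out to about 12.73 when expressed in binary terabytes (TiB).

```python
# Rough RAID 6 usable-capacity check.
# ASSUMPTION: 1 TB (10**12 byte) drives; the drive size is not given in the post.
DRIVES_PER_CHASSIS = 16         # from the post
PARITY_DRIVES = 2               # RAID 6 reserves two drives' worth of parity
ASSUMED_DRIVE_BYTES = 10**12    # hypothetical: 1 TB as marketed by the vendor

def raid6_usable_tib(drives, drive_bytes, parity=PARITY_DRIVES):
    """Return usable capacity in TiB (2**40 bytes) for a single RAID 6 set."""
    if drives <= parity:
        raise ValueError("need more drives than parity units")
    return (drives - parity) * drive_bytes / 2**40

usable = raid6_usable_tib(DRIVES_PER_CHASSIS, ASSUMED_DRIVE_BYTES)
print(f"{usable:.2f} TiB usable per chassis")
```

This ignores any hot spares or controller overhead, which would lower the number.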

Filesystem       Size    Used   Avail Capacity  Mounted on
/dev/mfid0s1a    197G     10G    170G     6%    /
devfs            1.0K    1.0K      0B   100%    /dev
/dev/da0p1       1.8T    1.5T    130G    92%    /slice1
/dev/da0p5       2.7T    1.8T    661G    74%    /slice2
/dev/da0p9       250G     21G    209G     9%    /slice3
/dev/da1p3       103G     12G     83G    12%    /slice4
/dev/da1p4       205G     54G    135G    29%    /slice5
/dev/da1p5       103G    7.3G     87G     8%    /slice6
/dev/da1p6       103G     22G     72G    23%    /slice7
etc...

I had to use GPT to set up the partitions, and they are using UFS2 for the filesystem. Now... If that's not fun enough... I have TWO of these creatures, which rsync every 4 hours. The secondary system is across campus, and sits idle 99% of the time. Every 4 hours, in a stepped schedule, the primary array syncs to the secondary array. If the primary goes down, I fsck, and any files that are fried, I bring back across from the secondary and replace them. This has worked OK for a while, but now I am getting kernel panics on a regular basis. I've been told to migrate to a different filesystem, but my options are ZFS and gjournal with UFS, from what I can tell. I need something repeatable and simple, and I need something robust. I have NO idea why I keep getting errors like this, but I imagine it's a cascading effect of earlier hangs that have caused more corruption.
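
To make the "stepped schedule" concrete, here is a minimal sketch of how per-slice sync start times could be staggered inside a 4-hour window. The 30-minute step and the choice of four slices are assumptions for illustration only; the actual offsets and rsync invocation are not given above.

```python
from datetime import timedelta

SLICES = ["/slice1", "/slice2", "/slice3", "/slice4"]  # mount points from the df listing
INTERVAL = timedelta(hours=4)    # sync interval, from the post
STEP = timedelta(minutes=30)     # ASSUMPTION: hypothetical stagger between slices

def stepped_schedule(slices, interval=INTERVAL, step=STEP):
    """Map each slice to its daily start times ('HH:MM'), each offset by `step`."""
    # Keep every offset inside a single sync window.
    if step * len(slices) > interval:
        raise ValueError("offsets would spill into the next sync window")
    day = timedelta(hours=24)
    schedule = {}
    for i, mount in enumerate(slices):
        times = []
        t = step * i                      # this slice's offset into the window
        while t < day:
            minutes = int(t.total_seconds()) // 60
            times.append(f"{minutes // 60:02d}:{minutes % 60:02d}")
            t += interval
        schedule[mount] = times
    return schedule

for mount, times in stepped_schedule(SLICES).items():
    print(mount, " ".join(times))
```

Each start time could then be turned into a cron entry that runs rsync for that one slice, so the syncs never all hit the array at once.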

I'd buy a fella, or gal, a cup of coffee and a pop-tart if they could help a brother out. I have checked out this link:
http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/
and decided that I need to give this a shot after hours, but being the kinda guy I am, I need to make sure I am covering all of my bases. 

Anyone got any ideas?

Thanks!

-T



More information about the freebsd-fs mailing list