UFS Filesystem issues, and the loss of my hair...

Hearn, Trevor trevor.hearn at Vanderbilt.Edu
Mon Aug 10 19:31:26 UTC 2009


To the FreeBSD-FS group at large...

Well, I've spent a lot of time looking this one over... I set up a share on a web server to put up redacted images of the errors I am getting. They are here:

http://www.trevorhearn.com/Array/IMG_0056.jpg
http://www.trevorhearn.com/Array/IMG_0061.jpg
http://www.trevorhearn.com/Array/IMG_0063.jpg
http://www.trevorhearn.com/Array/IMG_0065.jpg
http://www.trevorhearn.com/Array/IMG_0067.jpg
http://www.trevorhearn.com/Array/IMG_0069.jpg

So, while I was in a meeting about the array, oddly enough, I had this come rolling across the screen of the terminal session I was in...

Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=16384)]error = 5
Aug 10 10:53:43 XXXX last message repeated 20 times
Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=16384)]error = 5
Aug 10 10:53:43 XXXX kernel: g_vfs_done():da1p7[READ(offset=-6419569950008350720, length=16384)]error = 5
Aug 10 10:53:43 XXXX last message repeated 18 times
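
Note, too, that the offset in those reads is negative, which no sane read request should ever be. As a quick sanity check (just a sketch, same device as in the errors), diskinfo(8) will show how big da1p7 actually is:

  # the failing reads are aimed at da1p7; mediasize shows its real size
  diskinfo -v /dev/da1p7 | grep mediasize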

When I say it was rolling across the screen, I mean it kept it up for about 5 minutes... I was waiting for the hard-lock to happen, but the process that was touching the file(s) went to 99.02% and has stayed there for the remainder of the day...

  PID USERNAME     THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1351 xxxxxxxx        1  -8    0 10928K  4656K CPU1   0   2:10 99.02% smbd
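
If I want to see what that pegged smbd is actually doing, the rough plan is to attach truss(1) to the PID shown by top and watch the system calls go by:

  # attach to the smbd above; Ctrl-C detaches without killing it
  truss -p 1351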

Meanwhile, we had this happen earlier in the morning, when we were seeing only moderate usage:

Aug 10 09:54:18 PRSA kernel: pid 1776 (smbd), uid 1194 inumber 107797529 on /xxxxxxxxxx: bad block
Aug 10 09:54:18 PRSA kernel: bad block 165436921330628865, ino 107797529
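
The inumber in that message can at least be mapped back to a pathname, which is how I have been checking the inodes from the pictures. A minimal sketch (the real mount point is redacted above, so /sliceX here is a placeholder):

  # -xdev keeps find from wandering onto other filesystems
  find /sliceX -xdev -inum 107797529 -print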

The bad block number is WAAAY outside anything that actually exists on this machine. So....
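
To put a number on WAAAY: even treating that block number as an address of 16 KB blocks (the block size is an assumption; the READ errors above use length=16384), it points somewhere past two zettabytes, on chassis that hold about 12.73 TB each. Back-of-the-envelope with bc(1):

  echo '165436921330628865 * 16384' | bc
  # 2710518519081023324160 bytes, i.e. roughly 2.7 ZB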

Everything I have found relating to these problems is people asking 'How do I fix this?', and NONE of the answers so far have been a fix. 'Error = 5' is EIO, an input/output error on a device. That being said, either I have a problem with the controller in my Promise array, which I am learning is possible, or I have an issue with a driver in FreeBSD and just happen to have hit the circumstances where it shows up. There does not seem to be a rhyme or reason to what is taking place. How does a set of array controllers throw a bad block error? With a standard drive, I can see it... but an array controller? Some other things that I have found...
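
(For anyone wanting to double-check that errno mapping, it is right there in the system headers:)

  # prints the #define for EIO: value 5, "Input/output error"
  grep -w EIO /usr/include/sys/errno.h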

The link in the quoted message below tells about using 'find / -type d -exec stat {} \;' to run through the filesystem and find corrupted directories. I did this earlier this morning and found none. I went back through several of the inodes that are showing in the pictures above, and only found one still in existence. I battened down the hatches and hit that directory: I was able to cp all of the info in it to another directory without a single problem. With all that I have been reading, this should have caused all manner of hell. I ran fsck on all of the filesystems and got the server back online...

Back online? Yes. It hard-locked at 3:09 AM Sunday morning. Odd, since it has done that MANY times at exactly 3:09 AM. I have Nagios watching the server, and it always seems to go down at the same time. I looked at the cron jobs and found that periodic daily runs at 3:01 AM. My Nagios box checks every 5 minutes, with three retries at one-minute intervals if a service is not available. SO, somewhere in the list of things that the server does in the periodic daily run, there is something that makes the server fault. Tonight I will be going through those jobs, running them one by one (sketched below), to see exactly which one causes the fault. I have seen others speak of boxes going down at 3:00 AM-ish, so I think this might be a bit of a clue.
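
The plan for stepping through them is nothing fancy; the script names are just whatever is in the stock /etc/periodic/daily on this box:

  # run each daily periodic script on its own, with a pause in between,
  # so a hang or panic can be pinned on the last one that started
  for s in /etc/periodic/daily/*; do
      echo "=== $s ==="
      sh "$s"
      sleep 60
  done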

At this point, I am purchasing another 2-port Fibre Channel card, with hopes of installing it in a spare 1U server I have and migrating to Ubuntu or similar. I'd like to test it out with Ubuntu, but at this point I do not know whether it will see the array partitions correctly, nor whether it will let me access the UFS partitions that are there. Worst case, I will back up and re-format the chassis themselves. I would hope that this will not be necessary, but I am almost at my wit's end.
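
For what it's worth, the Linux ufs driver can generally mount UFS2 read-only (write support is another matter), so a first test from Ubuntu might look roughly like this, the device name being a guess until the FC card is in:

  # read-only UFS2 mount on Linux; sdb1 is whatever the card exposes
  mount -t ufs -o ro,ufstype=ufs2 /dev/sdb1 /mnt/test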

Has ANYONE got any ideas, other than the ones presented? I'm keen to see if there is a fix, because I love FreeBSD, but I can't be an evangelist for it when it is giving me so much grief. Thanks for listening; I'll be here all week. :)

-Trevor




________________________________________
From: John Baldwin [jhb at freebsd.org]
Sent: Friday, August 07, 2009 7:29 AM
To: freebsd-fs at freebsd.org
Cc: Hearn, Trevor
Subject: Re: UFS Filesystem issues, and the loss of my hair...

On Thursday 06 August 2009 9:51:04 am Hearn, Trevor wrote:
> First off, let me state that I love FreeBSD. I've used it for years, and
> have not had any major problems with it... Until now.
>
> As you can tell, I work for a major university. I set up a large storage
> array to hold data for a project they have here. No great shakes, just some
> standard files and such. The fun started when I started loading users onto
> the system, and they started using it... Isn't that always the case? Now I
> get ufs_dirbad errors, and the system hard-locks. This isn't the worst thing
> that could happen, but when you're talking about partitions the size that I
> am using, the fsck takes FOREVER: somewhere on the order of 1.5 hours.
> During that time, I am bringing the individual shares/partitions back
> online, but the users suffer. I've asked about this before, in a different
> forum, but got no usable information that I could see. So, here goes...
>
> The system is as such: a Dell 2950 1U server with a QLogic Fibre Channel
> card. It is connected to two Promise array chassis, 610 series, each with 16
> drives. Each chassis is running RAID 6, which gives me about 12.73 TB of
> storage per chassis. From there, the logical drives are sliced up into
> smaller partitions. The largest is a 3.6 TB partition; the smallest is a
> 100 GB partition.
>
> Filesystem       Size    Used   Avail Capacity  Mounted on
> /dev/mfid0s1a    197G     10G    170G     6%    /
> devfs            1.0K    1.0K      0B   100%    /dev
> /dev/da0p1       1.8T    1.5T    130G    92%    /slice1
> /dev/da0p5       2.7T    1.8T    661G    74%    /slice2
> /dev/da0p9       250G     21G    209G     9%    /slice3
> /dev/da1p3       103G     12G     83G    12%    /slice4
> /dev/da1p4       205G     54G    135G    29%    /slice5
> /dev/da1p5       103G    7.3G     87G     8%    /slice6
> /dev/da1p6       103G     22G     72G    23%    /slice7
> etc...
>
> I had to use GPT to set up the partitions, and they are using UFS2 for the
> filesystem. Now, if that's not fun enough... I have TWO of these creatures,
> which rsync every 4 hours. The secondary system is across campus and sits
> idle 99% of the time. Every 4 hours, on a stepped schedule, the primary
> array syncs to the secondary array. If the primary goes down, I fsck, and
> any files that are fried I bring back across from the secondary and replace
> them. This has worked OK for a while, but now I am getting kernel panics on
> a regular basis. I've been told to migrate to a different filesystem, but
> my options are ZFS or using GJOURNAL with UFS (rough sketch below), from
> what I can tell. I need something repeatable, simple, and robust. I have NO
> idea why I keep getting errors like this, but I imagine it's a cascading
> effect of other hangs that have caused more corruption.
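>
> For the GJOURNAL option, my understanding of the setup (untested here, and
> daXpY is just a placeholder; note that newfs wipes the partition) is
> roughly:
>
>   gjournal load
>   gjournal label daXpY
>   newfs -J /dev/daXpY.journal
>   mount -o async /dev/daXpY.journal /mnt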
>
> I'd buy a fella, or gal, a cup of coffee and a pop-tart if they could help
> a brother out. I have checked out this link:
>
> http://phaq.phunsites.net/2007/07/01/ufs_dirbad-panic-with-mangled-entries-in-ufs/
> and decided that I need to give this a shot after hours, but being the kinda
> guy I am, I need to make sure I am covering all of my bases.

Are you seeing ufs_dirbad panics?  Specifically, can you capture the messages
on the console when the machine panics?

--
John Baldwin

