Errors on UFS Partitions

Michael Powell nightrecon at hotmail.com
Sun Jan 17 02:36:18 UTC 2010


The-IRC FreeBSD wrote:

> Hi,
> 
> I am sorry if I am asking a question that has been brought up before. I
> have attempted to research my issue, but it has many angles it might be
> listed under, so please bear with me.
> 
> We have had ongoing problems with UFS errors on our root partition (and
> on any additional partition that did not have soft-updates enabled by
> default). We recently had a problem when a secondary drive that housed
> home directories filled up completely; everything then locked up under
> huge CPU and memory usage because nothing could write to the drive, and
> when the server was rebooted it failed to boot because of critical
> errors on the root partition.

A healthy system does not get UFS errors during normal operation.
 
> We have /etc and /usr on the root partition and our home/var partitions
> mistakenly do not have soft-updates flag set.
> 
> ::dmesg::
> http://the-irc.com/dmesg
> 
> ::mount::
> /dev/ad4s1a on / (ufs, local)
> devfs on /dev (devfs, local, multilabel)
> /dev/ad4s1d on /home (ufs, local, with quotas)
> /dev/ad4s1e on /tmp (ufs, local, noexec, nosuid, soft-updates)
> /dev/ad4s1f on /var (ufs, local)
> devfs on /var/named/dev (devfs, local, multilabel)
> procfs on /proc (procfs, local)
> /dev/ad0s1e on /Backups (ufs, local, soft-updates)
> /dev/ad0s1d on /root (ufs, local, soft-updates)
[snip]
> 
> To keep these errors from getting out of control (we cannot fix the
> root partition's errors without going into single-user mode, nor the
> other partitions' without remounting them with the soft-updates flag),
> does anyone advise stripping the root partition down to just the
> bootloader and moving /etc and /usr (or above all /usr) to their own
> partition, or do you have a better solution?

No. Proceeding in directions such as this is a waste of time.
 
> Every partition gets errors over time, but if you are unable to correct
> them without downtime, how are you supposed to correct them before they
> get out of control?

Probably by not looking for a software solution to a hardware problem. It is 
not normal for a file system to behave as you describe. Moving partitions 
around and similar approaches are doomed to failure because they do not 
address the underlying problem.
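One concrete place to start looking for that underlying problem is the drives' SMART data. As a sketch, assuming the sysutils/smartmontools port is installed and using the ad4/ad0 device names from your mount output:

```shell
# Sketch only: requires root and smartmontools (pkg_add -r smartmontools
# or the sysutils/smartmontools port). Device names taken from the mount
# listing above.
smartctl -H /dev/ad4         # overall drive health self-assessment
smartctl -A /dev/ad4         # attributes: watch reallocated/pending sectors
smartctl -l error /dev/ad4   # the drive's internal error log
smartctl -t short /dev/ad4   # start a short self-test; read results later
smartctl -l selftest /dev/ad4

smartctl -H /dev/ad0         # repeat for the second drive
```

A drive that is silently remapping sectors or logging read errors will often show it here long before fsck tells you anything.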

Real server hardware with a sophisticated ECC subsystem usually has BIOS 
counters you can check for stats on memory errors. Hard drives fail most 
often, but bad memory or a bad drive controller can just as readily corrupt 
data. If you have a RAID controller with a RAM cache, that RAM could also be 
defective.
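The BIOS counters are the authoritative place to look, but the kernel may also have logged machine-check or memory complaints. A quick, hedged sweep of the usual log locations:

```shell
# Sketch: search the boot dmesg buffer and system log for machine-check
# or ECC-related messages the kernel may have recorded.
grep -iE 'mca|machine check|ecc|parity' /var/run/dmesg.boot /var/log/messages
```

No hits does not prove the memory is good; a memtest pass with the machine offline is the more thorough check.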

Hardware failure is going to mean downtime. I'd find the hardware problem, 
get it fixed, and only then worry about how to proceed. If you have decent 
backups from before the corruption, you can get back to where you need to be 
in relatively short order. If you never fix the hardware defect, you will 
never get your server back to normal operation.
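Once the hardware is trusted again, the filesystem side is straightforward. A rough sequence from single-user mode (so the filesystems are unmounted or read-only), using the partitions from your mount listing, might be:

```shell
# Sketch: run from single-user mode; adjust device names to your layout.
fsck -y /dev/ad4s1a          # repair the root partition
fsck -y /dev/ad4s1d          # repair /home
fsck -y /dev/ad4s1f          # repair /var

# Enable soft-updates on the partitions that are missing the flag.
# tunefs must be run on an unmounted (or read-only) filesystem.
tunefs -n enable /dev/ad4s1d
tunefs -n enable /dev/ad4s1f
```

After that, a normal multi-user boot should come up with all partitions clean and soft-updates active everywhere you want it.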

-Mike





More information about the freebsd-questions mailing list