Softupdates: df, du, sync and fsck [quite long]

Sat Jun 28 12:12:08 PDT 2003

John Ekins wrote:
> Hello Bill,
> 
> On Fri, 27 Jun 2003 23:53:30 -0400
> Bill Moran <wmoran at potentialtech.com> wrote:
> 
> -> I don't know what's wrong, but does unmounting and remounting the partition
> -> reclaim the lost space?
> 
> Alas, I can't umount the partition, my guess is because it is unable to sync
> (nothing to do with open files, and no error message saying "device busy"). The
> command just doesn't return after I've issued it.

Hmmm ... not good.  A little more research might qualify this problem for a PR.
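
If it helps in gathering numbers for a PR, the gap between what df and du
report can be put into figures.  What follows is only a rough Python sketch,
not anything taken from the FreeBSD tools: the function names are made up for
illustration, and it assumes st_blocks is counted in 512-byte units.

```python
import os

def df_used_bytes(path):
    """Bytes in use on the filesystem holding path (roughly what df shows)."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

def du_used_bytes(path):
    """Bytes allocated to everything under path (roughly what du -x shows)."""
    root = os.lstat(path)
    seen = {(root.st_dev, root.st_ino)}
    total = root.st_blocks * 512          # assumption: 512-byte block units
    for dirpath, dirnames, filenames in os.walk(path):
        same_fs_dirs = []
        entries = [(d, True) for d in dirnames] + [(f, False) for f in filenames]
        for name, is_dir in entries:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue                  # vanished or unreadable, skip it
            if st.st_dev != root.st_dev:
                continue                  # another filesystem is mounted here
            if is_dir:
                same_fs_dirs.append(name)
            key = (st.st_dev, st.st_ino)
            if key not in seen:           # count hard-linked files only once
                seen.add(key)
                total += st.st_blocks * 512
        dirnames[:] = same_fs_dirs        # do not descend into other mounts
    return total

def df_du_gap(mount_point):
    """Space df counts as used that du cannot find under mount_point."""
    return df_used_bytes(mount_point) - du_used_bytes(mount_point)
```

Run as root against the mount point itself, e.g. df_du_gap("/var"), once
with the spooling process running and again some time after it has stopped.
A gap that never shrinks would be worth quoting in the PR.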

> -> If there's a LOT of inodes with problems, it could easily take a while to fix. 
> -> Also, if you run fsck without specifying a filesystem to fix, it exhaustively
> -> checks all filesystems.  So even if the problem is on /var, it might spend a
> -> long time checking /usr as well.  You can work around this by calling fsck 
> -> with the filesystem to check.
> 
> I don't think it's to do with inodes or block size, etc. There's about 2M inodes
> on /var. A manual fsck on a dirty shutdown on this partition (ignoring the problem
> in hand) takes a couple of minutes.

Hmmm ...

> -> If these are production boxes, I'd recommend turning it off until you resolve
> -> the problem.
> 
> Indeed, I tried that last night on one machine and it put the load through the
> roof (48).

Yikes!  Is the machine still responsive?  Sometimes you can put the load that high
and still have a functional box.
I'm guessing by the way the conversation is going that you're able to grab one of
these boxes and make some tweaks.  Possibly try putting the spool directory on
a dedicated partition and mounting it async?  The catch is that if the box shuts
down dirty, you'll probably have to newfs that partition before you can use it
again.
At least make sure the spool partition is separate from your log partition; that
should help to mitigate the problem (although you may already have done that).

> -> I don't know if this would qualify as "advice", but since nobody else
> -> seems to have any suggestions, I figured I'd throw my thoughts in.
> 
> -> Are you using ATA or SCSI drives? 
> 
> SCSI.
> 
> -> Does issuing a manual "sync" once you've stopped the spooling process help
> -> any?  
> 
> No. I'd already tried numerous syncs, and of course a clean shutdown tries that
> too.

I was wondering if maybe the syncs were taking longer than the shutdown process
was willing to wait.
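
One way to check that guess is to time a few syncs by hand after stopping the
spooler.  This is a minimal Python sketch with made-up function names.  Note
that sync(2) is allowed to return before the buffers are completely flushed,
so the figures are at best a lower bound on how long the flush really takes.

```python
import os
import time

def timed_sync():
    """Call sync once and return how long the call took, in seconds."""
    start = time.monotonic()
    os.sync()
    return time.monotonic() - start

def time_syncs(rounds=3, pause=1.0):
    """Time several syncs in a row, pausing between them."""
    results = []
    for i in range(rounds):
        results.append(timed_sync())
        if i < rounds - 1:
            time.sleep(pause)
    return results

if __name__ == "__main__":
    for n, secs in enumerate(time_syncs(), start=1):
        print("sync %d took %.2f seconds" % (n, secs))
```

If the first sync takes far longer than the later ones, and longer than
shutdown waits before giving up, that would fit the theory.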

> -> Are these all identical mobos ... possibly a BIOS update available?  
> 
> Haven't looked for an update, but I think they're all identical.

Hmmm ... but the fact that you're using SCSI makes this less of an issue, unless
it's onboard SCSI.  Possibly an update to the SCSI BIOS?

> -> These aren't IBM ATA drives are they?  I had one of those give  me grief for
> -> months (if you look in the archives, you should be able to find details on
> -> which drives caused problems).
> 
> Alas not! They're straightforward Seagates, which in other machines we use (much
> lighter load) don't have this problem.
> 
> -> Have you tried updating one of the machines to 4.8 to see if the problem
> -> has been fixed?
> 
> I haven't tried that yet but will do so. I'm also going to test a 5.1R machine,
> perhaps the background fsck will help when I alas come to reboot.

It may save you some time to look in CVS at the drivers for the SCSI subsystem,
as well as the drivers for your specific cards, to see if any commit messages
talk about fixing problems like this.
My experience with background fsck is that the machine is slow as hell while the
background fsck is running.  Whether this is better or worse than what
you're experiencing with 4.7 is a question only you can answer.

> -> Like I said, not good advice, just some ideas for you.
> 
> All advice and ideas are welcome.

Well ... I'm really shooting in the dark with these suggestions, but hopefully
something in there will be useful.

-- 
Bill Moran
Potential Technologies
http://www.potentialtech.com