running mksnap_ffs

Tue Jan 16 21:17:35 UTC 2007

Kris Kennaway writes:
| On Tue, Jan 16, 2007 at 09:26:47PM +0100, Willem Jan Withagen wrote:
| > Doug Ambrisko wrote:
| > >| > or things can get wedged.  We have some other patches as well that
| > >| > might be required.  As a hack on a local server we have been using
| > >| > snap shots to do a "hot" back-up of a data base each morning.  This
| > >| > is based on 6.x.
| > >|
| > >| What do you mean by "get wedged"?  Are you seeing a deadlock, and if
| > >| so then what are the details?  When you say 6.x, do you mean
| > >| up-to-date RELENG_6?  There were various snapshot deadlock fixes
| > >| committed over the past year including some in the past few months.
| > >
| > >The file-system would come to a stop, processes stuck on bio, snap-shots
| > >not finishing etc.  This was caused by the system running out of usable
| > >buffers.  The change forces them to be flushed every so often.  This is
| > >independent of locking.  10 might be too aggressive.  Some scaling of
| > >nbuf would probably be better.
| > 
| > When I run mksnap_ffs it runs to the point where ANY access to the 
| > filesystem gives that process a lockup.
| 
| Yes, that is expected.  Actually it begins when something accesses the
| directory in which the snapshot is being made, since that causes the
| parent directory to be locked...then something tries to access the
| parent directory, which eventually cascades back to the root.
| 
| > Getting the file system back is only thru "hard reboot". Trying to do it 
| > the gentle way locks the whole system.
| 
| Or waiting until the snapshot operation finishes.  You (still) haven't
| determined that it's actually hanging as opposed to just waiting for
| the snapshot operation to finish.

In my case it was easy to see that all the buffers were exhausted and
the system was churning, waiting for some to become available.  Since
they were all used up, it never recovered.  By sync'ing the buffers they
got cleaned up and the system never ran out, so the snapshot was then
able to finish.  I traced this problem in the debugger, where you can
see it happen.  There are other issues with the buffer daemon as well.
We hit these since we run with a relatively low nbuf.  The buffers can
get so badly fragmented that the daemon can't flush things, since it
can't get a full-size buffer.  Another problem is that it can end up
waiting on itself, since the current code can't use its emergency space
to flush stuff.  You can see this via ps etc.  It's not a good thing if
the buffer daemon is waiting on itself :-(
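
To make the exhaustion scenario concrete, here is a toy user-space model in
C.  It is only a sketch and not the actual kernel patch: struct bufpool,
flush_interval() and run_snapshot() are hypothetical names, and the divisor
of 16 is an arbitrary illustrative choice.  It assumes every write dirties
one buffer and that a flush cleans all of them, which is far simpler than
what the real buffer cache does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the FreeBSD kernel. */
struct bufpool {
    size_t nbuf;   /* total buffers in the pool */
    size_t dirty;  /* buffers that are dirty and unusable until flushed */
};

/*
 * Pick a flush interval that scales with the pool size instead of
 * using a fixed constant.  The divisor 16 is an assumption made for
 * illustration; the floor of 1 keeps tiny pools from never flushing.
 */
static size_t flush_interval(size_t nbuf)
{
    size_t n = nbuf / 16;
    return n > 0 ? n : 1;
}

/*
 * Simulate `writes` operations that each dirty one buffer.  When
 * `interval` is nonzero, all dirty buffers are flushed after every
 * `interval` writes.  Returns true if every write found a free
 * buffer, false if the pool ran dry (the "wedged" case).
 */
static bool run_snapshot(struct bufpool *bp, size_t writes, size_t interval)
{
    for (size_t i = 1; i <= writes; i++) {
        if (bp->dirty == bp->nbuf)
            return false;          /* no usable buffer left */
        bp->dirty++;
        if (interval != 0 && i % interval == 0)
            bp->dirty = 0;         /* periodic forced flush */
    }
    return true;
}
```

In this model a run with interval 0 (never flushing) fails as soon as the
number of writes exceeds nbuf, while a run using flush_interval(nbuf)
completes.  A fixed interval only works when it is smaller than nbuf, which
is why scaling it from nbuf is attractive for systems with a low nbuf.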

We have patches for this as well, but they need some more work.  I was
working with Tor on this, but then I got swamped at work with our
4.X -> 6.X and platform transition.  All I can say is that we don't
suffer from these problems now :-)  I have printfs that log this stuff
when some of these bugs are hit.  Now the system survives those lock-up
points.

Doug A.