running mksnap_ffs

Tue Jan 16 21:20:55 UTC 2007

On Tue, Jan 16, 2007 at 01:17:33PM -0800, Doug Ambrisko wrote:
> Kris Kennaway writes:
> | On Tue, Jan 16, 2007 at 09:26:47PM +0100, Willem Jan Withagen wrote:
> | > Doug Ambrisko wrote:
> | > >| > or things can get wedged.  We have some other patches as well that
> | > >| > might be required.  As a hack on a local server we have been using snap shots
> | > >| > to do a "hot" back-up of a data base each morning.  This is based on
> | > >| > 6.x.
> | > >|
> | > >| What do you mean by "get wedged"?  Are you seeing a deadlock, and if
> | > >| so then what are the details?  When you say 6.x, do you mean
> | > >| up-to-date RELENG_6?  There were various snapshot deadlock fixes
> | > >| committed over the past year including some in the past few months.
> | > >
> | > >The file-system would come to a stop, processes stuck on bio, snap-shots
> | > >not finishing etc.  This was caused by the system running out of usable
> | > >buffers.  The change forces them to be flushed every so often.  This is
> | > >independent of locking.  10 might be too aggressive.  Some scaling of
> | > >nbuf would probably be better.
> | > 
> | > When I run mksnap_ffs it runs to the point where ANY access to the 
> | > filesystem gives that process a lockup.
> | 
> | Yes, that is expected.  Actually it begins when something accesses the
> | directory in which the snapshot is being made, since that causes the
> | parent directory to be locked...then something tries to access the
> | parent directory, which eventually cascades back to the root.
> | 
> | > Getting the file system back is only thru "hard reboot". Trying to do it 
> | > the gentle way locks the whole system.
> | 
> | Or waiting until the snapshot operation finishes.  You (still) haven't
> | determined that it's actually hanging as opposed to just waiting for
> | the snapshot operation to finish.
> 
> In my case it was easy to see that all the buffers were exhausted and
> the system was churning waiting for some to become available.  Since they
> were all used up it never recovered.  By sync'ing the buffers they got
> cleaned up and then the system never ran out.  The snap shot was then
> able to finish.  Via the debugger you can see this happen.  I traced
> this problem in the debugger.  There are other issues with the buffer
> daemon as well.  We hit these since we run with a relatively low
> nbuf.  The buffers can get so fragmented that it can't flush
> things since it can't get a full-size buffer.  Another problem is that
> it can end up waiting on itself since the current code can't use
> its emergency space to flush stuff.  You can see this via ps etc.
> It's not a good thing if the buffer daemon is waiting on itself :-(
> 
> We have patches for this as well but they need some more work.  I was
> working with Tor on this but then I got swamped at work with our 4.X -> 6.X
> and platform transition.  All I can say is that we don't suffer from
> these problems now :-)  I have printf's that log this stuff when some of
> these bugs are hit.  Now the system survives those lock-up points.
> these bugs are hit.  Now the system survives those lock-up points.

Thanks for clarifying.  Hopefully you and Tor can get something
committed soon!

Kris
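
For readers following the thread, the buffer-exhaustion behaviour Doug
describes can be illustrated with a toy model.  This is only a sketch and
not the FreeBSD buffer cache or Doug's patch: the names run_writer,
flush_every and scaled_interval are hypothetical, and a "flush" here just
marks every dirty buffer clean.  It shows why a writer that never flushes
stalls once all nbuf buffers are dirty, and why a fixed interval such as
10 may suit one nbuf but not another, which is the reason scaling by nbuf
was suggested.

```python
def run_writer(nbuf, writes, flush_every=None):
    """Toy model: each write dirties one buffer; a flush cleans all of them.

    Returns how many writes completed before the pool ran out of clean
    buffers (equal to `writes` if it never stalled).
    """
    dirty = 0
    for done in range(writes):
        if dirty == nbuf:
            # No clean buffer left and nothing will clean one: wedged.
            return done
        dirty += 1
        if flush_every and (done + 1) % flush_every == 0:
            dirty = 0  # stand-in for syncing dirty buffers to disk
    return writes


def scaled_interval(nbuf, fraction=4):
    """Hypothetical scaling: flush once 1/fraction of the pool is dirty."""
    return max(1, nbuf // fraction)


if __name__ == "__main__":
    for nbuf in (64, 8):
        print("nbuf", nbuf)
        print("  no flush:       ", run_writer(nbuf, 1000))
        print("  flush every 10: ", run_writer(nbuf, 1000, flush_every=10))
        print("  scaled interval:", run_writer(nbuf, 1000, scaled_interval(nbuf)))
```

In this model a fixed interval only helps while it is smaller than nbuf,
whereas an interval derived from nbuf keeps working as the pool shrinks.
The real kernel behaviour also involves fragmentation and the buffer
daemon, which this sketch does not attempt to capture.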