running mksnap_ffs
Kris Kennaway
kris at obsecurity.org
Tue Jan 16 21:20:55 UTC 2007
On Tue, Jan 16, 2007 at 01:17:33PM -0800, Doug Ambrisko wrote:
> Kris Kennaway writes:
> | On Tue, Jan 16, 2007 at 09:26:47PM +0100, Willem Jan Withagen wrote:
> | > Doug Ambrisko wrote:
> | > >| > or things can get wedged. We have some other patches as well that
> | > >| > might be required. As a hack on a local server we have been using
> | > >| > snapshots to do a "hot" backup of a database each morning. This is
> | > >| > based on 6.x.
> | > >|
> | > >| What do you mean by "get wedged"? Are you seeing a deadlock, and if
> | > >| so then what are the details? When you say 6.x, do you mean
> | > >| up-to-date RELENG_6? There were various snapshot deadlock fixes
> | > >| committed over the past year including some in the past few months.
> | > >
> | > >The file system would come to a stop, processes stuck on bio, snapshots
> | > >not finishing etc. This was caused by the system running out of usable
> | > >buffers. The change forces them to be flushed every so often. This is
> | > >independent of locking. 10 might be too aggressive. Some scaling of
> | > >nbuf would probably be better.
> | >
> | > When I run mksnap_ffs it runs to the point where ANY access to the
> | > filesystem gives that process a lockup.
> |
> | Yes, that is expected. Actually it begins when something accesses the
> | directory in which the snapshot is being made, since that causes the
> | parent directory to be locked...then something tries to access the
> | parent directory, which eventually cascades back to the root.
> |
> | > Getting the file system back is only thru "hard reboot". Trying to do it
> | > the gentle way locks the whole system.
> |
> | Or waiting until the snapshot operation finishes. You (still) haven't
> | determined that it's actually hanging as opposed to just waiting for
> | the snapshot operation to finish.
>
> In my case it was easy to see that all the buffers were exhausted and
> the system was churning waiting for some to become available. Since they
> were all used up it never recovered. By sync'ing the buffers they got
> cleaned up and then the system never ran out. The snap shot was then
> able to finish. Via the debugger you can see this happen. I traced
> this problem in the debugger. There are other issues with the buffer
> daemon as well. We hit these since we run with a relatively low
> nbuf. The buffers can get so badly fragmented that it can't flush
> things since it can't get a full-size buffer. Another problem is that
> it can end up waiting on itself since the current code can't use
> its emergency space to flush stuff. You can see this via ps etc.
> It's not a good thing if the buffer daemon is waiting on itself :-(
>
> We have patches for this as well but they need some more work. I was
> working with Tor on this but then I got swamped at work with our 4.X -> 6.X
> and platform transition. All I can say is that we don't suffer from
> these problems now :-) I have printf's that log this stuff when some of
> these bugs are hit. Now the system survives those lock-up points.
Thanks for clarifying. Hopefully you and Tor can get something
committed soon!
Kris
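
For readers following the thread, here is a toy model of the buffer
exhaustion Doug describes. It is only a sketch under stated assumptions,
not FreeBSD kernel code and not the patch discussed above: the names
run_snapshot and scaled_interval, the "one free buffer needed to flush"
rule, and the nbuf/4 scaling are all made up for illustration.

```python
def scaled_interval(nbuf):
    """Hypothetical flush interval derived from the pool size (nbuf / 4)."""
    return max(1, nbuf // 4)


def run_snapshot(nbuf, writes, flush_every=None):
    """Simulate a snapshot dirtying one buffer per write.

    Returns how many writes completed. A result smaller than `writes`
    means the pool ran dry and the toy system wedged.
    """
    free = nbuf   # buffers available for new writes
    dirty = 0     # buffers waiting to be flushed
    for done in range(writes):
        if free == 0:
            # No buffer to write into and none left to flush with.
            return done
        free -= 1
        dirty += 1
        if flush_every and (done + 1) % flush_every == 0:
            # Assumption: a flush needs at least one free buffer to work.
            if free > 0:
                free += dirty
                dirty = 0
    return writes


if __name__ == "__main__":
    print(run_snapshot(64, 1000))                      # no forced flush
    print(run_snapshot(64, 1000, flush_every=10))      # fixed interval
    print(run_snapshot(8, 1000, flush_every=10))       # low nbuf, fixed interval
    print(run_snapshot(8, 1000, flush_every=scaled_interval(8)))
```

In this model a fixed interval of 10 keeps a 64-buffer pool alive but
still wedges an 8-buffer pool, while an interval scaled from nbuf
survives both. That is one way to read the remark that some scaling of
nbuf would probably be better than a hard-coded 10.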