process checkpoint restore facility now in DragonFly BSD

Thu Jan 13 18:05:24 PST 2005

On Wed, Jan 12, 2005 at 01:40:02PM -0800, Brooks Davis wrote:
> On Wed, Jan 12, 2005 at 02:17:38PM -0700, Siddharth Aggarwal wrote:
> > 
> > I am responding to a post back in Oct 2003 when the checkpointing feature
> > was announced for DragonFly. I have been doing some research on this, and
> > have seen some projects that use Xen VMM to achieve checkpoints of guest
> > OSes.
> > 
> > So I was looking for inputs from people as to what everyone feels about
> > checkpointing, whether it should be done at the physical machine level or
> > VM level. Pros and Cons of each approach, if any further development was
> > done on DragonFly for checkpoint since then and if it was stopped, why?
> > Are there serious limitations to checkpointing a physical machine?
> > 
> > Sorry for such a vague posting, but I thought this would be a good
> > platform to get some feedback.
> 
> The DragonFly lists would be the logical place to discuss DragonFly
> features.
> 
> From my perspective as a scientific computing user, VM level
> checkpointing is of little use since I get the overhead of the VM and
> I can't easily do the application level checkpointing required to
> checkpoint distributed programs.  There are probably a number of places
> where it is useful in scientific computing, but I don't find it to be
> all that interesting.

IMHO, it all depends on whether process checkpointing is made practical
and reliable enough to be employed for non-trivial programs.  I'm
not entirely convinced that a single system checkpoint is the
ultimate answer, though that is certainly highly desirable.

One potential drawback with full system images is the lack of
support for runtime checkpoints (multiple process checkpoints) and
the lack of a framework for process migration and/or persistence
of a subset of the processes on a system.

Persistence is almost non-existent at all levels, and sessioning is
weak.  A whole solution is needed (integrating the two).  The work
thus far shouldn't be brushed off so easily, as a multi-tiered approach
could be of benefit.

Each level of persistence offers its own pros and cons:
	- Scope & Granularity of operation (degrees of flexibility in
	  specification, checkpoint set);
	- Storage options;
	- Interface;
	- Means of Coordination;
	- etc.

For process checkpointing: a means to coordinate checkpoints and to
satisfy the order of dependency between processes under checkpoint is
a next step in the implementation path.
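
To make the ordering problem concrete, here is a rough sketch (not
DragonFly code) of how a coordinator might pick an order for a set of
dependent processes using a topological sort.  The process names, the
depends_on mapping and the checkpoint_order() helper are all
hypothetical, and checkpointing dependencies before their dependents
is only an assumption; a real design might well want the reverse.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def checkpoint_order(depends_on):
    """Order processes so each one comes after the processes it depends on.

    depends_on maps a (hypothetical) process name to the set of process
    names it depends on.  Raises graphlib.CycleError on circular
    dependencies, which a coordinator would have to handle separately.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical process set: a worker talks to a database, and both
# write to a logger.
deps = {
    "worker": {"database", "logger"},
    "database": {"logger"},
    "logger": set(),
}

order = checkpoint_order(deps)
print(order)
```

Circular dependencies (two processes sharing a pipe in both
directions, say) would not fit this model and would need to be
checkpointed together as one unit.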

Building on previous email:
  *     Process Checkpointing Support:
	[..]
	An often overlooked application of process-level persistence
	is fault-tolerance.  It might be possible to have a process
	survive an otherwise fatal system panic and/or hardware
	failure.  [Without having to resume from a whole system
	checkpoint.]
	[..]

> -- Brooks
> 
> -- 
> Any statement of the form "X is the one, true Y" is FALSE.
> PGP fingerprint 655D 519C 26A7 82E7 2529  9BF0 5D8E 8BE9 F238 1AD4

-- 
 Allan Fields, AFRSL - http://afields.ca
 2D4F 6806 D307 0889 6125  C31D F745 0D72 39B4 5541