process checkpoint restore facility now in DragonFly BSD
Matthew Dillon
dillon at apollo.backplane.com
Mon Oct 20 17:01:32 PDT 2003
:I've lived through several checkpointing implementations. You've got the easy
:part. Applications must participate or such a facility has very limited
:usefulness. Delivering a signal is only part of the problem; there tend to
:be issues synchronizing user-mode checkpoint of application state with the
:kernel's desire to stop the process and squirrel away state.
:
:There's lots of stuff published about this; check the literature.
:
: Sam

Well, now it depends heavily on one's goals. There are a huge number
of scientific jobs that only need the type of basic checkpointing
that you see in, say, Linux, which I believe can only handle sbrk()
space. Kip has taken it one step further with the file descriptor
and mapping save/restore. It's kinda silly to poo-poo the work when
the alternative is to have nothing at all. Being able to bite a chunk
out of a significant scientific application-set is important. There's
an obvious demand for even the very basic checkpointing capability that
you see in Linux and I personally believe that it can be done a whole lot
better in a BSD environment.

The work is also applicable to other things, like debugging. It's
a better savecore than savecore, so to speak. With just a tiny bit
of work one can checkpoint a running program and then check-restore it
into a stopped state and attach GDB to it without interfering with the
original process. You get the entire memory space and most of the
descriptors *intact*, and you get a live duplicate of the process,
making it possible to single step (at least up to a point) even a
program that normally could not be checkpointed. I'll take that
over the static image you get from a core file any day of the week!

In a non-SSI environment there are limits (which have not yet been
reached). In an SSI environment, however, which is one of DragonFly's
goals, one needs only to add cluster-wide filehandles and a
stall/restart capability and the checkpoint code will be able to
move the biggest chunk of the process, its anonymous memory, to
another physical machine, with the rest of the pieces trailing behind.
That is why the work is so exciting to me. Even if SSI is not one of
your goals, the scientific and debugging benefits of the basic
capability cannot be denied. You do want to compete a bit more with
Linux, don't you? Well, this is how it starts.
-Matt
Matthew Dillon
<dillon at backplane.com>