Re: Warm and Live Migration Implementation for bhyve

From: Elena Mihailescu <elenamihailescu22_at_gmail.com>
Date: Tue, 27 Jun 2023 13:35:41 UTC

Hi Corvin,

Thank you for the questions! I'll respond to them inline.

On Mon, 26 Jun 2023 at 10:16, Corvin Köhne <corvink@freebsd.org> wrote:
>
> Hi Elena,
>
> thanks for posting this proposal here.
>
> Some open questions from my side:
>
> 1. How is the data sent to the target? Does the host send a complete
> dump and the target parses it? Or does the target request data one by
> one and the host sends it as response?
>
It's not a dump of the guest's state; the state is transmitted in steps.
However, some parts may be migrated as a single chunk (e.g., the emulated
devices' state is transmitted as the buffer generated by the
snapshot functions).

I'll briefly describe the protocol we have implemented for
migration; it may also partially answer the second and third
questions.

The destination host waits for the source host to connect (through a socket).
After that, the source sends its system specifications (hw_machine,
hw_model, hw_pagesize). If the source and destination hosts have
identical hardware configurations, the migration can take place.
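
To make the check concrete, here is a minimal sketch in Python. It is an
illustration only and not code from the patches (which are written in C);
the function name specs_compatible, the dictionary layout and the sample
values are assumptions. Only the three field names come from the
description above.

```python
# Illustrative sketch, not the bhyve implementation.
# Fields the source host sends before migration starts.
SPEC_KEYS = ("hw_machine", "hw_model", "hw_pagesize")


def specs_compatible(src, dst):
    """Return True only if every compared field exists and is identical."""
    return all(
        key in src and key in dst and src[key] == dst[key]
        for key in SPEC_KEYS
    )


# Hypothetical sample values, for demonstration only.
src = {"hw_machine": "amd64", "hw_model": "Example CPU", "hw_pagesize": 4096}
dst_same = dict(src)
dst_other = dict(src, hw_pagesize=16384)

print(specs_compatible(src, dst_same))   # migration may proceed
print(specs_compatible(src, dst_other))  # migration is refused
```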

Then, for live migration, we migrate the memory in rounds
(i.e., we get the list of pages that have the dirty bit set, send it
to the destination so it knows which pages to expect, then send the
pages through the socket; this process is repeated until the last
round).
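
The round structure can be sketched as follows. This is a simplified
Python model, not the actual code: the guest "runs" between rounds
instead of concurrently, the number of rounds is fixed (max_rounds is an
assumption, as the stop criterion is not described here), and all
function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def migrate_round(memory, dirty, send):
    """Send the list of dirty page indices, then the pages themselves."""
    pages = sorted(dirty)
    dirty.clear()  # pages sent in this round count as clean again
    send(("page_list", pages))
    for idx in pages:
        start = idx * PAGE_SIZE
        send(("page", idx, bytes(memory[start:start + PAGE_SIZE])))
    return len(pages)


def live_migrate_memory(memory, dirty, send, run_guest, max_rounds):
    """Run the pre-copy rounds; return the pages left for the final round."""
    for round_no in range(max_rounds):
        migrate_round(memory, dirty, send)
        run_guest(round_no)  # the guest keeps executing and dirties pages
    return sorted(dirty)  # sent only after the vCPUs are stopped


memory = bytearray(8 * PAGE_SIZE)
dirty = set(range(8))  # first round: every allocated page
sent = []


def run_guest(round_no):
    # Hypothetical workload: touches one page per round.
    memory[round_no * PAGE_SIZE] = round_no + 1
    dirty.add(round_no)


remaining = live_migrate_memory(memory, dirty, sent.append, run_guest, 3)
```

After the three rounds, remaining holds the indices of the pages that
still have to be sent once the guest is stopped.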

Next, we stop the guest's vCPUs and send either the remaining memory
(for live migration) or the guest's memory starting at vmctx->baseaddr
(for warm migration). Then, based on the suspend/resume feature, we get
the state of the virtualized devices (the ones from the kernel space)
and send this buffer to the destination. We repeat this for the emulated
devices as well (the ones from the userspace).

On the receiving host, we take the memory pages and place them at their
corresponding positions in the guest's memory, use the restore functions
to restore the state of the devices, and start the guest's execution.
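
As an illustration of the receiving side, the following Python sketch
copies each received page to its offset in a buffer standing in for the
guest's memory. The (index, data) message layout and the name
apply_pages are assumptions, not the format used by the patches.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def apply_pages(guest_memory, received):
    """Copy each received page to its position in the guest's memory."""
    for idx, data in received:
        start = idx * PAGE_SIZE
        if len(data) != PAGE_SIZE:
            raise ValueError("page %d has unexpected size" % idx)
        if idx < 0 or start + PAGE_SIZE > len(guest_memory):
            raise ValueError("page %d is outside guest memory" % idx)
        guest_memory[start:start + PAGE_SIZE] = data


guest = bytearray(4 * PAGE_SIZE)
apply_pages(guest, [(2, b"\xab" * PAGE_SIZE)])
```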

Excluding the guest's memory transfer, the rest is based on the
suspend/resume feature. We snapshot the guest's state, but instead of
saving the data locally, we send it via network to the destination. On
the destination host, we start a new virtual machine, but instead of
reading/getting the state from the disk (i.e., the snapshot files) we
get this state via the network from the source host.

If the destination can properly resume the guest activity, it will
send an "OK" to the source host so it can destroy/remove the guest
from its end.
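
A minimal sketch of this final exchange, in Python: the state buffer is
sent over a socket instead of being written to a file, and the source
only removes the guest after it receives "OK". The length-prefixed
framing, the "FAIL" reply and all function names are assumptions made
for the example; only the "OK" message comes from the description above.

```python
import socket
import struct


def send_buffer(sock, payload):
    """Send one buffer, prefixed with its length (assumed framing)."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, count):
    """Read exactly count bytes or fail if the peer closes early."""
    data = bytearray()
    while len(data) < count:
        part = sock.recv(count - len(data))
        if not part:
            raise ConnectionError("peer closed the connection")
        data += part
    return bytes(data)


def recv_buffer(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def destination_resume(sock, restore):
    """Receive the state, try to restore it, and report the result."""
    ok = restore(recv_buffer(sock))
    send_buffer(sock, b"OK" if ok else b"FAIL")
    return ok


def source_finish(sock, destroy_guest):
    """Destroy the local guest only after the destination confirms."""
    if recv_buffer(sock) == b"OK":
        destroy_guest()
        return True
    return False


# Both ends in one process, connected through a socket pair.
src_sock, dst_sock = socket.socketpair()
restored, destroyed = [], []
send_buffer(src_sock, b"device-state")
destination_resume(dst_sock, lambda state: restored.append(state) or True)
done = source_finish(src_sock, lambda: destroyed.append(True))
src_sock.close()
dst_sock.close()
```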

Both warm and live migration are based on "cold migration". Cold
migration means we suspend the guest on the source host, and restore
the guest on the destination host from the snapshot files. Warm
migration does the same, but transfers the state over a socket instead
of through files, while live migration additionally changes the way the
memory is migrated.

> 2. What happens if we add a new data section?
>
What do you mean by a new data section? Is this question
related to the third one? If so, see my answer below.

> 3. What happens if the bhyve version differs on host and target
> machine?

The two hosts must be identical for migration, which is why we check
the specifications of the two migration hosts against each other. They
are expected to have the same version of bhyve and FreeBSD. We will add
an additional check to the specification check to see if we have the
same FreeBSD build.

As long as the changes in the virtual memory subsystem won't affect
bhyve (and how the virtual machine sees/uses the memory), the
migration constraints should only be related to suspend/resume. The
state of the virtual devices is handled by the snapshot system, so if
it is able to accommodate changes in the data structures, the
migration process will not be affected.

Thank you,
Elena

>
>
> --
> Kind regards,
> Corvin
>
> On Fri, 2023-06-23 at 13:00 +0300, Elena Mihailescu wrote:
> > Hello,
> >
> > This mail presents the migration feature we have implemented for
> > bhyve. Any feedback from the community is much appreciated.
> >
> > We have opened a stack of reviews on Phabricator
> > (https://reviews.freebsd.org/D34717) that is meant to split the code
> > in smaller parts so it can be more easily reviewed. A brief history
> > of
> > the implementation can be found at the bottom of this email.
> >
> > The migration mechanism we propose needs two main components in order
> > to move a virtual machine from one host to another:
> > 1. the guest's state (vCPUs, emulated and virtualized devices)
> > 2. the guest's memory
> >
> > For the first part, we rely on the suspend/resume feature. We call
> > the
> > same functions as the ones used by suspend/resume, but instead of
> > saving the data in files, we send it via the network.
> >
> > The most time consuming aspect of migration is transmitting guest
> > memory. The UPB team has implemented two options to accomplish this:
> > 1. Warm Migration: The guest execution is suspended on the source
> > host
> > while the memory is sent to the destination host. This method is less
> > complex but may cause extended downtime.
> > 2. Live Migration: The guest continues to execute on the source host
> > while the memory is transmitted to the destination host. This method
> > is more complex but offers reduced downtime.
> >
> > The proposed live migration procedure (pre-copy live migration)
> > migrates the memory in rounds:
> > 1. In the initial round, we migrate all the guest memory (all pages
> > that are allocated)
> > 2. In the subsequent rounds, we migrate only the pages that were
> > modified since the previous round started
> > 3. In the final round, we suspend the guest, migrate the remaining
> > pages that were modified from the previous round and the guest's
> > internal state (vCPU, emulated and virtualized devices).
> >
> > To detect the pages that were modified between rounds, we propose an
> > additional dirty bit (virtualization dirty bit) for each memory page.
> > This bit would be set every time the page's dirty bit is set.
> > However,
> > this virtualization dirty bit is reset only when the page is
> > migrated.
> >
> > The proposed implementation is split in two parts:
> > 1. The first one, the warm migration, is just a wrapper on the
> > suspend/resume feature which, instead of saving the suspended state
> > on
> > disk, sends it via the network to the destination
> > 2. The second part, the live migration, uses the layer previously
> > presented, but sends the guest's memory in rounds, as described
> > above.
> >
> > The migration process works as follows:
> > 1. we identify:
> >  - VM_NAME - the name of the virtual machine which will be migrated
> >  - SRC_IP - the IP address of the source host
> >  - DST_IP - the IP address of the destination host
> >  - DST_PORT - the port we want to use for migration (default is 24983)
> > 2. we start a virtual machine on the destination host that will wait
> > for a migration. Here, we must specify SRC_IP (and the port we want
> > to
> > open for migration, default is 24983).
> > e.g.: bhyve ... -R SRC_IP:24983 guest_vm_dst
> > 3. using bhyvectl on the source host, we start the migration process.
> > e.g.: bhyvectl --migrate=DST_IP:24983 --vm=guest_vm
> >
> > A full tutorial on this can be found here:
> > https://github.com/FreeBSD-UPB/freebsd-src/wiki/Virtual-Machine-Migration-using-bhyve
> >
> > For sending the migration request to a virtual machine, we use the
> > same thread/socket that is used for suspend.
> > For receiving a migration request, we used a similar approach to the
> > resume process.
> >
> > As some of you may remember seeing similar emails from our part on
> > the
> > freebsd-virtualization list, I'll present a brief history of this
> > project:
> > The first part of the project was the suspend/resume implementation
> > which landed in bhyve in 2020, under the BHYVE_SNAPSHOT guard
> > (https://reviews.freebsd.org/D19495).
> > After that, we focused on two tracks:
> > 1. adding various suspend/resume features (multiple device support -
> > https://reviews.freebsd.org/D26387, CAPSICUM support -
> > https://reviews.freebsd.org/D30471, having a uniform file format -
> > during the bhyve bi-weekly calls at that time, we concluded that the
> > JSON format was the most suitable -
> > https://reviews.freebsd.org/D29262) so we can remove the #ifdef
> > BHYVE_SNAPSHOT guard.
> > 2. implementing the migration feature for bhyve. Since this one
> > relies
> > on the save/restore, but does not modify its behaviour, we considered
> > we can go in parallel with both tracks.
> > We had various presentations in the FreeBSD Community on these
> > topics:
> > AsiaBSDCon2018, AsiaBSDCon2019, BSDCan2019, BSDCan2020,
> > AsiaBSDCon2023.
> >
> > The first patches for warm and live migration were opened in 2021:
> > https://reviews.freebsd.org/D28270,
> > https://reviews.freebsd.org/D30954. However, the general feedback on
> > these was that the patches are too big to be reviewed, so we should
> > split them in smaller chunks (this was also true for some of the
> > suspend/resume improvements). Thus, we split them into smaller parts.
> > Also, as things changed in bhyve (i.e., capsicum support for
> > suspend/resume was added this year), we rebased and updated our
> > reviews.
> >
> > Thank you,
> > Elena
> >
>