From: Elena Mihailescu <elenamihailescu22@gmail.com>
To: Corvin Köhne
Cc: freebsd-virtualization@freebsd.org, Mihai Carabas, Matthew Grooms
Date: Tue, 27 Jun 2023 16:35:41 +0300
Subject: Re: Warm and Live Migration Implementation for bhyve
Hi Corvin,

Thank you for the questions! I'll respond to them inline.

On Mon, 26 Jun 2023 at 10:16, Corvin Köhne wrote:
>
> Hi Elena,
>
> thanks for posting this proposal here.
>
> Some open questions from my side:
>
> 1. How is the data sent to the target? Does the host send a complete
> dump and the target parses it? Or does the target request data one by
> one and the host sends it as a response?
>

It's not a dump of the guest's state; the state is transmitted in
steps. However, some parts may be migrated as a single chunk (e.g.,
the emulated devices' state is transmitted as the buffer generated by
the snapshot functions). I'll describe the protocol we have
implemented for migration; it may partially answer the second and
third questions as well.

The destination host waits for the source host to connect (through a
socket). After that, the source sends its system specifications
(hw_machine, hw_model, hw_pagesize). If the source and destination
hosts have identical hardware configurations, the migration can take
place (a rough sketch of this check is included below).

Then, for live migration, we migrate the memory in rounds: we get a
list of the pages that have the dirty bit set, send it to the
destination so it knows which pages will be received, then send the
pages through the socket; this process is repeated until the last
round. Next, we stop the guest's vcpus and send the remaining memory
(for live migration) or the guest's memory starting from
vmctx->baseaddr (for warm migration). Then, based on the
suspend/resume feature, we get the state of the virtualized devices
(the ones in kernel space) and send that buffer to the destination.
We repeat this for the emulated devices as well (the ones in
userspace).

On the receiving host, we take the memory pages and place them at
their corresponding positions in the guest's memory, use the restore
functions to rebuild the device state, and start the guest's
execution.

Excluding the guest's memory transfer, the rest is based on the
suspend/resume feature: we snapshot the guest's state, but instead of
saving the data locally, we send it over the network to the
destination.
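To make the "check specs" step above a bit more concrete, here is a
rough sketch of how the source side might collect and send its
specifications. The structure and function names are only
illustrative (they are not the ones from the Phabricator reviews), and
the sketch ignores the endianness/padding concerns a real wire format
has to handle:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Illustrative only; the real patches use their own structures. */
struct migration_specs {
	char	hw_machine[64];		/* hw.machine, e.g. "amd64" */
	char	hw_model[128];		/* hw.model, the CPU model string */
	int	hw_pagesize;		/* hw.pagesize */
};

static int
migration_get_specs(struct migration_specs *specs)
{
	size_t len;

	memset(specs, 0, sizeof(*specs));

	len = sizeof(specs->hw_machine);
	if (sysctlbyname("hw.machine", specs->hw_machine, &len, NULL, 0) != 0)
		return (errno);

	len = sizeof(specs->hw_model);
	if (sysctlbyname("hw.model", specs->hw_model, &len, NULL, 0) != 0)
		return (errno);

	len = sizeof(specs->hw_pagesize);
	if (sysctlbyname("hw.pagesize", &specs->hw_pagesize, &len, NULL, 0) != 0)
		return (errno);

	return (0);
}

/* Source side: send the local specifications over the migration socket. */
static int
migration_send_specs(int socket_fd)
{
	struct migration_specs specs;
	int error;

	error = migration_get_specs(&specs);
	if (error != 0)
		return (error);

	if (write(socket_fd, &specs, sizeof(specs)) != (ssize_t)sizeof(specs))
		return (errno);

	return (0);
}

The destination reads the same structure and compares it field by
field with its own values (a sketch of that side is further below).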
On the destination host, we start a new virtual machine, but instead
of reading the state from disk (i.e., from the snapshot files), we
receive it over the network from the source host. If the destination
can properly resume the guest's activity, it sends an "OK" to the
source host so the source can destroy/remove the guest on its end.

Both warm and live migration are based on "cold migration". Cold
migration means we suspend the guest on the source host and restore
the guest on the destination host from the snapshot files. Warm
migration does the same thing over a socket, while live migration
additionally changes the way the memory is migrated.

> 2. What happens if we add a new data section?
>

What are you referring to with a new data section? Is this question
related to the third one? If so, see my answer below.

> 3. What happens if the bhyve version differs on host and target
> machine?

The two hosts must be identical for migration; that is why we have the
part where we check the specifications of the two migration hosts.
They are expected to run the same version of bhyve and FreeBSD. We
will add an additional check to the "check specs" part to verify that
both hosts run the same FreeBSD build. As long as changes in the
virtual memory subsystem do not affect bhyve (and how the virtual
machine sees/uses its memory), the migration constraints should only
be those of suspend/resume. The state of the virtual devices is
handled by the snapshot system, so as long as it can accommodate
changes in the data structures, the migration process will not be
affected.
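For illustration, the destination side of that check could look
roughly like the sketch below, building on the migration_specs
structure and migration_get_specs() from the earlier sketch (again,
the names are illustrative, not the ones from the reviews):

/*
 * Destination side: receive the peer's specifications and compare them
 * with the local ones.  The migration is allowed only if the hosts have
 * identical hardware configurations.
 */
static int
migration_check_specs(int socket_fd, int *compatible)
{
	struct migration_specs local, remote;
	ssize_t nread;
	int error;

	error = migration_get_specs(&local);
	if (error != 0)
		return (error);

	nread = read(socket_fd, &remote, sizeof(remote));
	if (nread != (ssize_t)sizeof(remote))
		return (nread < 0 ? errno : EIO);

	/* Defensively NUL-terminate the strings received from the peer. */
	remote.hw_machine[sizeof(remote.hw_machine) - 1] = '\0';
	remote.hw_model[sizeof(remote.hw_model) - 1] = '\0';

	*compatible =
	    strcmp(local.hw_machine, remote.hw_machine) == 0 &&
	    strcmp(local.hw_model, remote.hw_model) == 0 &&
	    local.hw_pagesize == remote.hw_pagesize;

	return (0);
}

The additional "same FreeBSD build" check mentioned above would
naturally become one more field in the same structure.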
Thank you,
Elena

>
>
> --
> Kind regards,
> Corvin
>
> On Fri, 2023-06-23 at 13:00 +0300, Elena Mihailescu wrote:
> > Hello,
> >
> > This mail presents the migration feature we have implemented for
> > bhyve. Any feedback from the community is much appreciated.
> >
> > We have opened a stack of reviews on Phabricator
> > (https://reviews.freebsd.org/D34717) that is meant to split the code
> > into smaller parts so it can be more easily reviewed. A brief history
> > of the implementation can be found at the bottom of this email.
> >
> > The migration mechanism we propose needs two main components in
> > order to move a virtual machine from one host to another:
> > 1. the guest's state (vCPUs, emulated and virtualized devices)
> > 2. the guest's memory
> >
> > For the first part, we rely on the suspend/resume feature. We call
> > the same functions as the ones used by suspend/resume, but instead
> > of saving the data in files, we send it via the network.
> >
> > The most time-consuming aspect of migration is transmitting the
> > guest memory. The UPB team has implemented two options to accomplish
> > this:
> > 1. Warm Migration: The guest execution is suspended on the source
> > host while the memory is sent to the destination host. This method
> > is less complex but may cause extended downtime.
> > 2. Live Migration: The guest continues to execute on the source host
> > while the memory is transmitted to the destination host. This method
> > is more complex but offers reduced downtime.
> >
> > The proposed live migration procedure (pre-copy live migration)
> > migrates the memory in rounds:
> > 1. In the initial round, we migrate all the guest memory (all pages
> > that are allocated).
> > 2. In the subsequent rounds, we migrate only the pages that were
> > modified since the previous round started.
> > 3. In the final round, we suspend the guest and migrate the
> > remaining pages that were modified since the previous round,
> > together with the guest's internal state (vCPU, emulated and
> > virtualized devices).
> >
> > To detect the pages that were modified between rounds, we propose an
> > additional dirty bit (virtualization dirty bit) for each memory
> > page. This bit would be set every time the page's dirty bit is set.
> > However, this virtualization dirty bit is reset only when the page
> > is migrated.
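(A quick illustration of the rounds described above: once we have a
bitmap of the pages dirtied since the previous round, one transfer
round is conceptually just a walk over that bitmap. The sketch below
is a simplified example, not the code from the reviews; how the bitmap
is obtained from the kernel and the exact wire format are left out,
and the names are made up for illustration.)

#include <sys/types.h>

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Send the pages whose bit is set in the dirty-page bitmap.  Each page
 * is prefixed with its index so the destination knows where to place it
 * in the guest's memory.
 */
static int
send_dirty_pages(int socket_fd, const uint8_t *dirty_bitmap, size_t npages,
    const uint8_t *guest_mem, size_t page_size)
{
	uint64_t idx;
	size_t i;

	for (i = 0; i < npages; i++) {
		if ((dirty_bitmap[i / 8] & (1 << (i % 8))) == 0)
			continue;	/* not modified during this round */

		idx = i;
		if (write(socket_fd, &idx, sizeof(idx)) !=
		    (ssize_t)sizeof(idx))
			return (errno);
		if (write(socket_fd, guest_mem + i * page_size, page_size) !=
		    (ssize_t)page_size)
			return (errno);
	}

	return (0);
}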
> >
> > The proposed implementation is split in two parts:
> > 1. The first one, warm migration, is just a wrapper over the
> > suspend/resume feature which, instead of saving the suspended state
> > on disk, sends it via the network to the destination.
> > 2. The second part, live migration, uses the layer previously
> > presented, but sends the guest's memory in rounds, as described
> > above.
> >
> > The migration process works as follows:
> > 1. we identify:
> > - VM_NAME - the name of the virtual machine which will be migrated
> > - SRC_IP - the IP address of the source host
> > - DST_IP - the IP address of the destination host
> > - DST_PORT - the port we want to use for migration (default is 24983)
> > 2. we start a virtual machine on the destination host that will wait
> > for a migration. Here, we must specify SRC_IP (and the port we want
> > to open for migration, default is 24983).
> > e.g.: bhyve ... -R SRC_IP:24983 guest_vm_dst
> > 3. using bhyvectl on the source host, we start the migration process.
> > e.g.: bhyvectl --migrate=DST_IP:24983 --vm=guest_vm
> >
> > A full tutorial on this can be found here:
> > https://github.com/FreeBSD-UPB/freebsd-src/wiki/Virtual-Machine-Migration-using-bhyve
> >
> > For sending the migration request to a virtual machine, we use the
> > same thread/socket that is used for suspend.
> > For receiving a migration request, we used a similar approach to the
> > resume process.
> >
> > As some of you may remember seeing similar emails from us on the
> > freebsd-virtualization list, I'll present a brief history of this
> > project:
> > The first part of the project was the suspend/resume implementation,
> > which landed in bhyve in 2020 under the BHYVE_SNAPSHOT guard
> > (https://reviews.freebsd.org/D19495).
> > After that, we focused on two tracks:
> > 1. adding various suspend/resume features (multiple device support -
> > https://reviews.freebsd.org/D26387, CAPSICUM support -
> > https://reviews.freebsd.org/D30471, having a uniform file format -
> > during the bhyve bi-weekly calls, we concluded that the JSON format
> > was the most suitable at that time -
> > https://reviews.freebsd.org/D29262) so we can remove the #ifdef
> > BHYVE_SNAPSHOT guard.
> > 2. implementing the migration feature for bhyve. Since this one
> > relies on save/restore, but does not modify its behaviour, we
> > considered we could work on both tracks in parallel.
> > We have given various presentations on these topics in the FreeBSD
> > community: AsiaBSDCon2018, AsiaBSDCon2019, BSDCan2019, BSDCan2020,
> > AsiaBSDCon2023.
> >
> > The first patches for warm and live migration were opened in 2021:
> > https://reviews.freebsd.org/D28270,
> > https://reviews.freebsd.org/D30954. However, the general feedback on
> > these was that the patches were too big to be reviewed and we should
> > split them into smaller chunks (this was also true for some of the
> > suspend/resume improvements). Thus, we split them into smaller parts.
> > Also, as things changed in bhyve (e.g., capsicum support for
> > suspend/resume was added this year), we rebased and updated our
> > reviews.
> >
> > Thank you,
> > Elena
> >
>