Re: BHYVE SNAPSHOT image format proposal

From: Corvin Khne <corvink_at_FreeBSD.org>
Date: Mon, 05 Jun 2023 15:32:57 UTC
On Tue, 2023-05-23 at 19:05 +0300, Vitaliy Gusev wrote:
> Hi,
> 
> Here is a proposal for bhyve snapshot/checkpoint image format
> improvements.
> 

Hi Vitaliy,

thank you very much for sending this proposal to the mailing list.

> It implies moving snapshot code to nvlist engine. 
> 
> Current snapshot implementation has disadvantages:
> 
>  * 3 files per snapshot: .meta, .kern, vram
>  * Binary Stream format of data.
>  * Adding  optional variable - breaks resume
>  * Removing variable - breaks resume
>  * Changing saved order of variables - breaks resume
>  * Hard to get information about what is saved and decode.
>  * Hard to debug if somethings goes wrong
>  * No versions. If change code, resume of an old images can be
>    passed, but with UB.
> 
> New nvlist implementation should solve all things above. The first
> step -
> improve snapshot/checkpoint saving format. It eliminates three files
> usage
> per a snapshot.
> 
> 
> 1. BHYVE SNAPSHOT image format:  
> 
> > +———————————————————————————————————————+
> > |      HEADER PHYS - 4096 BYTES         |
> > +———————————————————————————————————————+
> > |                                       |
> > |                DATA                   |
> > |                                       |
> > +———————————————————————————————————————+
> 
> 2. HEADER PHYS format: 
> 
>  0    +—————————————————————————————————————————+ 
>       |        IDENT STRING  - 64 BYTES         |
>  64   +—————————————————————————————————————————+   
>       | NVLIST SIZE  - 4 BYTES   |  NVLIST DATA |
>  72   +—————————————————————————————————————————+
>       |                                         |
>       |               NVLIST DATA               |
>       |                                         |
>  4096 +—————————————————————————————————————————+
> 
> > 
> > IDENT STRING - Each producer can set its own value to specify
> > image.
> > NVLIST SIZE  - The following packed header nvlist data size.
> > NVLIST DATA - Packed nvlist header data.
> > 
> > 4KB should be enough for the HEADER to keep basic information about
> > Sections. However, it can
> > be enlarged lately, without breaking backward compatibility. 
> > 

I can't see an advantage of using a fixed sized header of 4KB. You have
to parse the offset and size of every section anyways. If it's for
alignment requirements you can still align all sections on save and set
the offset accordingly. So, why complicating things by using a fixed
header size?

The IDENT STRING seems to be very large. Even a GUID which should be a
global unique identifier uses just 16 Bytes. Additionally, it might be
worth using a dedicated ident and version field for an easier version
parsing. E.g.:

+------------------+-------------------+
| IDENT - 56 BYTES | VERSION - 8 BYTES |
+------------------+-------------------+

IDENT - "BHYVE CHECKPOINT IMAGE"
VERSION - 1 (as uint64_t)

Btw: I don't care but here we could leave some free space for possible
forward compatibility. E.g.:

+------------------+-------------------+-------------------------+
| IDENT - 16 BYTES | VERSION - 8 BYTES | _FREE_SPACE_ - 40 BYTES |
+------------------+-------------------+-------------------------+

> 3. NVLIST HEADER consists of Sections in the following format:
> 
>  * Name - string
>  * Type:    string:
>     - “text,   - plain text,
>     - “nvlist” - packed nvlist,
>     - “binary” - raw binary data.
>  * Size - Size of section - uint64
>  * Offset - Offset in image format - uint64
> > 
>     Predefined sections:  “config”, “devices”, “kernel”, “memory”. 
> 

LGTM.

> 
> 4. EXAMPLE:
> 
> 
>  IDENT STRING:
> 
>        "BHYVE CHECKPOINT IMAGE VERSION 1"
> 
>  NVLIST HEADER: 
> 
>   [config]
>         config.offset = 0x1000 (4096)
>         config.size = 0x1f6 (502)
>         config.type = "text"
>   [kernel]
>         kernel.offset = 0x11f6 (4598)
>         kernel.size = 0x19a7 (6567)
>         kernel.type = “nvlist"
>   [devices]
>         devices.offset = 0x2b9d (11165)
>         devices.size = 0x10145ba (16860602)
>         devices.type = "nvlist"
>   [memory]
>         memory.offset = 0x1200000 (18874368)
>         memory.size = 0x3ce00000 (1021313024)
>         memory.type = “binary"
> 
>  SECTIONS:
> 
>  [section "config" size 0x1f6 offset 0x1000]:
> > memory.size=1024M
> > x86.strictmsr=true
> > x86.vmexit_on_hlt=true
> > cpus=2
> > acpi_tables=true
> > pci.0.0.0.device=hostbridge
> > pci.0.31.0.device=lpc
> > pci.0.4.0.device=virtio-net
> > pci.0.4.0.backend=tap0
> > pci.0.7.0.device=fbuf
> > pci.0.7.0.tcp=10.42.0.78:5900
> > pci.0.7.0.w=1024
> > pci.0.7.0.h=768
> > pci.0.5.0.device=ahci
> > pci.0.5.0.port.0.type=cd
> > pci.0.5.0.port.0.path=/ISO/ubuntu-22.04.1-live-server-amd64.iso
> > lpc.bootrom=/usr/local/share/uefi-firmware/BHYVE_UEFI.fd
> > checkpoint.date="Wed Jan 25 23:48:29 2023"
> > name=ubuntu22
> 

Not sure if it's just an example for the "text" type. bhyve converts it
into a nvlist, so it could be saved directly as nvlist.

Btw: I would only implement the "text" type if there's an usecase that
can't be solved by one of the other types.

>  [section "kernel" size 0x19a7 offset 0x11f6]:
>    [vm]
>         vm.vds_version = 0x1 (1)
>         vm.cpu0.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00
> 00 00 00 ...  size=0x28
>         vm.cpu1.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00
> 00 00 00 ...  size=0x28
>         vm.checkpoint_tsc = 0xe2e0ac6fbe456 (3991273496896598)
>    [hpet]
>         hpet.vds_version = 0x1 (1)
>         hpet.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 ...  size=0x118
>    [vmx]
>         vmx.vds_version = 0x1 (1)
>         vmx.cpu_features = 0 (0)
>         vmx.cpu0.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 40 B4 21
> B9 FF FF FF FF ...  size=0x288
>         vmx.cpu1.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 00 00 67
> 41 D8 9B FF FF ...  size=0x288
>    [ioapic]
>         ioapic.vds_version = 0x1 (1)
>         ioapic.data(BINARY): 00 00 01 00 00 00 00 00 00 00 00 00 00
> 00 00 00 ...  size=0x208
>    [lapic]
>         lapic.vds_version = 0x1 (1)
>         lapic.cpu0.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 ...  size=0x460
>         lapic.cpu1.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 ...  size=0x460
>    [atpit]
>         atpit.vds_version = 0x1 (1)
>         atpit.data(BINARY): 00 00 00 00 00 00 00 00 54 AD 51 97 0F 0E
> 00 00 ...  size=0xa0
>    [atpic]
>         atpic.vds_version = 0x1 (1)
>         atpic.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 01 00
> 00 00 ...  size=0x84
>    [pmtimer]
>         pmtimer.vds_version = 0x1 (1)
>         pmtimer.uptime = 0x26fd133e5cc (2679274464716)
>    [rtc]
>         rtc.vds_version = 0x1 (1)
>         rtc.data(BINARY): 0A 00 00 00 00 FE FF FF 10 35 13 3D 40 F7
> 14 00 ...  size=0x98
> 
> —
> Thanks,
> Vitaliy Gusev
> 

All in all, it looks good. Keep on your work!

Regards checksum feature:
We should focus on enabling this feature by default before adding
advanced features. So, keep it simple and small.

Regards forward compatibility:
Backward compatibility is way more important than forward
compatibility. Nevertheless, forward compatibility would be nice to
have. So, we should keep it in mind when modifying the layout. For the
moment, just focus on a format which is backward compatible.


-- 
Kind regards,
Corvin