BHYVE SNAPSHOT image format proposal

From: Vitaliy Gusev <gusev.vitaliy_at_gmail.com>
Date: Tue, 23 May 2023 16:05:31 UTC
Hi,

Here is a proposal for bhyve snapshot/checkpoint image format improvements.

It implies moving the snapshot code to the nvlist engine.

The current snapshot implementation has several disadvantages:

- 3 files per snapshot: .meta, .kern, vram.
- Data is stored as a binary stream.
- Adding an optional variable breaks resume.
- Removing a variable breaks resume.
- Changing the saved order of variables breaks resume.
- It is hard to find out what is saved and to decode it.
- It is hard to debug if something goes wrong.
- No versioning. If the code changes, resuming an old image may
  succeed, but with undefined behavior (UB).

The new nvlist implementation should solve all of the problems above. The
first step is to improve the snapshot/checkpoint saving format, which
eliminates the need for three files per snapshot.
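
To illustrate why a positional binary stream is fragile while a name/value
list is not, here is a small Python sketch. It uses a plain dict as a
stand-in for an nvlist, and the field names (irq, counter, flags) are made
up for the example; it is not bhyve code.

```python
import struct


# Positional stream: the reader must know the exact order and width of fields.
def save_stream(irq: int, counter: int) -> bytes:
    return struct.pack("<IQ", irq, counter)


def load_stream_v2(blob: bytes) -> tuple[int, int, int]:
    # A newer reader that expects one extra uint32 ("flags") at the end.
    # On an old blob this raises struct.error because the sizes differ.
    return struct.unpack("<IQI", blob)


# Keyed state: the reader looks fields up by name and can default missing ones.
def save_keyed(irq: int, counter: int) -> dict:
    return {"irq": irq, "counter": counter}


def load_keyed_v2(state: dict) -> tuple[int, int, int]:
    # Hypothetical optional field "flags" falls back to 0 for old images.
    return state["irq"], state["counter"], state.get("flags", 0)
```

With keyed state, an image written before "flags" existed still loads,
while the positional reader rejects it.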


1. BHYVE SNAPSHOT image format:  

+---------------------------------------+
|      HEADER PHYS - 4096 BYTES         |
+---------------------------------------+
|                                       |
|                DATA                   |
|                                       |
+---------------------------------------+


2. HEADER PHYS format: 

 0    +-----------------------------------------+
      |        IDENT STRING  - 64 BYTES         |
 64   +-----------------------------------------+
      | NVLIST SIZE  - 4 BYTES   |  NVLIST DATA |
 72   +-----------------------------------------+
      |                                         |
      |               NVLIST DATA               |
      |                                         |
 4096 +-----------------------------------------+


IDENT STRING - Each producer can set its own value to identify the image.
NVLIST SIZE  - Size of the packed header nvlist data that follows.
NVLIST DATA  - Packed nvlist header data.

4KB should be enough for the HEADER to keep basic information about Sections.
However, it can be enlarged later without breaking backward compatibility.
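
As a rough illustration of the layout above, the sketch below builds and
parses a HEADER PHYS block in Python. The byte order of NVLIST SIZE
(little-endian here), the NUL padding of IDENT STRING, and the function names
are assumptions that this proposal does not fix; the packed nvlist is treated
as opaque bytes.

```python
import struct

HEADER_SIZE = 4096             # HEADER PHYS size from the diagram
IDENT_SIZE = 64                # IDENT STRING field
NVLIST_OFF = IDENT_SIZE + 4    # NVLIST DATA starts right after the 4-byte size


def build_header(ident: str, packed_nvlist: bytes) -> bytes:
    """Lay out IDENT STRING, NVLIST SIZE and NVLIST DATA in one 4096-byte block."""
    raw_ident = ident.encode("ascii")
    if len(raw_ident) > IDENT_SIZE:
        raise ValueError("ident string longer than 64 bytes")
    if len(packed_nvlist) > HEADER_SIZE - NVLIST_OFF:
        raise ValueError("packed nvlist does not fit into the header")
    # Assumption: size is a little-endian uint32, unused bytes are zero.
    body = raw_ident.ljust(IDENT_SIZE, b"\0")
    body += struct.pack("<I", len(packed_nvlist)) + packed_nvlist
    return body.ljust(HEADER_SIZE, b"\0")


def parse_header(header: bytes) -> tuple[str, bytes]:
    """Return (ident string, packed nvlist bytes) from a HEADER PHYS block."""
    if len(header) < HEADER_SIZE:
        raise ValueError("short header")
    ident = header[:IDENT_SIZE].rstrip(b"\0").decode("ascii")
    (nv_size,) = struct.unpack_from("<I", header, IDENT_SIZE)
    if nv_size > HEADER_SIZE - NVLIST_OFF:
        raise ValueError("nvlist size exceeds header")
    return ident, header[NVLIST_OFF:NVLIST_OFF + nv_size]
```

A reader would pass the returned bytes to an nvlist unpacker to obtain the
section table.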

3. NVLIST HEADER consists of Sections, each described by the following fields:

Name   - string
Type   - string, one of:
           "text"   - plain text,
           "nvlist" - packed nvlist,
           "binary" - raw binary data.
Size   - size of the section - uint64
Offset - offset of the section within the image - uint64

    Predefined sections: "config", "devices", "kernel", "memory".
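
The section descriptor can be modelled as a small record. The Python sketch
below is only an illustration: the class name, the helper, and the rule that
sections must not overlap each other or the 4096-byte header are assumptions
about what a reader might want to verify, not part of the proposal.

```python
from dataclasses import dataclass

HEADER_SIZE = 4096
SECTION_TYPES = {"text", "nvlist", "binary"}


@dataclass(frozen=True)
class Section:
    name: str
    type: str
    size: int     # uint64 in the header nvlist
    offset: int   # uint64, counted from the start of the image


def check_layout(sections: list[Section]) -> None:
    """Raise ValueError on an unknown type or on overlapping data."""
    end = HEADER_SIZE  # nothing may start inside the header
    for s in sorted(sections, key=lambda s: s.offset):
        if s.type not in SECTION_TYPES:
            raise ValueError(f"{s.name}: unknown type {s.type!r}")
        if s.offset < end:
            raise ValueError(f"{s.name}: overlaps preceding data")
        end = s.offset + s.size
```

Gaps between sections are allowed by this check, so a producer is free to
align a large section such as "memory".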


4. EXAMPLE:


 IDENT STRING:

       "BHYVE CHECKPOINT IMAGE VERSION 1"

 NVLIST HEADER: 

  [config]
        config.offset = 0x1000 (4096)
        config.size = 0x1f6 (502)
        config.type = "text"
  [kernel]
        kernel.offset = 0x11f6 (4598)
        kernel.size = 0x19a7 (6567)
        kernel.type = "nvlist"
  [devices]
        devices.offset = 0x2b9d (11165)
        devices.size = 0x10145ba (16860602)
        devices.type = "nvlist"
  [memory]
        memory.offset = 0x1200000 (18874368)
        memory.size = 0x3ce00000 (1021313024)
        memory.type = "binary"

 SECTIONS:

 [section "config" size 0x1f6 offset 0x1000]:
memory.size=1024M
x86.strictmsr=true
x86.vmexit_on_hlt=true
cpus=2
acpi_tables=true
pci.0.0.0.device=hostbridge
pci.0.31.0.device=lpc
pci.0.4.0.device=virtio-net
pci.0.4.0.backend=tap0
pci.0.7.0.device=fbuf
pci.0.7.0.tcp=10.42.0.78:5900
pci.0.7.0.w=1024
pci.0.7.0.h=768
pci.0.5.0.device=ahci
pci.0.5.0.port.0.type=cd
pci.0.5.0.port.0.path=/ISO/ubuntu-22.04.1-live-server-amd64.iso
lpc.bootrom=/usr/local/share/uefi-firmware/BHYVE_UEFI.fd
checkpoint.date="Wed Jan 25 23:48:29 2023"
name=ubuntu22

 [section "kernel" size 0x19a7 offset 0x11f6]:
   [vm]
        vm.vds_version = 0x1 (1)
        vm.cpu0.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00 00 00 00 ...  size=0x28
        vm.cpu1.data(BINARY): 00 00 00 00 0D 00 00 00 01 00 00 00 00 00 00 00 ...  size=0x28
        vm.checkpoint_tsc = 0xe2e0ac6fbe456 (3991273496896598)
   [hpet]
        hpet.vds_version = 0x1 (1)
        hpet.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...  size=0x118
   [vmx]
        vmx.vds_version = 0x1 (1)
        vmx.cpu_features = 0 (0)
        vmx.cpu0.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 40 B4 21 B9 FF FF FF FF ...  size=0x288
        vmx.cpu1.vmx_data(BINARY): F0 CC 15 B8 FF FF FF FF 00 00 67 41 D8 9B FF FF ...  size=0x288
   [ioapic]
        ioapic.vds_version = 0x1 (1)
        ioapic.data(BINARY): 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 ...  size=0x208
   [lapic]
        lapic.vds_version = 0x1 (1)
        lapic.cpu0.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...  size=0x460
        lapic.cpu1.data(BINARY): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...  size=0x460
   [atpit]
        atpit.vds_version = 0x1 (1)
        atpit.data(BINARY): 00 00 00 00 00 00 00 00 54 AD 51 97 0F 0E 00 00 ...  size=0xa0
   [atpic]
        atpic.vds_version = 0x1 (1)
        atpic.data(BINARY): 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ...  size=0x84
   [pmtimer]
        pmtimer.vds_version = 0x1 (1)
        pmtimer.uptime = 0x26fd133e5cc (2679274464716)
   [rtc]
        rtc.vds_version = 0x1 (1)
        rtc.data(BINARY): 0A 00 00 00 00 FE FF FF 10 35 13 3D 40 F7 14 00 ...  size=0x98
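
The "config" section in the example is plain key=value text, so it can be
read without any nvlist support. A minimal Python sketch, assuming one pair
per line, a split on the first "=", and that surrounding double quotes are
not part of the value (none of which the proposal spells out):

```python
def parse_config(text: str) -> dict[str, str]:
    """Parse 'key=value' lines into a dict; blank lines are skipped."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        # Assumption: surrounding double quotes are not part of the value.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        config[key] = value
    return config
```

Values are kept as strings; interpreting "1024M" or "true" is left to the
consumer.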

--
Thanks,
Vitaliy Gusev