Re: BHYVE SNAPSHOT image format proposal

From: Vitaliy Gusev <gusev.vitaliy_at_gmail.com>
Date: Wed, 24 May 2023 15:10:49 UTC
Hi Tomek,

I’ll try to answer all the questions below; please let me know if I miss something important.


> On 23 May 2023, at 21:58, Tomek CEDRO <tomek@cedro.info> wrote:
> 
> On Tue, May 23, 2023 at 6:06 PM Vitaliy Gusev wrote:
>> Hi,
>> Here is a proposal for bhyve snapshot/checkpoint image format improvements.
>> It implies moving snapshot code to nvlist engine.
> 
> Hey there Vitaliy :-) bhyve getting more and more traction, I am new
> user of bhyve and no expert, but new and missing features are welcome
> I guess.. there was a discussion on the mailing lists recently on
> better snapshots mechanism :-)
> 
> 
>> Current snapshot implementation has disadvantages:
>> 3 files per snapshot: .meta, .kern, vram
> 
> No problem, unless new single file will be protected against
> corruption (filesystem, transfer, application crash) and possible to
> be easily and cheaply modified in place?

The current snapshot implementation doesn’t have such protection either. I would go further: the
current pkg implementation doesn’t track or notify when files are changed. Binary files on a
system, ELF files for example, can be changed without any notification.

Tar doesn’t protect the data it stores. Some filesystems, like ZFS,
guarantee that data is not modified by the underlying disks.

Protection requires more effort, and its purpose should be clearly defined. If the
purpose is a checksum with 99.9% reliability, the NVLIST HEADER can be extended
with a “checksum” key/value for each Section.
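
A minimal sketch of what a per-Section checksum could look like, written in Python rather than with bhyve's actual nvlist code. The key names ("kernel.checksum" and so on) and the choice of CRC32 are assumptions for illustration, not part of the proposal:

```python
import zlib

def make_header_entry(name, offset, payload):
    """Describe one section in the header; key names are hypothetical."""
    return {
        f"{name}.offset": offset,
        f"{name}.size": len(payload),
        f"{name}.checksum": zlib.crc32(payload),
    }

def section_is_intact(name, header, image):
    """Recompute the checksum over the section bytes and compare."""
    start = header[f"{name}.offset"]
    end = start + header[f"{name}.size"]
    return zlib.crc32(image[start:end]) == header[f"{name}.checksum"]

# Toy image: 16 bytes of padding followed by one section payload.
payload = b"example kernel section payload"
image = b"\x00" * 16 + payload
header = make_header_entry("kernel", 16, payload)

# Flip one byte inside the section to simulate corruption.
corrupted = image[:20] + b"\xff" + image[21:]
```

A checksum like this catches accidental corruption only; it does not protect against deliberate modification.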

If the purpose is cryptographic verification, I believe the sha256 program should be your choice.

> 
>> Binary Stream format of data.
> 
> This is small and fast? Will new format too?

Small is not everything. As a first attempt, the current snapshot code is good. But if you want to get
the values related to a specific device, for example a NIC or the HPET, you cannot get them easily. Please
try :)

A stream doesn’t have flexibility. It is good for well-specified, long-discussed protocols
like XDR (NFS), which have an RFC describing each position in the stream. Example: RFC 1813.

The new format with NVLIST is flexible and fast enough. Note that ZFS uses nvlist for keeping attributes
and many other things.
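
To illustrate the difference, here is a small Python sketch. The field names and layouts are made up and do not come from the bhyve sources. A reader that decodes by position gets a wrong value as soon as the writer inserts a field, while a reader that looks up by key is unaffected:

```python
import struct

# Positional stream: the reader must know the order and width of every field.
OLD_LAYOUT = "<IQ"    # hypothetical: hpet_config (u32), hpet_counter (u64)
NEW_LAYOUT = "<IIQ"   # a newer writer inserts one extra u32 in the middle

new_stream = struct.pack(NEW_LAYOUT, 1, 7, 123456)

def read_old_positional(buf):
    """Old reader: decodes by position using the layout it was built with."""
    size = struct.calcsize(OLD_LAYOUT)
    _config, counter = struct.unpack(OLD_LAYOUT, buf[:size])
    return counter

# Key/value data: the same values, addressed by name.
keyed = {"hpet.config": 1, "hpet.new_flag": 7, "hpet.counter": 123456}

def read_old_keyed(values):
    """Old reader: looks up only the key it knows and ignores the rest."""
    return values["hpet.counter"]
```

The positional reader returns garbage for the counter without reporting any error, which is the silent failure described in the original proposal.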


>> Adding  optional variable - breaks resume
>> Removing variable - breaks resume
>> Changing saved order of variables - breaks resume
> 
> Obviously need improvement :-)
> 
>> Hard to get information about what is saved and decode.
>> Hard to debug if somethings goes wrong
> 
> Additional tools missing? Will new format allow text editor interaction?

Why do you need to modify a snapshot image? Could you describe this in more detail? Do you
modify the current 3 snapshot files?


>> No versions. If change code, resume of an old images can be
>> passed, but with UB.
> 
> Is new format future proof and provides backward compatibility?

The intention of moving to the new format is to keep backward compatibility when
the code changes in one of these ways:
- Adding an optional variable
- Removing a variable that is not used anymore
- Changing the order in which variables are saved
- “Hot fixes”

If changes are critical and incompatible, the restore stage should have clear information about
the incompatibility and abort the resume. Ideally, it should be informed even before the restore
process starts. For this purpose, the new format introduces versions.
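
A rough Python sketch of this idea, not the actual bhyve code: the "version" key, the version numbers, and the device keys are all assumptions for illustration. The version is checked before any state is restored, and optional keys fall back to a default when an older image doesn't contain them:

```python
SUPPORTED_VERSION = 2  # hypothetical format version understood by this reader

class IncompatibleImage(Exception):
    """Raised before restore starts when the image cannot be resumed."""

def check_version(header):
    """Reject an incompatible image before any device state is touched."""
    version = header.get("version", 1)
    if version > SUPPORTED_VERSION:
        raise IncompatibleImage(
            f"image version {version} is newer than supported {SUPPORTED_VERSION}"
        )
    return version

def restore_device(saved):
    """Look up values by key, so saved order and unknown keys do not matter."""
    return {
        "counter": saved["counter"],           # required key
        "irq_mask": saved.get("irq_mask", 0),  # optional key, default for old images
    }

# An old image: no "irq_mask" yet, plus a key that the new code no longer uses.
old_image_device = {"obsolete_field": 9, "counter": 5}
restored = restore_device(old_image_device)

try:
    check_version({"version": 3})
    rejected = False
except IncompatibleImage:
    rejected = True
```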


> 
>> New nvlist implementation should solve all things above. The first step -
>> improve snapshot/checkpoint saving format. It eliminates three files usage
>> per a snapshot.
>> 
>> (..)
> 
> So this will be new text config based format with variable = value and sections?

This is the NVLIST approach with key=value, where the key is a string and the value can be
an integer, array, string, etc.

> 
> How much bigger will be the overal file size increase?

Not by much. NVLIST internals are well specified. For example, for my VM:

  [kernel]
        kernel.offset = 0x11f6 (4598)
        kernel.size = 0x19a7 (6567)
        kernel.type = "nvlist"
  [devices]
        devices.offset = 0x2b9d (11165)
        devices.size = 0x10145ba (16860602)
        devices.type = "nvlist"

So the packed size for kernel is 6567 bytes and for devices is 16860602 bytes, including
the 16MB framebuffer. Without fbuf, the packed nvlist devices Section is 83386 bytes.
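
The numbers in the listing can be cross-checked with a short Python sketch. The index values are copied from the listing above; the helper names are made up, and treating the framebuffer as exactly 16 MiB is an assumption that happens to match the 83386 figure:

```python
MiB = 1024 * 1024

# Section index copied from the listing above.
index = {
    "kernel": {"offset": 0x11F6, "size": 0x19A7},
    "devices": {"offset": 0x2B9D, "size": 0x10145BA},
}

def section_end(idx, name):
    """First byte offset after the named section."""
    return idx[name]["offset"] + idx[name]["size"]

def read_section(image, idx, name):
    """Slice one section out of an image using only its offset and size."""
    start = idx[name]["offset"]
    return image[start:section_end(idx, name)]

# The kernel section ends exactly where the devices section begins.
kernel_end = section_end(index, "kernel")

# Devices section size with a 16 MiB framebuffer subtracted.
devices_without_fbuf = index["devices"]["size"] - 16 * MiB

# Toy image to show the lookup: a 4-byte header followed by two sections.
toy_index = {"a": {"offset": 4, "size": 3}, "b": {"offset": 7, "size": 2}}
toy_image = b"HDR!" + b"abc" + b"de"
```

With an index like this, a reader can seek directly to a single Section without decoding the ones before it.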


> 
> How much longer it will take do decode/encode/process files?

It is fast, just several milliseconds. NVLIST is a very fast format. It is already integrated
into bhyve as the config engine.


> 
> What is the possibility of format change and backward/foward compatibility?

If you are talking about compatibility of the image format, it should be compatible in
both directions, at least for moderate format changes.

If we consider overall snapshot/resume compatibility, I believe forward compatibility
is not a goal. Indeed, why would you need to resume an image created by
a newer version of the program?

The most important thing is backward compatibility, i.e. when an image is created
by an older version of the program but should be resumed on a newer one.

This is the target and intention of this improvement.

> 
> Have you considered efficiency comparison of current format, proposed
> format, and maybe using SQLITE or JSON storage/parsers?  For instance
> sqlite would be blazingly fast but hard to migrate. json would be most
> versatile but more time/memory consuming?

Yes, I know about other formats, like JSON. NVLIST is the most
efficient and suitable for the current purposes.

> 
> Maybe EFL approach of storing configuration files for limited
> resources embedded system storage that use binary storage data but can
> be decompressed in chunks that can be replaced in place?
> https://www.enlightenment.org/develop/efl/start

There are many things that could be used, but the choice should be well known, easy, stable,
fast and supportable. I believe NVLIST is the best choice.

> 
> Sorry for asking those questions but there may be already good and
> verified solutions out there not to reinvent the wheel? :-)

Thank you for your questions. If you would like, you can try to test the new implementation and give feedback.

———
Vitaliy Gusev