Re: BHYVE SNAPSHOT image format proposal

From: Mario Marietto <marietto2008_at_gmail.com>
Date: Wed, 24 May 2023 17:33:48 UTC
@gusev.vitaliy@gmail.com <gusev.vitaliy@gmail.com> : Could you explain
to me how to test the new "snapshot" feature? I'm interested in testing and
stressing it on my system. Is it ready to be used?

On Wed, May 24, 2023 at 5:11 PM Vitaliy Gusev <gusev.vitaliy@gmail.com>
wrote:

> Hi Tomek,
>
> I will try to answer all the questions below; please let me know if I
> miss something important.
>
>
> On 23 May 2023, at 21:58, Tomek CEDRO <tomek@cedro.info> wrote:
>
> On Tue, May 23, 2023 at 6:06 PM Vitaliy Gusev wrote:
>
> Hi,
> Here is a proposal for bhyve snapshot/checkpoint image format improvements.
> It implies moving snapshot code to nvlist engine.
>
>
> Hey there Vitaliy :-) bhyve is getting more and more traction. I am a new
> user of bhyve and no expert, but new and missing features are welcome,
> I guess. There was a discussion on the mailing lists recently about a
> better snapshot mechanism :-)
>
>
> Current snapshot implementation has disadvantages:
> 3 files per snapshot: .meta, .kern, vram
>
>
> No problem, as long as the new single file is protected against
> corruption (filesystem, transfer, application crash) and can be
> modified in place easily and cheaply?
>
>
> The current snapshot implementation doesn’t have it either. I would go
> further: the current pkg implementation doesn’t track or notify when some
> files are changed. Binary files on a system, for example ELF files, can
> be changed without any notification.
>
> Tar doesn’t protect the data it stores either. Some filesystems, like ZFS,
> guarantee that data is not modified by the underlying disks.
>
> Protection requires more effort, and its purpose should be clearly
> defined. If the purpose is a checksum with 99.9% reliability, the NVLIST
> HEADER can be widened to carry a “checksum” key/value for a Section.
>
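A minimal sketch of the per-Section checksum idea above, in Python, with a plain dict standing in for the nvlist header. The `checksum` key name, the helper functions and the choice of CRC32 are assumptions made for illustration and are not taken from the proposal. CRC32 only catches accidental corruption; it gives no cryptographic guarantee.

```python
import zlib


def make_section_header(name, payload):
    """Describe one packed section; 'checksum' is the hypothetical extra key."""
    return {
        f"{name}.size": len(payload),
        f"{name}.type": "nvlist",
        f"{name}.checksum": zlib.crc32(payload),
    }


def section_is_intact(name, header, payload):
    """Return True when the payload still matches the size and CRC32 in the header."""
    return (
        len(payload) == header[f"{name}.size"]
        and zlib.crc32(payload) == header[f"{name}.checksum"]
    )
```

A reader could call `section_is_intact()` on each Section before unpacking it and refuse to resume when the check fails.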
> If the purpose is cryptographic verification - I believe the sha256
> program should be your choice.
>
>
> Binary Stream format of data.
>
>
> This is small and fast? Will the new format be too?
>
>
> Small is not everything. As a first attempt, the snapshot code is good.
> But if you want to get the values related to a specific device, for
> example a NIC or the HPET, you cannot get them easily. Please try :)
>
> A stream lacks flexibility. It suits well-specified, long-discussed
> protocols like XDR (NFS), which have an RFC that describes each position
> in the stream. Example: RFC 1813.
>
> The new NVLIST-based format is flexible and fast enough. Note that ZFS
> uses nvlist for keeping attributes and many other things.
>
>
> Adding an optional variable - breaks resume
> Removing a variable - breaks resume
> Changing the saved order of variables - breaks resume
>
>
> Obviously need improvement :-)
>
> Hard to get information about what is saved and to decode it.
> Hard to debug if something goes wrong
>
>
> Are additional tools missing? Will the new format allow text editor
> interaction?
>
>
> Why do you need to modify a snapshot image? Could you describe this in
> more detail? Do you modify the current 3 snapshot files?
>
>
> No versions. If the code changes, resuming an old image can succeed,
> but with undefined behavior (UB).
>
>
> Is the new format future-proof, and does it provide backward
> compatibility?
>
>
> The intention of moving to the new format is to keep backward
> compatibility when some code is changed:
>
>
>    - Adding an optional variable
>    - Removing a variable that is no longer used
>    - Changing the order in which variables are saved
>    - “Hot Fixes”.
>
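A small Python sketch of why the changes listed above break a positional stream but not a keyed format. Plain dicts stand in for nvlist here, and the field names (`irq`, `mac`, `mtu`) and default value are hypothetical; this is not bhyve code.

```python
import struct


def save_stream_v1(irq, mac):
    """Old writer: the meaning of each field is its position in the stream."""
    return struct.pack("<I6s", irq, mac)


def load_stream_v2(buf):
    """Newer reader expecting one extra 32-bit field before the MAC.

    Raises struct.error on a v1 buffer because the layout no longer matches.
    """
    return struct.unpack("<II6s", buf)


def load_keyed(saved):
    """Keyed reader: looks fields up by name, so order and extras do not matter."""
    return {
        "irq": saved["irq"],
        "mac": saved["mac"],
        "mtu": saved.get("mtu", 1500),  # optional variable with a default
    }
```

With the keyed reader, an old image that lacks `mtu`, carries a no-longer-used key, or stores its keys in a different order still loads, while the positional reader rejects the old buffer outright.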
>
> If changes are critical and incompatible, the restore stage should get
> clear information about the incompatibility and abort the resume. Ideally
> it should find this out even before starting the restore process. For
> this purpose, the new format introduces versions.
>
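One way the version check described above could look, sketched in Python. The `version` key, the set of supported versions and the exception name are all assumptions for illustration; the thread does not specify how versions are stored.

```python
# Hypothetical set of image versions this build knows how to restore.
SUPPORTED_VERSIONS = {1, 2}


class IncompatibleSnapshot(Exception):
    """Raised before any state is restored when the image cannot be resumed."""


def check_version(header):
    """Validate the image version up front, before touching any VM state."""
    version = header.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise IncompatibleSnapshot(
            f"snapshot version {version!r} is not supported "
            f"(supported: {sorted(SUPPORTED_VERSIONS)})"
        )
    return version
```

Running this check on the header alone means an incompatible image is rejected with a clear message instead of failing partway through the restore.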
>
>
> The new nvlist implementation should solve all the problems above. The
> first step is to improve the snapshot/checkpoint saving format. It
> eliminates the use of three files per snapshot.
>
> (..)
>
>
> So this will be a new text-config-based format with variable = value
> pairs and sections?
>
>
> This is the NVLIST approach with key=value pairs, where the key is a
> string and the value can be an integer, array, string, etc.
>
>
> How much will the overall file size increase?
>
>
> Not by much. NVLIST internals are well specified. For example, for my VM:
>
>   [kernel]
>         kernel.offset = 0x11f6 (4598)
>         kernel.size = 0x19a7 (6567)
>         kernel.type = "nvlist"
>
>   [devices]
>         devices.offset = 0x2b9d (11165)
>         devices.size = 0x10145ba (16860602)
>         devices.type = "nvlist"
>
> So the packed size for *kernel* is 6567 bytes, and for *devices* it is
> 16860602 bytes, including the 16 MB framebuffer. Without fbuf, the packed
> nvlist devices Section is 83386 bytes.
>
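The figures quoted above can be cross-checked with a few lines of Python. Only the numbers come from the message; the dict layout and variable names are illustrative.

```python
# Section header values as quoted in the message (hex and decimal agree).
sections = {
    "kernel": {"offset": 0x11F6, "size": 0x19A7},
    "devices": {"offset": 0x2B9D, "size": 0x10145BA},
}

# The kernel Section ends exactly where the devices Section starts.
kernel_end = sections["kernel"]["offset"] + sections["kernel"]["size"]
assert kernel_end == sections["devices"]["offset"] == 11165

# devices with the framebuffer minus devices without it leaves the fbuf payload.
fbuf_bytes = sections["devices"]["size"] - 83386
assert fbuf_bytes == 16 * 1024 * 1024  # exactly 16 MiB
```

So the Sections are packed back to back, and the non-framebuffer device state accounts for only about 81 KiB of the devices Section.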
>
>
> How much longer will it take to decode/encode/process files?
>
>
> It is fast, just several milliseconds. NVLIST is a very fast format. It
> is already integrated into bhyve as the config engine.
>
>
>
> What is the possibility of format change and backward/forward
> compatibility?
>
>
> If you are talking about compatibility of the image format - it should
> be compatible in both directions, at least for small format changes.
>
> If we consider overall snapshot/resume compatibility - I believe forward
> compatibility is neither the use case nor the target. Indeed, why would
> you need to resume an image created by a newer version of the program?
>
> The most important thing is backward compatibility, i.e. when an image is
> created by an older version of the program but has to be resumed on a
> newer one.
>
> This is the target and intention of this improvement.
>
>
> Have you considered comparing the efficiency of the current format, the
> proposed format, and maybe SQLite or JSON storage/parsers? For instance,
> SQLite would be blazingly fast but hard to migrate; JSON would be the
> most versatile but more time/memory consuming?
>
>
> Yes, I know about other formats, like JSON. NVLIST is the most efficient
> and suitable for the current purposes.
>
>
> Maybe the EFL approach to storing configuration files on
> resource-limited embedded systems, which uses binary storage whose data
> can be decompressed in chunks that can be replaced in place?
> https://www.enlightenment.org/develop/efl/start
>
>
> Many things could be used, but the choice should be well known, easy,
> stable, fast and supportable. I believe NVLIST is the best choice.
>
>
> Sorry for asking all these questions, but there may already be good,
> proven solutions out there, so there is no need to reinvent the wheel? :-)
>
>
> Thank you for your questions. If you would like, you can try to test the
> new implementation and give feedback.
>
> ———
> Vitaliy Gusev
>
>

-- 
Mario.