From: Mario Marietto <marietto2008@gmail.com>
Date: Wed, 24 May 2023 19:33:48 +0200
Subject: Re: BHYVE SNAPSHOT image format proposal
To: Vitaliy Gusev
Cc: Tomek CEDRO, virtualization@freebsd.org, freebsd-hackers@freebsd.org

@gusev.vitaliy@gmail.com: Could you explain how to test the new "snapshot" feature? I'm interested in testing and stressing it on my system. Is it ready to be used?

On Wed, May 24, 2023 at 5:11 PM Vitaliy Gusev <gusev.vitaliy@gmail.com> wrote:
Hi Tomek,

I’ve tried to answer all the questions below; please let me know if I missed anything important.

On 23 May 2023, at 21:58, Tomek CEDRO <tomek@cedro.info> wrote:

On Tue, May 23, 2023 at 6:06 PM Vitaliy Gusev wrote:
Hi,
Here is a proposal for bhyve snapshot/checkpoint image format improvements.
It implies moving the snapshot code to the nvlist engine.

Hey there Vitaliy :-) bhyve is getting more and more traction. I am a new
user of bhyve and no expert, but new and missing features are welcome,
I guess. There was a recent discussion on the mailing lists about
a better snapshot mechanism :-)


The current snapshot implementation has disadvantages:
3 files per snapshot: .meta, .kern, vram

No problem, but will the new single file be protected against
corruption (filesystem, transfer, application crash) and be easy and
cheap to modify in place?
The current snapshot implementation doesn’t have that. I would go further: the current
pkg implementation doesn’t track or notify when some files are changed. Binary files on a
system, for example ELF files, can be changed without any notification.

Tar doesn’t protect the data it stores, either. Some filesystems, like ZFS,
guarantee that data is not modified by the underlying disks.

Protection requires more effort, and its purpose should be clearly defined. If
the purpose is a checksum with 99.9% reliability, the NVLIST HEADER can be extended
to have a “checksum” key/value for a Section.

If the purpose is cryptographic verification, I believe the sha256 program should be your choice.
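To make the per-Section “checksum” idea concrete, here is a minimal Python sketch. It assumes a Section can be modeled as a dict holding its packed bytes. The key names `data` and `checksum` and the helpers `seal_section`/`verify_section` are hypothetical; they are not the bhyve or libnv layout. CRC32 (`zlib.crc32`) stands in for a cheap check against accidental corruption. SHA-256 (`hashlib.sha256`) covers the cryptographic case and yields the same hex digest that sha256(1) reports for a file.

```python
import hashlib
import zlib


def seal_section(payload: bytes) -> dict:
    """Wrap packed section bytes with a hypothetical "checksum" key."""
    return {
        "data": payload,
        # CRC32 catches accidental corruption (bad disk, truncated copy).
        "checksum": zlib.crc32(payload),
    }


def verify_section(section: dict) -> bool:
    """Recompute the CRC32 over the stored bytes and compare."""
    return zlib.crc32(section["data"]) == section["checksum"]


def sha256_hex(image: bytes) -> str:
    """Cryptographic digest of a whole image, as a hex string."""
    return hashlib.sha256(image).hexdigest()


section = seal_section(b"hpet registers ...")
print(verify_section(section))    # True

# Flip one byte: CRC32 is guaranteed to detect a single-byte change.
corrupted = dict(section, data=b"hpet registerX ...")
print(verify_section(corrupted))  # False
```

A CRC only guards against accidents. Anyone who can rewrite the data can also rewrite the checksum, which is why a separately stored cryptographic digest is the better fit for verification.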


Binary Stream format of data.

This one is small and fast. Will the new format be, too?
Small is not everything. As a first attempt, the snapshot code is good. But if you want to get
values related to a specific device, for example the NIC or HPET, you cannot get them easily. Please
try :)

A stream doesn’t have flexibility. It is good for well-specified, long-discussed protocols
like XDR (NFS), where there is an RFC and each position in the stream is described, for example RFC 1813.

The new NVLIST-based format is flexible and fast enough. Note that ZFS uses nvlist for storing attributes
and many other things.
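The NIC/HPET point above can be illustrated with a small Python sketch. It contrasts keyed access with a positional stream. The device names, field names, and byte layout are assumptions made for illustration and do not describe bhyve's real saved state.

```python
import struct

# Hypothetical device state, keyed by device name as an nvlist-style
# "devices" Section could store it (names and fields are illustrative).
devices = {
    "virtio-net": {"queues": 2},
    "hpet": {"config": 0x3, "counter": 123456789},
}

# Keyed access: read the HPET counter without decoding any other device.
print(devices["hpet"]["counter"])  # 123456789

# The same values as a positional stream (assumed layout: NIC queues u32,
# HPET config u32, HPET counter u64, little-endian, no padding).
stream = struct.pack("<IIQ", 2, 0x3, 123456789)

# To read the counter, the reader must already know that everything before
# it occupies exactly 8 bytes; any change to the NIC's saved data moves it.
(counter,) = struct.unpack_from("<Q", stream, 8)
print(counter)  # 123456789
```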


Adding an optional variable - breaks resume
Removing a variable - breaks resume
Changing the saved order of variables - breaks resume
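The three failure modes listed above can be reproduced with a positional stream in a short Python sketch. It assumes a toy image in which "device A" saves `irq` and `mode` and "device B" saves one value, all as little-endian u32s. This is a hypothetical layout, not bhyve's. Each reader below stands for a newer program version that changed device A's saved fields.

```python
import struct

# Writer v1 (hypothetical): device A saves (irq, mode), then device B saves
# one value, all as little-endian u32s written back to back.
image = struct.pack("<III", 11, 2, 0xBEEF)


def restore_removed(buf: bytes) -> int:
    # Reader that dropped "mode": device B is now assumed to start at byte 4.
    (irq,) = struct.unpack_from("<I", buf, 0)
    (dev_b,) = struct.unpack_from("<I", buf, 4)
    return dev_b


def restore_reordered(buf: bytes) -> tuple:
    # Reader that saves/loads (mode, irq) instead of (irq, mode).
    mode, irq = struct.unpack_from("<II", buf, 0)
    return irq, mode


def restore_added(buf: bytes) -> int:
    # Reader that inserted an optional "flags" u32 after "mode": "flags"
    # silently receives device B's value, and device B runs off the end.
    irq, mode, flags = struct.unpack_from("<III", buf, 0)
    (dev_b,) = struct.unpack_from("<I", buf, 12)
    return dev_b


print(hex(restore_removed(image)))  # 0x2: device B gets "mode", silently wrong
print(restore_reordered(image))     # (2, 11): irq and mode swapped
try:
    restore_added(image)
except struct.error as exc:
    print("resume fails:", exc)
```

Two of the three changes produce wrong values with no error at all. This matches the "resume can pass, but with UB" concern raised later in the thread.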

Obviously this needs improvement :-)

Hard to get information about what is saved and to decode it.
Hard to debug if something goes wrong

Are additional tools missing? Will the new format allow text editor interaction?

Why do you need to modify a snapshot image? Could you describe this in more detail? Do you
modify the current 3 snapshot files?


No versions. If the code changes, resuming an old image can
succeed, but with undefined behavior (UB).

Is the new format future-proof, and does it provide backward compatibility?

The intention of moving to the new format is to have backward compatibility if some code
is changed:
  • Adding an optional variable
  • Removing a variable that is no longer used
  • Changing the order in which variables are saved
  • “Hot fixes”.

If changes are critical and incompatible, the restore stage should have clear information about
the incompatibility and abort the resume. Ideally, this should be detectable even before the
restore process starts. For this purpose, the new format introduces versions.
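One way a version check could reject an image before any device state is touched is sketched below in Python. The thread does not specify a versioning scheme. The `(major, minor)` pair, the `version` header key, and the major/minor policy are all assumptions. Here a major bump means an incompatible change, and a minor bump means optional additions only.

```python
# Hypothetical versioning policy: the snapshot header carries (major, minor).
# A different major marks an incompatible change; a newer minor may carry
# data this reader does not understand.
SUPPORTED_MAJOR = 1
SUPPORTED_MINOR = 2


class IncompatibleSnapshot(Exception):
    """Raised before restore starts, so no device is left half-restored."""


def check_version(header: dict) -> None:
    major, minor = header["version"]
    if major != SUPPORTED_MAJOR:
        raise IncompatibleSnapshot(
            f"image format {major}.{minor}, this reader supports {SUPPORTED_MAJOR}.x")
    if minor > SUPPORTED_MINOR:
        # Newer images are refused: resuming on an older program is not a goal.
        raise IncompatibleSnapshot(
            f"image format {major}.{minor} is newer than "
            f"{SUPPORTED_MAJOR}.{SUPPORTED_MINOR}")


check_version({"version": (1, 0)})  # older minor version: accepted
```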



The new nvlist implementation should solve all the issues above. The first step is to
improve the snapshot/checkpoint saving format. It eliminates the use of three files
per snapshot.
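Here is a minimal Python sketch of the single-file layout this implies: a header area followed by sections stored back to back. Each section is located by an offset/size/type entry, as in the `[kernel]`/`[devices]` listing later in this thread. The helpers `build_image` and `read_section` are hypothetical. The header is left as zero padding here; presumably it would hold the packed NVLIST HEADER with the section table.

```python
# Hypothetical single-file layout: a header area followed by the sections
# ("kernel", "devices", ...) stored back to back. The header is zero padding
# in this sketch; a real image would store the section table there.
def build_image(sections, header_size):
    """Return (table, image) with each section's offset, size and type."""
    table = {}
    body = bytearray()
    offset = header_size
    for name, payload in sections.items():
        table[name] = {"offset": offset, "size": len(payload), "type": "nvlist"}
        body += payload
        offset += len(payload)
    return table, bytes(header_size) + bytes(body)


def read_section(image, table, name):
    """Slice one section out of the image using only its table entry."""
    entry = table[name]
    return image[entry["offset"]:entry["offset"] + entry["size"]]


table, image = build_image({"kernel": b"K" * 10, "devices": b"D" * 25}, header_size=64)
print(table["devices"])                                   # {'offset': 74, 'size': 25, 'type': 'nvlist'}
print(read_section(image, table, "kernel") == b"K" * 10)  # True
```

A reader that only needs one section can seek straight to it, instead of parsing three separate files in a fixed order.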

(..)

So this will be a new text-config-based format with variable = value and sections?

This is an NVLIST approach with key=value, where the key is a string and the value can be
an integer, array, string, etc.
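To show what "key is a string, value is a typed integer/array/string" means at the byte level, here is a toy Python encoder/decoder. Each record is stored as key length, key, type tag, then value. The type tags and framing are invented for this sketch. libnv has its own, different pack format, so this illustrates the idea only, not the nvlist wire format.

```python
import struct

# Hypothetical type tags; libnv defines its own, different wire format.
T_NUMBER, T_STRING, T_NUMBER_ARRAY = 1, 2, 3


def pack(pairs):
    """Encode {str: int | str | list[int]} as key/type/value records."""
    out = bytearray()
    for key, value in pairs.items():
        k = key.encode()
        out += struct.pack("<H", len(k)) + k  # key length + key bytes
        if isinstance(value, int):
            out += struct.pack("<BQ", T_NUMBER, value)
        elif isinstance(value, str):
            v = value.encode()
            out += struct.pack("<BI", T_STRING, len(v)) + v
        elif isinstance(value, list):
            out += struct.pack("<BI", T_NUMBER_ARRAY, len(value))
            out += struct.pack(f"<{len(value)}Q", *value)
        else:
            raise TypeError(f"unsupported value type for {key!r}")
    return bytes(out)


def unpack(buf):
    """Decode records produced by pack(); the type tag drives parsing."""
    pairs, pos = {}, 0

    def take(fmt):
        nonlocal pos
        values = struct.unpack_from(fmt, buf, pos)
        pos += struct.calcsize(fmt)
        return values

    def take_bytes(n):
        nonlocal pos
        chunk = buf[pos:pos + n]
        pos += n
        return chunk

    while pos < len(buf):
        (klen,) = take("<H")
        key = take_bytes(klen).decode()
        (tag,) = take("<B")
        if tag == T_NUMBER:
            (value,) = take("<Q")
        elif tag == T_STRING:
            (n,) = take("<I")
            value = take_bytes(n).decode()
        elif tag == T_NUMBER_ARRAY:
            (n,) = take("<I")
            value = list(take(f"<{n}Q"))
        else:
            raise ValueError(f"unknown type tag {tag}")
        pairs[key] = value
    return pairs


record = {"kernel.offset": 4598, "kernel.type": "nvlist", "cpus": [0, 1]}
print(unpack(pack(record)) == record)  # True
```

Because every value carries its own type and length, a reader can skip or inspect any entry without knowing the writer's field order. This is the property the stream format lacks.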


How big will the overall file size increase be?
Not that big. NVLIST internals are well specified. For example, for my VM:

  [kernel]
        kernel.offset = 0x11f6 (4598)
        kernel.size = 0x19a7 (6567)
        kernel.type = "nvlist"

  [devices]
        devices.offset = 0x2b9d (11165)
        devices.size = 0x10145ba (16860602)
        devices.type = "nvlist"


So the packed size is 6567 bytes for *kernel* and 16860602 bytes for *devices*, including
the 16 MB framebuffer. Without the fbuf, the packed nvlist devices Section is 83386 bytes.
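The numbers in the section table above are self-consistent, and a quick Python check makes that visible. It assumes the framebuffer is exactly 16 MiB (16777216 bytes). With that assumption, subtracting it from the devices size gives the 83386 bytes quoted. The check also shows that the devices section starts exactly where the kernel section ends.

```python
# Section table copied from the example above (values in bytes).
table = {
    "kernel": {"offset": 0x11F6, "size": 0x19A7},
    "devices": {"offset": 0x2B9D, "size": 0x10145BA},
}
FBUF = 16 * 1024 * 1024  # assumed 16 MiB framebuffer inside "devices"

k, d = table["kernel"], table["devices"]
print(k["offset"], k["size"], d["offset"], d["size"])  # 4598 6567 11165 16860602
print(k["offset"] + k["size"] == d["offset"])          # True: sections are contiguous
print(d["size"] - FBUF)                                # 83386
print(d["offset"] + d["size"])                         # 16871767: end of devices section
```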



How much longer will it take to decode/encode/process files?

It is fast, just several milliseconds. NVLIST is a very fast format. It is already integrated
into bhyve as the config engine.



What are the chances of format changes, and what about backward/forward compatibility?

If you are talking about compatibility of the image format, it should be compatible in
both directions, at least for minor format changes.

If we consider overall snapshot/resume compatibility, I believe forward compatibility
is not a goal. Indeed, why would you need to resume an image created by
a newer version of the program?

The most important thing is backward compatibility, i.e. when an image is created
by an older version of the program but should be resumed on a newer one.

This is the target and intention of this improvement.
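Backward compatibility with keyed sections could look like the following Python sketch. A newer restore routine looks values up by key, so the save order no longer matters. It fills in defaults for optional keys that older images lack and ignores keys it no longer uses. It still refuses to resume when required state is missing. The device, its keys, and the `restore_device` helper are hypothetical.

```python
# Hypothetical restore routine of a newer version for one device's Section.
# Values are looked up by key, so the order an older writer used is irrelevant.
DEFAULTS = {"flags": 0}        # optional key added in the newer version
REQUIRED = ("irq", "mode")     # state the device cannot be restored without


def restore_device(section: dict) -> dict:
    missing = [key for key in REQUIRED if key not in section]
    if missing:
        # Required state is absent: refuse instead of resuming with UB.
        raise KeyError(f"snapshot lacks required keys: {missing}")
    state = {key: section[key] for key in REQUIRED}
    for key, default in DEFAULTS.items():
        state[key] = section.get(key, default)
    # Keys the newer code no longer uses (e.g. "legacy_mode") are ignored.
    return state


old_image = {"mode": 2, "legacy_mode": 1, "irq": 11}  # saved by an older version
print(restore_device(old_image))  # {'irq': 11, 'mode': 2, 'flags': 0}
```

This covers adding an optional variable, removing an unused one, and reordering, the three cases listed earlier. A truly incompatible change would still rely on the version check.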


Have you considered an efficiency comparison of the current format, the proposed
format, and maybe SQLite or JSON storage/parsers? For instance,
SQLite would be blazingly fast but hard to migrate. JSON would be the most
versatile but more time/memory consuming?

Yes, I know about other formats, like JSON and others. NVLIST is the most
effective and suitable for the current purposes.


Maybe the EFL approach of storing configuration files for resource-limited
embedded system storage, which uses binary storage data but can
be decompressed in chunks that can be replaced in place?
https://www.enlightenment.org/develop/efl/start

There are many things that could be used, but it should be well known, easy, stable,
fast and supportable. I believe NVLIST is the best choice.


Sorry for asking these questions, but maybe there are already good and
verified solutions out there, so we don't reinvent the wheel? :-)

Thank you for your questions. If you would like, you can test the new implementation and give feedback.

---
Vitaliy Gusev


--
Mario.