Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75
- Reply: Pawel Jakub Dawidek : "Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75"
- Reply: Charlie Li : "Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75"
- In reply to: Cy Schubert : "Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75"
Date: Thu, 13 Apr 2023 14:05:04 UTC
On Thu, Apr 13, 2023 at 06:56:35AM -0700, Cy Schubert wrote:
> In message <CAGudoHG3rCx93gyJTmzTBnSe4fQ9=m4mBESWbKVWtAGRxen_4w@mail.gmail.com>,
> Mateusz Guzik writes:
> > On 4/13/23, Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> > > On Thu, 13 Apr 2023 19:54:42 +0900
> > > Paweł Jakub Dawidek <pawel@dawidek.net> wrote:
> > >
> > >> On Apr 13, 2023, at 16:10, Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> > >> >
> > >> > In message <20230413070426.8A54F25A@slippy.cwsent.com>, Cy Schubert
> > >> > writes:
> > >> >> In message <20230413064252.1E5C1318@slippy.cwsent.com>, Cy Schubert
> > >> >> writes:
> > >> >>> In message <A291C24C-9D7C-4E79-AD03-68ED910FC2DE@yahoo.com>, Mark
> > >> >>> Millard writes:
> > >> >>>> [This just puts my prior reply's material into Cy's
> > >> >>>> adjusted resend of the original. The To/Cc should
> > >> >>>> be complete this time.]
> > >> >>>>
> > >> >>>> On Apr 12, 2023, at 22:52, Cy Schubert <Cy.Schubert@cschubert.com>
> > >> >>>> wrote:
> > >> >>>>
> > >> >>>>> In message <C8E4A43B-9FC8-456E-ADB3-13E7F40B2B04@yahoo.com>, Mark
> > >> >>>>> Millard writes:
> > >> >>>>>> From: Charlie Li <vishwin_at_freebsd.org> wrote on
> > >> >>>>>> Date: Wed, 12 Apr 2023 20:11:16 UTC :
> > >> >>>>>>
> > >> >>>>>>> Charlie Li wrote:
> > >> >>>>>>>> Mateusz Guzik wrote:
> > >> >>>>>>>>> can you please test poudriere with
> > >> >>>>>>>>> https://github.com/openzfs/zfs/pull/14739/files
> > >> >>>>>>>>
> > >> >>>>>>>> After applying, on the md(4)-backed pool regardless of
> > >> >>>>>>>> block_cloning, the cy@ `cp -R` test reports no differing
> > >> >>>>>>>> (ie corrupted) files. Will report back on poudriere
> > >> >>>>>>>> results (no block_cloning).
> > >> >>>>>>>
> > >> >>>>>>> As for poudriere, build failures are still rolling in.
> > >> >>>>>>> These are (and have been) entirely random on every run. Some
> > >> >>>>>>> examples from this run:
> > >> >>>>>>>
> > >> >>>>>>> lang/php81:
> > >> >>>>>>> - post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development
> > >> >>>>>>>   ${WRKSRC}/php.ini-production ${WRKDIR}/php.conf
> > >> >>>>>>>   ${STAGEDIR}/${PREFIX}/etc
> > >> >>>>>>> - consumers fail to build due to corrupted php.conf packaged
> > >> >>>>>>>
> > >> >>>>>>> devel/ninja:
> > >> >>>>>>> - phase: stage
> > >> >>>>>>> - install -s -m 555
> > >> >>>>>>>   /wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja
> > >> >>>>>>>   /wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
> > >> >>>>>>> - consumers fail to build due to corrupted bin/ninja packaged
> > >> >>>>>>>
> > >> >>>>>>> devel/netsurf-buildsystem:
> > >> >>>>>>> - phase: stage
> > >> >>>>>>> - mkdir -p
> > >> >>>>>>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles
> > >> >>>>>>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
> > >> >>>>>>>   for M in Makefile.top Makefile.tools Makefile.subdir
> > >> >>>>>>>   Makefile.pkgconfig Makefile.clang Makefile.gcc
> > >> >>>>>>>   Makefile.norcroft Makefile.open64; do \
> > >> >>>>>>>   cp makefiles/$M
> > >> >>>>>>>   /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/; \
> > >> >>>>>>>   done
> > >> >>>>>>> - graphics/libnsgif fails to build due to NUL characters
> > >> >>>>>>>   in Makefile.{clang,subdir}, causing nothing to link
> > >> >>>>>>
> > >> >>>>>> Summary: I have problems building ports into packages
> > >> >>>>>> via poudriere-devel use despite being fully updated/patched
> > >> >>>>>> (as of when I started the experiment), never having enabled
> > >> >>>>>> block_cloning ( still using openzfs-2.1-freebsd ).
> > >> >>>>>>
> > >> >>>>>> In other words, I can confirm other reports that have
> > >> >>>>>> been made.
> > >> >>>>>>
> > >> >>>>>> The details follow.
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> [Written as I was working on setting up for the experiments
> > >> >>>>>> and then executing those experiments, adjusting as I went
> > >> >>>>>> along.]
> > >> >>>>>>
> > >> >>>>>> I've run my own tests in a context that has never had the
> > >> >>>>>> zpool upgrade and that jump from before the openzfs import to
> > >> >>>>>> after the existing commits for trying to fix openzfs on
> > >> >>>>>> FreeBSD. I report on the sequence of activities getting to
> > >> >>>>>> the point of testing as well.
> > >> >>>>>>
> > >> >>>>>> By personal policy I keep my (non-temporary) pool's compatible
> > >> >>>>>> with what the most recent ??.?-RELEASE supports, using
> > >> >>>>>> openzfs-2.1-freebsd for now. The pools involved below have
> > >> >>>>>> never had a zpool upgrade from where they started. (I've no
> > >> >>>>>> pools that have ever had a zpool upgrade.)
> > >> >>>>>>
> > >> >>>>>> (Temporary pools are rare for me, such as this investigation.
> > >> >>>>>> But I'm not testing block_cloning or anything new this time.)
> > >> >>>>>>
> > >> >>>>>> I'll note that I use zfs for bectl, not for redundancy. So
> > >> >>>>>> my evidence is more limited in that respect.
> > >> >>>>>>
> > >> >>>>>> The activities were done on a HoneyComb (16 Cortex-A72 cores).
> > >> >>>>>> The system has and supports ECC RAM, 64 GiBytes of RAM are
> > >> >>>>>> present.
> > >> >>>>>>
> > >> >>>>>> I started by duplicating my normal zfs environment to an
> > >> >>>>>> external USB3 NVMe drive and adjusting the host name and such
> > >> >>>>>> to produce the below. (Non-debug, although I do not strip
> > >> >>>>>> symbols.) :
> > >> >>>>>>
> > >> >>>>>> # uname -apKU
> > >> >>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90
> > >> >>>>>> main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023
> > >> >>>>>> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
> > >> >>>>>> arm64 aarch64 1400082 1400082
> > >> >>>>>>
> > >> >>>>>> I then did: git fetch, stash push ., merge --ff-only, stash apply. :
> > >> >>>>>> my normal procedure. I then also applied the patch from:
> > >> >>>>>>
> > >> >>>>>> https://github.com/openzfs/zfs/pull/14739/files
> > >> >>>>>>
> > >> >>>>>> Then I did: buildworld buildkernel, install them, and rebooted.
> > >> >>>>>>
> > >> >>>>>> The result was:
> > >> >>>>>>
> > >> >>>>>> # uname -apKU
> > >> >>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91
> > >> >>>>>> main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023
> > >> >>>>>> root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
> > >> >>>>>> arm64 aarc[...]
> > >> >>>>>>
> > >> >>>>>> The later poudriere-devel based build of packages from ports is
> > >> >>>>>> based on:
> > >> >>>>>>
> > >> >>>>>> # ~/fbsd-based-on-what-commit.sh -C /usr/ports
> > >> >>>>>> 4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD)
> > >> >>>>>> devel/freebsd-gcc12: Bump to 12.2.0.
> > >> >>>>>> Author: John Baldwin <jhb@FreeBSD.org>
> > >> >>>>>> Commit: John Baldwin <jhb@FreeBSD.org>
> > >> >>>>>> CommitDate: 2023-03-25 00:06:40 +0000
> > >> >>>>>> branch: main
> > >> >>>>>> merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
> > >> >>>>>> merge-base: CommitDate: 2023-03-25 00:06:40 +0000
> > >> >>>>>> n613214 (--first-parent --count for merge-base)
> > >> >>>>>>
> > >> >>>>>> poudriere attempted to build 476 packages, starting
> > >> >>>>>> with pkg (in order to build the 56 that I explicitly
> > >> >>>>>> indicate that I want). It is my normal set of ports.
> > >> >>>>>> The form of building is biased to allowing a high
> > >> >>>>>> load average compared to the number of hardware
> > >> >>>>>> threads (same as cores here): each builder is allowed
> > >> >>>>>> to use the full count of hardware threads. The build
> > >> >>>>>> [...] used USE_TMPFS="data" instead of the
> > >> >>>>>> USE_TMPFS=all I normally use on the build machine involved.
> > >> >>>>>>
> > >> >>>>>> And it produced some random errors during the attempted
> > >> >>>>>> builds. A type of example that is easy to interpret
> > >> >>>>>> without further exploration is:
> > >> >>>>>>
> > >> >>>>>> pkg_resources.extern.packaging.requirements.InvalidRequirement:
> > >> >>>>>> Parse error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected
> > >> >>>>>> W:(0-9A-Za-z)
> > >> >>>>>> 0
> > >> >>>>>> da0p8 ONLINE 0 0 0
> > >> >>>>>>
> > >> >>>>>> errors: No known data errors
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> ===
> > >> >>>>>> Mark Millard
> > >> >>>>>> marklmi at yahoo.com
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>> Let's try this again. Claws-mail didn't include the list address in
> > >> >>>>> the header.
> > >> >>>>> Trying to reply, again, using exmh instead.
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> Did your pools suffer the EXDEV problem? The EXDEV also corrupted
> > >> >>>>> files.
> > >> >>>>
> > >> >>>> As I reported, this was a jump from before the import
> > >> >>>> to as things are tonight (here). So: NO, unless the
> > >> >>>> existing code as of tonight still has the EXDEV problem!
> > >> >>>>
> > >> >>>> Prior to this experiment I'd not progressed any media
> > >> >>>> beyond: main-n261544-cee09bda03c8-dirty Wed Mar 15 20:25:49.
> > >> >>>>
> > >> >>>>> I think, without sufficient investigation we risk jumping to
> > >> >>>>> conclusions. I've taken an extremely cautious approach, rolling
> > >> >>>>> back snapshots (as much as possible, i.e. poudriere datasets)
> > >> >>>>> when EXDEV corruption was encountered.
> > >> >>>>
> > >> >>>> Again: nothing between main-n261544-cee09bda03c8-dirty and
> > >> >>>> main-n262122-2ef2c26f3f13-dirty was involved at any stage.
> > >> >>>>
> > >> >>>>> I did not rollback any snapshots in my MH mail directory. Rolling
> > >> >>>>> back snapshots of my MH maildir would result in loss of email. I
> > >> >>>>> have to live with that corruption. Corrupted files in my outgoing
> > >> >>>>> sent email directory remain:
> > >> >>>>>
> > >> >>>>> slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1
> > >> >>>>> 53
> > >> >>>>> slippy$
> > >> >>>>>
> > >> >>>>> There are 53 corrupted files in my note log of 9913 emails. Those
> > >> >>>>> files will never be fixed. They were corrupted by the EXDEV bug.
> > >> >>>>> Any new ZFS or ZFS patches cannot retroactively remove the
> > >> >>>>> corruption from those files.
> > >> >>>>>
> > >> >>>>> But my poudriere files, because the snapshots were rolled back,
> > >> >>>>> were "repaired" by the rolled back snapshots.
> > >> >>>>>
> > >> >>>>> I'm not convinced that there is presently active corruption since
> > >> >>>>> the problem has been fixed. I am convinced that whatever
> > >> >>>>> corruption that was written at the time will remain forever or
> > >> >>>>> until those files are deleted or replaced -- just like my email
> > >> >>>>> files written to disk at the time.
> > >> >>>>
> > >> >>>> My test results and procedure just do not fit your conclusion
> > >> >>>> that things are okay now if block_cloning is completely avoided.
> > >> >>>
> > >> >>> Admitting I'm wrong: sending copies of my last reply to you back to
> > >> >>> myself, again and again, three times, I've managed to reproduce the
> > >> >>> corruption you are talking about.
> > >> >>
> > >> >> This email itself was also corrupted. Below is what was sent. Good
> > >> >> thing multiple copies are saved by exmh.
> > >> >>
> > >> >> Admitting I'm wrong: sending copies of my last reply to you back to
> > >> >> myself, again and again, three times, I've managed to reproduce the
> > >> >> corruption you are talking about.
> > >> >
> > >> > This email itself was also corrupted. Below is what was sent. Good
> > >> > thing multiple copies are saved by exmh.
> > >> >
> > >> > Admitting I'm wrong: sending copies of my last reply to you back to
> > >> > myself, again and again, three times, I've managed to reproduce the
> > >> > corruption you are talking about.
> > >> >
> > >> > From my previous email to you.
> > >> >
> > >> > header. Trying to reply:::::::::, again, using exmh instead.
> > >> >                        ^^^^^^^^^
> > >> > Here it is, nine additional bytes of garbage. I've replaced the
> > >> > garbage with colons because nulls mess up a lot of things,
> > >> > including cut&paste.
> > >> >
> > >> > In another instance about 500 bytes were removed.
> > >> > I can reproduce the corruption at will now.
> > >> >
> > >> > The EXDEV patch is applied. Block_cloning is disabled.
> > >> >
> > >> > Somehow nulls and other garbage are inserted in the middle of emails
> > >> > after the ZFS upgrade.
> > >> >
> > >> Can you please try this patch:
> > >>
> > >> github.com
> > >
> > > The patch was applied yesterday at noon (PDT).
> > >
> > >>
> > >> Unfortunately I don't see how this can happen with block cloning
> > >> disabled.
> > >
> > > It does and it's reproducible.
> > >
> >
> > There is corruption with the recent import, with the
> > https://github.com/openzfs/zfs/pull/14739/files patch applied and
> > block cloning disabled on the pool.
>
> Same here.
>
> >
> > There is no corruption with top of main with zfs merge reverted altogether.
>
> I'm in the process of building a branch reverting the merge altogether and
> will test it on my sandbox machine later today.
>
> >
> > Which commit results in said corruption remains to be seen, a variant
> > of the tree with just block cloning support reverted just for testing
> > purposes is about to be evaluated.

I've learned over the years downstream that it's not really my place
to tell upstream what to do or how to do it. However, I think given
the seriousness of this, upstream might do well to revert the commit
until a solid fix is in place. Upstream might want to consider the
impacts this is having not just with downstream projects, but also
regular users.

Really bad timing to have a lot of new tax documentation that I really
don't want to lose. I'd really like to have an up-to-date, security
patched OS, but I guess I'll stay behind so that I don't risk losing
critical financial documentation.

Does the ZFS project have some sort of automated testing to catch
data-gobbling, pool killing bugs? It seems like this would have been
caught with some CI/CD stress testing automation scripts.

Thanks,

-- 
Shawn Webb
Cofounder / Security Engineer
HardenedBSD

https://git.hardenedbsd.org/hardenedbsd/pubkeys/-/raw/master/Shawn_Webb/03A4CBEBB82EA5A67D9F3853FF2E67A277F8E1FA.pub.asc
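
For anyone wanting to repeat the two checks mentioned in the thread (the `cp -R` copy-and-compare test and the `ugrep` scan for NUL bytes), here is a rough sketch in Python. It is only an illustration under stated assumptions: the function names `files_with_nul`, `differing_files` and `copy_and_compare` are hypothetical and not part of any ZFS, FreeBSD or poudriere tooling, and it copies with `shutil.copytree` rather than `cp -R`, so it may not exercise the same code path as the tests reported above.

```python
import filecmp
import os
import shutil
import tempfile


def files_with_nul(root):
    """Return sorted paths under root whose contents include a NUL byte.

    Reads each file fully into memory, which is fine for a sketch but
    not for very large files. Binary files legitimately contain NULs,
    so this is only meaningful on trees expected to hold text.
    """
    hits = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                if b"\x00" in fh.read():
                    hits.append(path)
    return sorted(hits)


def differing_files(src, dst):
    """Return sorted relative paths of files in src that are missing
    from dst or whose bytes differ (shallow=False compares contents)."""
    differing = []
    for dirpath, _dirs, names in os.walk(src):
        for name in names:
            a = os.path.join(dirpath, name)
            rel = os.path.relpath(a, src)
            b = os.path.join(dst, rel)
            if not os.path.exists(b) or not filecmp.cmp(a, b, shallow=False):
                differing.append(rel)
    return sorted(differing)


def copy_and_compare(src, dst):
    """Copy src to dst (dst must not exist yet), then compare the trees."""
    shutil.copytree(src, dst)
    return differing_files(src, dst)
```

An empty list from `copy_and_compare` on a freshly copied tree means no differences were seen in that run; since the failures reported in the thread were described as random, a single clean run would not rule the problem out.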