PERFORCE change 176831 for review

Wed Apr 14 07:53:27 UTC 2010

On Tue, Apr 13, 2010 at 8:31 PM, Tim Kientzle <kientzle at freebsd.org> wrote:
>>> ... I would also put in a simple verification
>>> that the entry you found is a regular entry:
>>>  archive_entry_filetype(entry) == AE_IFREG
>>
>> Ok. What does a regular entry correspond to (I assume not a regular file)?
>
> Sorry, "regular entry" == "regular file" (or at least, what
> will become a regular file once you write it to disk, if you
> want to be really pedantic about it ;-).  Coincidentally,
> AE_IFREG == S_IFREG.  The separate AE_IFxxx constants
> are there mostly portability shims (e.g., Windows doesn't
> have block devices, so doesn't define S_IFBLK,
> but AE_IFBLK is always available).

Ok. I think I got everything you were trying to say before, but just
to be clear:

To prevent race conditions [with regular files only], I should use
open(2). For all other files I would use the standard extract method
(so I don't get too fancy?). It seems like the file creation times
should be sufficiently fast with other file types that this kind of
behavior of open(2), blah wouldn't be required.

>>>> + * NOTE: the exit code is 0 / 1 so that this can be fed directly into
>>>> exit
>>>> + * when doing piped tar commands for copying hierarchies *hint*,
>>>> *hint*.
>>>
>>> Why do you want to copy hierarchies?
>>> Seems a waste of disk bandwidth.  *hint* *hint* ;-)
>>
>> There's a fair amount of tar cpf - -C <dir_a> . | tar xpf - -C <dir_b>
>> in libinstall . This is being done to preserve file attributes,
>> fflags, modes, ownership, permissions, etc. ... Installing directly to
>> hierarchies works in theory as well, but it's
>> considerably more dangerous unless you do an information transfer
>> between the pkg contents and the install base, and unfortunately that
>> is tough to achieve with 100% clarity from what I've seen.
>
> That's exactly what I was getting to:  Someday, pkg_add
> should write the files directly to their final location
> and avoid the copy.  (You're exactly right that you
> don't want to build a "copy tree" facility.)
>
> Clearly, direct install is not something you want to
> tackle in this stage of the rewrite.  There's a lot of
> gains from exactly what you're doing.

Yeah, I've been thinking about what you said... and there's ultimately
nothing to be gained through a staged install (in fact there's a lot
lost). So unless two files overwrite one another (which suggest broken
or conflicting packages, or not so clever users), people shouldn't be
installing conflicting packages or partially installing multiple
versions of the same package unless they're using -f, and if they're
doing that, they're basically damning the consequences of doing so.

Until we have functional tests in place to actually test this though,
I need to stick with a conservative, functional analog to piped
tar(1)'ing though.

> But I really do believe that single-pass direct
> install is feasible and is eventually where we want
> to be.  The key insight for me was when Florent recently
> pointed out that you could just read the +CONTENTS,
> then do a verify pass, then extract everything.
> (Any conflict during the extraction pass
> would be a fatal error:  delete everything
> extracted so far and scream loudly.)

Agreed on the final point. I'm kind of interested though about the
first point -- the verify pass -- are you verifying that the contents
are sane on the disk or in the tarball? If so, why not just verify the
contents after the fact [with mtree -- it's has checksumming, mode and
permissions checking, you name it... everything minus some special
facilities like ACLs, fflags, etc -- I guess I'll need to add those],
then roll everything back if it isn't?

One the entire operation is atomic and exclusive -- only one package
operation is allowed to run at any given time on a particular package,
running into this problem with the packaging software will be a
non-issue, but instead a problem with an external source (user, etc),
or a bad package.

> Dealing with dependent packages is still
> a little tricky but I think it's all doable.

Yeah, there's a crapload of overlap between multiple packages, like
cad-salome, and with some packages where you're need to build up a
hierarchy that's common between several packages, it becomes a royal
pain in the ass, but it's required otherwise stuff won't get setup and
torn down properly.

So I'm guessing for right now we should just avoid squealing loudly
except with regular files; maybe whining a bit with non-regular files
would be a good idea though because it would force people to clean up
packages, and then things could be completely fixed later on so that
they're hard errors.

>> I saw that in the original comments and while I think that you have a
>> point, I still am leery about someone specifying a package with
>> contents that start with + because it immediately breaks the
>> assumption. ...
>
> Unless I'm mistaken, pkg_add has used "+*" to identify
> metadata files for a long time, so I think your own
> argument from compatibility might be against you here.
> (Though I suspect that you're right in the long term that
> we should be obeying the +CONTENTS declarations instead of
> using filename conventions.)

I still want to avoid this potential corner case. Users have an
incapability to surprise developers, and I don't want to have to deal
with this item in the future. A few less CPU cycles processing strings
is worth sacrificing over trying to be clever and producing a broken
program.

>> unpack_file, blah with the appropriate metadata filenames considering
>> that it's a lot quicker than having to go through the entire tarball
>> end to end (especially if it's a large tarfile like openoffice),
>> unless someone unfortunately put the files at the end of the tarball
>> (in that case they do need to modify how the package is created).
>
> As for the speed argument:  Do you really need to
> extract the metadata files before you extract
> everything else?  Those files are going to a
> different place, but that's not an issue with a
> libarchive-driven extraction.  At one point, I had
> convinced myself that you could do it all in one
> pass, just writing the files to different places
> depending on whether they were metadata files
> (however identified) or not.  I recall, for example,
> that the +MTREE file needs to be used only at the
> end of the extraction (to fix-up permissions).

Yes. pkg_info's primary uses (not -W) read the metadata if the
metadata isn't on the disk already in $PKGDBDIR; it's an optimization,
and I don't want to reduce performance in pkg_info.

Other than that, pkg_install really doesn't care one iota about those
files (minus +CONTENTS).

>> Thanks again for the comments :)... it would be helpful BTW if the
>> manpages linked together because archive_read(3) doesn't reference the
>> archive_read_disk(3) APIs, etc, so I kind of have to fish for the
>> right APIs to use, and I feel like I'm catching more boots than fish
>> sometimes... but it's a learning experience and I don't expect to get
>> it right the first time.
>
> Please let me know any doc shortcomings; the documentation
> could use a lot of work, I know.  I also have a growing
> Examples page on the libarchive Wiki and conversations
> like this help me a lot to better understand what needs
> to go there.

Will definitely help whenever I can :).

Also, getting back to a topic you mentioned earlier, the issue with
creating the decompressor multiple times can be alleviated like so:

Make the basic extract API follow a format like so:

struct package_archive;

struct package_archive {
    /*
     * File descriptor for the archive. Set to NULL if you don't want
to reuse the
     * descriptor, i.e. want to only use the initializer once.
     */
    int *pkg_fd;
    /*
     * Compare a package filename vs an expression.
     *
     * Could be a small wrapper above fnmatch (for simple glob expressions),
     * strncmp (for exact matches -- would be set to PATH_MAX),
glob(3), or it could
     * always return 0 (for match all). Would return 0 if the files
matched, non-zero
     * if they didn't match.
     *
     * This would help us avoid hardcoding and at the small cost of a stack
     * push, pop, the additional overhead would be fairly negligible I would
     * think, and could potentially be optimized out using a smart compiler