youshi10 at u.washington.edu
Mon Jul 2 15:10:10 UTC 2007
Alexander Leidinger wrote:
> Quoting Garrett Cooper <youshi10 at u.washington.edu> (from Mon, 02 Jul
> 2007 00:55:25 -0700):
>> [LoN]Kamikaze wrote:
>>> Garrett Cooper wrote:
>>>> Pardon me for being naive, but wouldn't it be wiser for all of the data
>>>> in the +CONTENTS file to be aggregated into sections instead of having
>>>> line-by-line info?
>>>> Example (net/samba_3.0.25a):
>>>> @comment MD5:9e94560ac5e757d3bc5f922dcf3ab4fb
>>>> [~100 lines of repetitive data...]
>>>> @comment MD5:9f5fc8df2a1383a175e165ef2e0b10cc
>>>> Could be aggregated into:
>>>> @MD5
>>>> 9e94560ac5e757d3bc5f922dcf3ab4fb man/man1/log2pcap.1.gz
>>>> c58f068d603a12d4af867c15cf77e636 man/man1/nmblookup.1.gz
>>>> @end MD5
>>>> or something similar to XML.
>>>> This would reduce the filesize from n bytes to n - (9 + 4 -1) *
>>>> i_entries + 8. In larger package files this would reduce the amount of
>>>> data parsing by a long shot. Also, more powerful scripting languages
>>>> like Perl, Python, or smart parsers in C could make short work of this
>>>> data and just extract the MD5 elements for comparison.
>>>> Also, by doing a little extra work when creating packages by
>>>> organizing all the sections together, I think that the file size could
>>>> be reduced by a large degree.
>>>> Similar fields to @comment MD5 could be reduced I believe, but with
>>>> less benefit maybe, other than just the @unexec rmdir, etc lines.
>>>> Either that, or the data should be organized into separate files I
>>>> think (increases number of files, but reduces overall processing
>>>> time IMO).
>>> In some cases the order of the stored data is important, and thus it
>>> cannot be separated into sections. Also, this layout allows for very
>>> simple parsing with the usual UNIX tools (sed, cut, awk, perl, simply
>>> everything), unlike XML, which is rather complex and thus does not
>>> belong in base, in my opinion.
> We have libbsdxml in the base already (an old version of one in the
>> I didn't say XML exactly. I said XML-like, with implied end and begin
>> tags, but keeping with the Makefile-like syntax of @MD5 ... @end MD5,
>> or something similar.
> The problem is that a change would break existing installations, as
> they cannot cope with such a new format. Feel free to propose
> improvements, but you need to keep in mind that any supported
> FreeBSD release has to be able to install packages with only the
> package tools available in the base system.
The point, though, is that there's a lot of unnecessary bloat, which
makes the text files larger and thus slows down even smarter parsers
written in C, Perl, or Python.
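
As a rough illustration of the bloat in question, the Python sketch below
rewrites a per-line listing into the aggregated form and compares byte
counts. It assumes each file path is directly followed by its own
"@comment MD5:" line. The "@MD5 ... @end MD5" block is only the proposed
syntax, not something the existing package tools understand, and the names
aggregate_md5 and size are made up for this example.

```python
PREFIX = "@comment MD5:"

# Two entries taken from the samba example quoted earlier in this thread.
SAMPLE = [
    "man/man1/log2pcap.1.gz",
    "@comment MD5:9e94560ac5e757d3bc5f922dcf3ab4fb",
    "man/man1/nmblookup.1.gz",
    "@comment MD5:c58f068d603a12d4af867c15cf77e636",
]


def aggregate_md5(lines):
    """Fold path + '@comment MD5:' pairs into '@MD5 ... @end MD5' blocks.

    Assumption: a checksum line directly follows the path it belongs to.
    Any other '@' directive closes the current block and is passed through
    unchanged, so the relative order of directives is preserved.
    """
    out = []
    block = []   # pending "md5 path" entries for the current block
    path = None  # most recent path still waiting for its checksum

    def flush():
        if block:
            out.append("@MD5")
            out.extend(block)
            out.append("@end MD5")
            block.clear()

    for line in lines:
        if line.startswith(PREFIX) and path is not None:
            block.append(f"{line[len(PREFIX):]} {path}")
            path = None
            continue
        if path is not None:  # previous path had no checksum line
            flush()
            out.append(path)
            path = None
        if line.startswith("@"):
            flush()
            out.append(line)
        else:
            path = line
    if path is not None:
        flush()
        out.append(path)
    flush()
    return out


def size(lines):
    """Size in bytes when written one entry per line with '\\n' endings."""
    return sum(len(line.encode()) + 1 for line in lines)


if __name__ == "__main__":
    merged = aggregate_md5(SAMPLE)
    print("\n".join(merged))
    print(f"per-line: {size(SAMPLE)} bytes, aggregated: {size(merged)} bytes")
```

Running it over a real +CONTENTS file (read with str.splitlines()) would
give actual figures for a given package.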
>> My point is that the +CONTENTS file is bloated by a lot of useless
>> lines, and I would think it would help speed up package processing if
>> it were clipped or reduced somehow.
> You need to provide numbers. Without them this is pure speculation.
> And you have to explain why the current parsing routines cannot be
> sped up for the current format; maybe the implementation is just a
> little bit outdated compared to today's parsing knowledge...
OK. I'll take your challenge and will have preliminary results in 2-3
days. Are Excel-formatted spreadsheets OK (thinking graphs)?
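
One way such numbers could be gathered is sketched below in Python: it
builds the same synthetic listing in both layouts, checks that both parse
to the same path-to-checksum map, and times each parser with timeit. The
paths and checksums are generated stand-ins rather than real package data,
the "@MD5 ... @end MD5" layout is only the proposal from this thread, and
all function names are hypothetical. This measures a toy parser, not the
package tools themselves.

```python
import hashlib
import timeit


def make_entries(n):
    """Synthetic (path, md5) pairs standing in for a real package listing."""
    return [
        (f"share/doc/file{i}.txt", hashlib.md5(str(i).encode()).hexdigest())
        for i in range(n)
    ]


def per_line(entries):
    """Current layout: a path line followed by its '@comment MD5:' line."""
    return "".join(f"{p}\n@comment MD5:{m}\n" for p, m in entries)


def aggregated(entries):
    """Proposed layout: one '@MD5 ... @end MD5' block of 'md5 path' lines."""
    body = "".join(f"{m} {p}\n" for p, m in entries)
    return f"@MD5\n{body}@end MD5\n"


def parse_per_line(text):
    sums, path = {}, None
    for line in text.splitlines():
        if line.startswith("@comment MD5:"):
            sums[path] = line[len("@comment MD5:"):]
        elif not line.startswith("@"):
            path = line
    return sums


def parse_aggregated(text):
    sums, inside = {}, False
    for line in text.splitlines():
        if line == "@MD5":
            inside = True
        elif line == "@end MD5":
            inside = False
        elif inside:
            md5, path = line.split(" ", 1)
            sums[path] = md5
    return sums


if __name__ == "__main__":
    entries = make_entries(5000)
    old, new = per_line(entries), aggregated(entries)
    # Both layouts must carry the same information before timing them.
    assert parse_per_line(old) == parse_aggregated(new)
    for name, fn, text in (
        ("per-line", parse_per_line, old),
        ("aggregated", parse_aggregated, new),
    ):
        secs = timeit.timeit(lambda: fn(text), number=20)
        print(f"{name}: {len(text)} bytes, {secs:.3f} s for 20 parses")
```

The printed figures depend on the machine and the entry count, so they are
only meaningful as a relative comparison between the two layouts.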