+CONTENTS files

Mon Jul 2 15:10:10 UTC 2007

Alexander Leidinger wrote:
> Quoting Garrett Cooper <youshi10 at u.washington.edu> (from Mon, 02 Jul 
> 2007 00:55:25 -0700):
>
>> [LoN]Kamikaze wrote:
>>> Garrett Cooper wrote:
>>>
>>>> Pardon me for being naive, but wouldn't it be wiser for all of the data
>>>> in the +CONTENTS file to be aggregated into sections instead of having
>>>> line-by-line info?
>>>>
>>>> Example (net/samba_3.0.25a):
>>>>
>>>> @comment MD5:9e94560ac5e757d3bc5f922dcf3ab4fb
>>>> man/man1/log2pcap.1.gz
>>>> [~100 lines of repetitive data...]
>>>> @comment MD5:9f5fc8df2a1383a175e165ef2e0b10cc
>>>> man/man8/vfs_notify_fam.8.gz
>>>>
>>>>   Could be aggregated into:
>>>>
>>>> @MD5
>>>> 9e94560ac5e757d3bc5f922dcf3ab4fb man/man1/log2pcap.1.gz
>>>> c58f068d603a12d4af867c15cf77e636 man/man1/nmblookup.1.gz
>>>> [etc..]
>>>> @end MD5
>>>>
>>>>   or something similar to XML.
>>>>
>>>>   This would reduce the filesize from n bytes to n - (9 + 4 -1) *
>>>> i_entries + 8. In larger package files this would reduce the amount of
>>>> data parsing by a long shot. Also, more powerful scripting languages
>>>> like Perl, Python, or smart parsers in C could make short work of this
>>>> data and just extract the MD5 elements for comparison.
>>>>
>>>>   Also, by doing a little extra work when creating packages by
>>>> organizing all the sections together, I think that the file size could
>>>> be reduced by a large degree.
>>>>
>>>>   Fields similar to @comment MD5 could be reduced as well, I believe,
>>>> though perhaps with less benefit, apart from the @unexec rmdir lines
>>>> and the like.
>>>>
>>>>   Either that, or the data should be organized into separate files, I
>>>> think (this increases the number of files, but reduces overall
>>>> processing time, IMO).
>
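A minimal Python sketch of the aggregation proposed above, measuring the
byte difference directly instead of relying on a closed-form estimate. It
assumes, as in the quoted example, that each "@comment MD5:" line is paired
with the path on the line that follows it; the names aggregate_md5 and
size_in_bytes are hypothetical, and the sketch moves all checksums into one
trailing section, so it ignores any ordering constraints in the file.

```python
MD5_PREFIX = "@comment MD5:"


def aggregate_md5(lines):
    """Collect checksum/path pairs into one '@MD5 ... @end MD5' section.

    Assumption (from the example in this thread): a '@comment MD5:' line
    is paired with the path on the next line. Everything else is kept
    verbatim, ahead of the new section.
    """
    pairs, rest = [], []
    pending = None  # checksum still waiting for its path
    for line in lines:
        if pending is not None and line and not line.startswith("@"):
            pairs.append((pending, line))
            pending = None
            continue
        if pending is not None:
            # Checksum not followed by a path: keep it unchanged.
            rest.append(MD5_PREFIX + pending)
            pending = None
        if line.startswith(MD5_PREFIX):
            pending = line[len(MD5_PREFIX):]
        else:
            rest.append(line)
    if pending is not None:
        rest.append(MD5_PREFIX + pending)

    if not pairs:
        return rest
    section = ["@MD5"]
    section.extend(f"{md5} {path}" for md5, path in pairs)
    section.append("@end MD5")
    return rest + section


def size_in_bytes(lines):
    """Size of the lines when written out with one newline each (ASCII)."""
    return sum(len(line) + 1 for line in lines)
```

Calling size_in_bytes on the list before and after aggregate_md5 gives the
saving for a particular file; for only a few entries the "@MD5" and
"@end MD5" lines eat most of it.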
>>> In some cases the order of the stored data is important, and thus it
>>> cannot be separated into sections. Also, this layout allows for very
>>> simple parsing with the usual UNIX tools (sed, cut, awk, perl, simply
>>> everything). XML, by contrast, is rather complex and thus does not
>>> belong in base, in my opinion.
>
> We have libbsdxml in base already (an old version of the one in
> ports).
Ok.
>>    I didn't say XML exactly. I said XML-like, with implied end and begin
>> tags, but keeping with the Makefile-like syntax of @MD5 ... @end MD5,
>> or something similar.
>
> The problem is that a change would break existing installations, as
> they cannot cope with such a new format. Feel free to propose
> improvements, but you need to keep in mind that any supported
> FreeBSD release has to be able to install packages with only the
> package tools available in the base system.

The point, though, is that there's a lot of unnecessary bloat, which makes
the text files larger and thus slows down even the smarter parsers written
in C, Perl, or Python.

>>    My point is that the +CONTENTS file is bloated a lot by
>> useless lines, and it would help speed up package processing if it were
>> clipped or reduced somehow, I would think.
>
> You need to provide numbers. Without them this is pure speculation.
>
> And you have to explain why the current parsing routines cannot be
> sped up for the current format; maybe the implementation is just a
> little bit outdated compared to today's parsing knowledge...
>
> Bye,
> Alexander.
>

    Ok. I'll take your challenge and will have preliminary results in 2-3
days. Are Excel-formatted spreadsheets ok (thinking graphs)?
Thanks,
-Garrett
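
One way the numbers requested above could be gathered, as a rough Python
sketch only. It uses synthetic entries (not real package data) and two toy
parsers written for this comparison; it says nothing about how the actual
package tools parse +CONTENTS, and every function name here is hypothetical.

```python
import timeit

MD5_PREFIX = "@comment MD5:"


def make_entries(count):
    """Synthetic (md5, path) pairs; placeholders, not real package data."""
    return [(f"{i:032x}", f"man/man1/page{i}.1.gz") for i in range(count)]


def as_line_by_line(entries):
    """Render entries as '@comment MD5:' line plus path line (thread example)."""
    lines = []
    for md5, path in entries:
        lines.append(MD5_PREFIX + md5)
        lines.append(path)
    return "\n".join(lines) + "\n"


def as_section(entries):
    """Render entries in the proposed '@MD5 ... @end MD5' layout."""
    body = "\n".join(f"{md5} {path}" for md5, path in entries)
    return f"@MD5\n{body}\n@end MD5\n"


def parse_line_by_line(text):
    """Map path -> md5, pairing each checksum with the following line."""
    result, pending = {}, None
    for line in text.splitlines():
        if line.startswith(MD5_PREFIX):
            pending = line[len(MD5_PREFIX):]
        elif pending is not None:
            result[line] = pending
            pending = None
    return result


def parse_section(text):
    """Map path -> md5 from the lines between '@MD5' and '@end MD5'."""
    result, inside = {}, False
    for line in text.splitlines():
        if line == "@MD5":
            inside = True
        elif line == "@end MD5":
            inside = False
        elif inside:
            md5, path = line.split(" ", 1)
            result[path] = md5
    return result


def compare(count=1000, repeat=20):
    """Return sizes and total parse times (seconds) for both layouts."""
    entries = make_entries(count)
    old, new = as_line_by_line(entries), as_section(entries)
    assert parse_line_by_line(old) == parse_section(new)
    return {
        "old_bytes": len(old),
        "new_bytes": len(new),
        "old_seconds": timeit.timeit(lambda: parse_line_by_line(old), number=repeat),
        "new_seconds": timeit.timeit(lambda: parse_section(new), number=repeat),
    }
```

Running compare() with different counts gives sizes and timings for these
toy parsers on one machine; real +CONTENTS files and the existing C parsing
code would have to be measured separately.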