Second "RFC" on pkg-data idea for ports

Sat Apr 17 11:40:56 PDT 2004

On Mon, 12 Apr 2004, Garance A Drosihn wrote:

> Back in January I send out a long-ish email asking for feedback
> on some ideas I had for the ports-collection [...]
> The basic idea is to collapse many of the separate files for a
> port into a single pkg-data file.  The web pages explain why I
> think this might be worth doing.  Please check them out at:
> 
> http://people.freebsd.org/~gad/PkgData/

My reaction to this proposed change is:

  I suppose it depends on what problem you're really trying to solve.

You've mentioned (in a later message than this one) that you have some
ideas about future directions that could spring from this work, but that
they are not yet fully-formed enough to be written down.  While that's
fair enough, until that's done, it's really hard to weigh the tradeoffs
involved in doing all this (IMHO) extensive work.  But if that's the
case, then what you're trying to address is not just the inodes problem.

Lacking that, what we have is a proposal to address the inodes problem.

Assuming, for the sake of argument, that the number of inodes is
affecting a large-enough class of people, let's think about some
alternative ways to address that, that might involve less reworking
of the infrastructure.  (In the colorful folk saying, "don't raise
the bridge -- lower the water".)

1. (easy) If the distinfo lines were moved into the Makefiles, that
would result in a savings of 9568 files out of the 60075 files in 10149
ports, or about 16%.  (Note: I'm using the numbers from an old tree, but the
percentage has probably not changed significantly).

(Disclaimer: although I personally am not really fond of this solution due
to the repo-churn it would create, I know that other people are pushing
for this to be done).
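
The percentage quoted in item 1 can be checked with a few lines of
Python.  This is only a sanity check on the figures given above (which
come from an old tree); the variable names are made up for illustration.

```python
# Figures quoted in item 1 above (taken from an old ports tree).
distinfo_files = 9568    # distinfo files that would go away
total_files = 60075      # all files in the tree
total_ports = 10149      # all ports in the tree

# Share of the tree's files that are distinfo files.
savings_pct = 100.0 * distinfo_files / total_files

# Average number of files per port, for context.
files_per_port = total_files / total_ports

print(f"savings: {savings_pct:.1f}% of files")
print(f"average: {files_per_port:.1f} files per port")
```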

2. (intermediate) Let's change the way we think about patchfiles.
Instead of seeing them as a permanent part of the port, perhaps we
should instead be thinking about each one as a temporary measure until
we can get the original software's authors to incorporate them upstream.

Now, there's no question that working through each and every port,
sending email to its author(s) (if, indeed, the software is even still
being maintained), is nowhere near as exciting or fun as reworking
infrastructure :-) .  However, think of the benefits: getting the
patches incorporated upstream means less work for each individual port
maintainer during each port update; also, in many cases, the patches
will help out maintainers on other OSes (in particular, the other BSDs,
but the gcc3.3 patches and patches for 64-bit problems will also help
Debian and some of the other Linux distributions).  In this scenario,
everybody wins.

3. (advanced) Right now our default assumption is that to install any
ports, you have to install the entire ports collection.  This is true
whether you install ports via downloading and unzipping the tarball
from our main site, or use cvsup.  Perhaps it's time to reevaluate this
assumption.

Right now, some of our ports tools rely on having an up-to-date INDEX
file, and since it is updated much more rarely than ports are added,
moved, or deleted, that implies needing the ability to generate the
INDEX file locally -- and, due to the cross-dependencies between ports
categories, generating that file doesn't work if you don't have all
the Makefiles.  (There are exceptions to this: a few categories really
are 'leaf categories' but they're fewer in number than you might
suspect: most (but not all!) of the language ports; and, IIRC, astro,
benchmarks, biology, finance, mbone, picobsd, and maybe x11-themes).

3a. (hard) Figure out some way to do away with the INDEX file.  This
probably means creating some kind of Berkeley db-based solution (AFAIK
that's the only database included in the base system.)  As you are learning,
getting consensus on what type of technology to bring in is not so easy ...
that's why I list this as "hard".  Nevertheless, I think it would be an
interesting line of research, but at my current rate I will not be
getting around to it myself for months.

3b. (somewhat easier) Figure out ways to not have to have the entire
hierarchy loaded.  The way that has occurred to me to do this is to
figure out which ports in which categories require which other ports
from which other categories.  My first attempt to do this, which led to
the conclusions about "leaf categories" above, was just some sh scripts.
Although informative, it led me to the conclusion that the gain from
partitioning out the "easy cases" was on the order of 9% of the inodes.
I haven't pursued it further, because 9% didn't sound super-attractive
to me; but again, I do not see the inodes as quite so pressing a problem
in the first place, so maybe it's worth doing regardless ...
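
As a rough illustration of the approach in 3b, here is a minimal Python
sketch.  It is not the sh scripts mentioned above.  It assumes the
per-port dependencies have already been extracted into a mapping from
port origin ("category/portname") to the origins it requires, and it
assumes a "leaf category" means one that no port in any other category
depends on.  The function names and the sample ports are made up.

```python
from collections import defaultdict


def category_of(origin):
    """Return the category part of a 'category/portname' origin."""
    return origin.split("/", 1)[0]


def category_edges(deps):
    """Collapse port-level dependencies into category-level edges.

    deps maps a port origin to the origins it requires.  The result maps
    each category to the set of *other* categories it depends on.
    """
    edges = defaultdict(set)
    for origin, requires in deps.items():
        src = category_of(origin)
        edges[src]  # make sure every category appears as a key
        for dep in requires:
            dst = category_of(dep)
            edges[dst]
            if dst != src:
                edges[src].add(dst)
    return dict(edges)


def leaf_categories(edges):
    """Categories that no other category depends on (assumed meaning)."""
    needed = set()
    for targets in edges.values():
        needed |= targets
    return sorted(set(edges) - needed)


# Invented sample data, not taken from the real tree.
deps = {
    "astro/example-a": ["devel/example-lib"],
    "finance/example-b": ["devel/example-lib"],
    "devel/example-lib": ["lang/example-lang"],
    "lang/example-lang": [],
}

print(leaf_categories(category_edges(deps)))
```

With the invented data above, only astro and finance come out as leaves,
since devel and lang are each required by some other category.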

But the only way to get more than that 9% gain is to understand the
cross-category dependencies, to lead to possible further repartitioning
of the tree (really, only the filesystem is a tree; the dependencies are
a very messy graph).

(As an example, my other conclusion from that shell-script run was
"everything depends on devel, and devel depends on everything else".
Since devel has 1184 ports in it, it's difficult to attack the overall
problem without attacking devel ...)
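
One way to make the "devel depends on everything else" observation
concrete is to list the categories that both depend on devel and are
depended on by it, directly or indirectly.  The sketch below assumes a
category-level edge map (category -> set of categories it requires) is
already available; the function names are hypothetical and the sample
edges are invented, not measured from the tree.

```python
def reachable(edges, start):
    """Return every category reachable from start by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def entangled_with(edges, category):
    """Categories that category depends on and that depend back on it."""
    forward = reachable(edges, category)
    return sorted(
        c for c in forward
        if c != category and category in reachable(edges, c)
    )


# Invented category-level edges for illustration only.
edges = {
    "devel": {"lang", "textproc"},
    "lang": {"devel"},
    "textproc": {"devel"},
    "astro": {"devel"},
}

print(entangled_with(edges, "devel"))
```

In this invented example lang and textproc are entangled with devel,
while astro merely depends on it and could still be split off.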

I honestly don't think anyone in the FreeBSD project really has a handle
on what that dependency graph looks like.  And this is where I think your
desire to have someone work on the inodes problem, who doesn't have an
intricate knowledge of coding to the existing infrastructure, could be
invaluable.

There are various ports in the tree (graphics/graphviz,
graphics/meshviewer, graphics/vcg) that might be really useful to shed
some light on the data structures.  To my knowledge, no one has ever done
this for the FreeBSD ports, if, indeed, for any of the various
open-source OSes at all.  Since these things take data-file input,
they might not need heavy-duty programming experience to come up with
useful results.
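
For instance, a data file for graphviz can be generated with very
little code.  The sketch below assumes a category-level edge map like
the one discussed under 3b already exists; the function name, the graph
name, and the sample edges are all made up for illustration.  It writes
the graph in graphviz's DOT language.

```python
def to_dot(edges, name="ports_categories"):
    """Render a category -> required-categories map as a DOT digraph."""
    lines = [f"digraph {name} {{"]
    for src in sorted(edges):
        # Declare the node so isolated categories still show up.
        lines.append(f'    "{src}";')
        for dst in sorted(edges[src]):
            lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Invented sample edges, not measured from the real tree.
sample_edges = {
    "astro": {"devel"},
    "devel": {"lang"},
    "lang": set(),
    "x11-themes": {"devel"},
}

dot_text = to_dot(sample_edges)
print(dot_text)
```

Saving the output as, say, categories.dot and running
"dot -Tpng categories.dot -o categories.png" should then produce a
picture of the graph.  Names are quoted because category names such as
x11-themes contain a hyphen.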

So this is where I'd like to suggest that some work by a dedicated
volunteer could produce some immediate short-term results that would
help out the users: this would help us to define what the underlying
problem actually _is_ that we're trying to solve.

And that's always a good thing.

mcl