[BRAINSTORMING] simplifying maintainer's life

Marcus von Appen mva at FreeBSD.org
Fri Oct 10 06:26:24 UTC 2014


On Thu, Sep 04, 2014, Marcus von Appen wrote:

>
> Matthew Seaman <matthew at freebsd.org>:
>
> > On 04/09/2014 07:00, Marcus von Appen wrote:
> >>> - I often grep all plists to find which port could possibly provide such
> >>> >   header or such library (among non-installed ports, of course).
> >
> >> I do the same, but would argue that such a query service should belong to
> >> or be offered by a pkg search (as a sort of counterpart to pkg which).
> >
> > We've toyed with that idea -- allowing 'pkg search' or similar to search
> > on any file in any package known in the repositories.  The biggest
> > problem is that including all that data in the package catalogues would
> > bloat their size by a very large amount.
> >
> > Rather than bloating the catalogues for any use, there was a separate
> > index of files.  Not sure whether that's being routinely built on the
> > FreeBSD pkg cluster at the moment  -- probably not, as it was only ever
> > experimental, and didn't have any generally available consumers.
>
> I did not mean it to be available offline, since it becomes outdated too fast.
>
> > In many ways, I'd prefer to have this sort of functionality available as
> > a web-app, thus saving users the necessity of downloading megabytes of
> > data about ports / packages they would never use or care about.  Needs
> > someone to step up and write that application though.
>
> Not necessarily a web app, but a (web) service that's e.g. run somewhere on a
> pkg builder or proxy and which can be queried by tools as well as web
> services.
>

I gave the pkg repo output a quick shot with about 370 random packages, which
produces a filesite.yaml of roughly 4.5 MB. Reading the file and transforming
it into a tree to get access to the file entries resulted in about 80,000
nodes (a compressed variant, in which nodes with only a single child are
merged, contains about 78,000 nodes).
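
For illustration, a minimal Python sketch of that tree building step. The
entry layout ({'name': ..., 'files': [...]}) and reading the file as a YAML
document stream are assumptions about filesite.yaml's format, not necessarily
what pkg actually emits:

import yaml

def build_tree(entries):
    """Hang every file path into a tree of path components."""
    root = {}
    for entry in entries:
        for path in entry.get("files", []):
            node = root
            for part in path.strip("/").split("/"):
                node = node.setdefault(part, {})
            node.setdefault("\0pkgs", []).append(entry["name"])
    return root

def compress(node):
    """Merge chains of nodes that have exactly one child."""
    for key in list(node):
        if key == "\0pkgs":
            continue
        child = node[key]
        compress(child)
        child_keys = [k for k in child if k != "\0pkgs"]
        if len(child_keys) == 1 and "\0pkgs" not in child:
            node[key + "/" + child_keys[0]] = child[child_keys[0]]
            del node[key]
    return node

with open("filesite.yaml") as fh:
    tree = compress(build_tree(yaml.safe_load_all(fh)))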

370 packages are about 1 to 1.5 percent of the number of packages we
currently have. Assuming linear growth, filesite.yaml would be around 450 MB
in size, and a node tree for searching would contain around 8 million nodes.
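
Spelling out the extrapolation (plain linear scaling of the sample, using the
1 and 1.5 percent shares from above):

sample_bytes = 4.5 * 1024 * 1024   # filesite.yaml of the 370-package sample
sample_nodes = 80000               # uncompressed tree nodes

for share in (0.01, 0.015):        # sample as 1% resp. 1.5% of all packages
    factor = 1.0 / share
    print("%.1f%%: %.0f MB, %.1f million nodes"
          % (share * 100,
             sample_bytes * factor / (1024 * 1024),
             sample_nodes * factor / 1e6))
# -> roughly 300-450 MB of YAML and 5.3-8 million tree nodes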

I doubt that a reasonably fast search service could be implemented on top of
filesite.yaml alone. Storing everything in memory is not an option, since the
index tree alone would consume far more than 300 MB (assuming an optimal word
size of 20 bytes plus a bit of node and tree payload). A fragment-based search
over the file would cause heavy disk I/O, and no matter how many threads
perform the search, disk I/O will eventually become the bottleneck.
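
The back-of-envelope memory math behind that estimate, with the per-node
payload being my own rough assumption:

nodes = 8000000
key_bytes = 20       # the optimal word size assumed above
payload_bytes = 24   # assumed: a couple of pointers/offsets per node

print("%.0f MB" % (nodes * (key_bytes + payload_bytes) / (1024 * 1024)))
# -> ~336 MB already, and real node overhead is likely higher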

Searching the tree would be horribly slow, since the traversal would need
either additional information in the structure (to avoid a complete BFS/DFS
for file name fragments) or subtrees and helper structures kept in memory,
which easily inflates the minimum memory footprint before a single query has
been executed.
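
To illustrate the problem: without extra per-node information, a lookup for a
file name fragment degenerates into a scan of every node, since the fragment
can start anywhere in any path. A naive sketch against the dict-based tree
from the first example:

def search_fragment(node, fragment, prefix=""):
    """Complete DFS: every node has to be visited for a fragment query."""
    hits = []
    for key, child in node.items():
        if key == "\0pkgs":
            continue
        path = prefix + "/" + key
        if fragment in path:
            hits.extend(child.get("\0pkgs", []))
        hits.extend(search_fragment(child, fragment, path))
    return hits

# search_fragment(tree, "png.h") touches all of the ~8 million nodes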

With this amount of information, tries, DAWGs, or generic DAGs would quickly
hit their limits, and one would need to set up an incremental search based on
separate indices, which effectively leads to a document index and search
implementation.
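
A minimal sketch of what such a separate index could look like: an inverted
index from path components to packages, so a query only touches the postings
for its own tokens instead of the whole tree. Tokenizing by path component is
my own simplification; fragment queries would additionally need n-grams or
similar:

from collections import defaultdict

def build_index(entries):
    """Map every path component of every file to the owning packages."""
    index = defaultdict(set)
    for entry in entries:
        for path in entry.get("files", []):
            for component in path.strip("/").split("/"):
                index[component].add(entry["name"])
    return index

def lookup(index, component):
    return sorted(index.get(component, ()))

# lookup(index, "png.h") -> the packages installing a file of that name,
# without any tree traversal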

My guess is that a web service for searching the catalogue would be easier to
implement on top of a full-text search engine such as Solr or Lucene, since
each particular entry within filesite.yaml is a specific and very small
document.
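
To sketch the idea: each (package, file) pair from filesite.yaml becomes one
tiny document pushed into the engine. The core name, field names and URL below
are purely illustrative assumptions, not an existing setup:

import json
import urllib.request

SOLR_UPDATE = "http://localhost:8983/solr/pkgfiles/update?commit=true"

docs = [
    {"id": "somepkg-1.0:/usr/local/include/foo.h",
     "pkg": "somepkg-1.0",
     "path": "/usr/local/include/foo.h"},
]

req = urllib.request.Request(
    SOLR_UPDATE,
    data=json.dumps(docs).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)

# A query is then a plain HTTP GET against the select handler, e.g.
#   /solr/pkgfiles/select?q=path:*foo.h*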

Cheers
Marcus