Indexing a largish collection of mail and usenet messages?

Christophe Ollier c.ollier at
Tue Jan 2 02:34:32 PST 2007

John L wrote:

> I have a collection of archives of mailing list and news messages. The 
> largest collection is pretty big, about 150,000 messages which means 
> about 200 megabytes of text, shortly to be migrated to a FreeBSD 
> server.  The lists are all active so archives typically add a few 
> messages each day. I want to provide a full text search of each 
> archive.  What software should I use?  I have been using the sturdy but 
> ancient lqtext package. It's OK, but it has a few bugs I have yet to 
> pick and I'm wondering if something better is available.

You could have a look at Lucene, a text search engine library written in 
Java. I don't know lqtext, but Lucene seems to work in a similar way: one 
program builds and updates an index, and a second program queries that 
index.
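
To make the build/query split concrete, here is a minimal sketch in 
Python. It is NOT Lucene's API, just a toy inverted index using the 
standard library; the function names and the two sample messages are 
made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing side: map each word to the set of message ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Query side: return ids of messages containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample messages, keyed by a made-up message id.
docs = {
    "msg1": "Indexing mail archives on FreeBSD",
    "msg2": "Full text search for usenet archives",
}
index = build_index(docs)
hits = search(index, "archives freebsd")
```

A real engine adds ranking, stemming, and an on-disk format on top of 
this, but the division of labour is the same.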

It's "only" a library, so you have to write the interfaces yourself: 
indexing for you, querying for your users. There are numerous ports to other 
languages (C, Perl, Python, PHP (through ZendFramework) are in the ports 

> First, I am NOT, repeat NOT, asking about web spiders.  The messages are 
> directly available to indexing software as files on my server, so 
> there's no advantage to running them through Apache on the way to the 
> indexer. Also, the messages in the archive never change and I know what 
> files are new each day, so it would be pointless for a package to 
> re-spider the whole archive to look for the new messages.  I am not 
> unalterably opposed to something that spiders if it is otherwise 
> wonderful, but that approach hasn't been fruitful in the past.

Lucene can update an existing index with new documents.
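
As a rough illustration of what "update an existing index" means when 
you already know which files are new, here is a toy sketch in Python. It 
is an assumption-laden stand-in, not how Lucene stores its index: the 
JSON file format, the function names, and the message ids are all 
hypothetical.

```python
import json
import os
import re
import tempfile


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def load_index(path):
    """Load the word -> list of message ids mapping, or start empty."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def add_messages(index, new_docs):
    """Add only the new messages; existing postings are left untouched."""
    for doc_id, text in new_docs.items():
        for word in set(tokenize(text)):
            postings = index.setdefault(word, [])
            if doc_id not in postings:
                postings.append(doc_id)


def save_index(index, path):
    """Write the whole mapping back to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def update_index(path, new_docs):
    """Daily run: load the old index, add today's messages, save it again."""
    index = load_index(path)
    add_messages(index, new_docs)
    save_index(index, path)
    return index


# Two simulated daily runs against a throwaway index file.
index_path = os.path.join(tempfile.mkdtemp(), "archive-index.json")
update_index(index_path, {"day1-001": "lqtext has a few bugs"})
index = update_index(index_path, {"day2-001": "Lucene can update an index"})
```

Only the new messages are read on each run; the archive itself is never 
rescanned. Rewriting the whole JSON file each time would not scale to a 
large archive, which is one of the problems a real index format solves.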

> What I want ideally is something that knows enough about the structure 
> of mail messages to deal intelligently with headers vs. body, that can 
> do something reasonable with MIME and HTML bodies (not urgent, I can 
> always run them through demime on the way to the index), and most 
> importantly that actually works with 150,000 messages.  I've seen lots 
> of packages that look promising but that fall over dead once they get 
> past 10,000 messages or so.

I don't think Lucene can do this out of the box, but you can attach 
arbitrary keyword fields (e.g. mail headers) to each indexed document.
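
For the header/body split itself, a small preprocessing step can turn 
each message into named fields before handing it to whatever indexer 
you choose. The sketch below uses Python's standard `email` package; the 
choice of fields, the function name, and the sample message are 
assumptions for illustration, and nothing here is Lucene-specific.

```python
from email import message_from_string
from email import policy

# Hypothetical sample message; the address and text are placeholders.
RAW = """\
From: Example Sender <sender@example.invalid>
Subject: Test message
Content-Type: text/plain; charset="utf-8"

What software should I use?
"""


def message_to_fields(raw):
    """Split a raw message into header fields and a text body."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer a text/plain part, fall back to text/html for MIME messages.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "from": str(msg["From"] or ""),
        "subject": str(msg["Subject"] or ""),
        "body": body,
    }


fields = message_to_fields(RAW)
```

Each key of the resulting dictionary can then be indexed as its own 
field. An HTML body comes back with its tags intact, so it would still 
need stripping before indexing.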

As for performance, I'm personally satisfied. I use the PHP port with 
20k documents: the full index takes about an hour to build, and queries 
take about 100 to 1000 ms. Lucene seems suited to millions of documents.

> [...]


More information about the freebsd-questions mailing list