why GNU grep is fast

Mon Aug 23 07:56:33 UTC 2010

"Sean C. Farley" <scf at FreeBSD.org> writes:
> Dag-Erling Smørgrav <des at des.no> writes:
> > Aho-Corasick is not really a search algorithm, but an algorithm for
> > constructing a table-driven finite state machine that will match
> > either of the search strings you fed it.  I believe it is less
> > efficient than Boyer-Moore for small numbers of search terms, since
> > it scans the entire input.  I don't see the point in using it in
> > grep, because grep already has an algorithm for constructing finite
> > state machines: regcomp(3).
> especially those that could be useful for fgrep functionality

Not entirely sure what you mean by that, but in most cases, I think
Boyer-Moore is a better fit for fgrep.  For large numbers of search
terms, I might use Aho-Corasick...  if it weren't for the fact that, as
mentioned, grep already has a regexp engine, which is a superset of
Aho-Corasick.  It might be a tad slower at building the FSA, because
it's more general, and the FSA might occupy a tad more memory, but the
complexity is *exactly* the same, and adding Aho-Corasick just for fgrep
is, for lack of a better word, pedantry.  For all you know, the
increased code size could very well offset whatever performance
advantage Aho-Corasick might offer.

> Glimpse may be a POS; I have not used it personally.  I only noted its
> algorithm for possible use within fgrep.

Apples and oranges.  The problem glimpse tries to solve (fixed corpus,
variable search terms) is precisely the opposite of the one fgrep tries
to solve (fixed search terms, variable corpus).

Glimpse does include grep-like functionality, but in the form of agrep,
which is designed for approximate matching.

DES
-- 
Dag-Erling Smørgrav - des at des.no