why GNU grep is fast

Sun Aug 22 19:53:40 UTC 2010

"Sean C. Farley" <scf at FreeBSD.org> writes:
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm

Aho-Corasick is not really a search algorithm, but an algorithm for
constructing a table-driven finite state machine that will match either
of the search strings you fed it.  I believe it is less efficient than
Boyer-Moore for small numbers of search terms, since it scans the entire
input.  I don't see the point in using it in grep, because grep already
has an algorithm for constructing finite state machines: regcomp(3).

> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm

It doesn't seem to compare favorably to the far older Aho-Corasick.  It
uses slightly less memory, but memory is usually not an issue with grep.

> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore
> variant)

Glimpse is a POS...  and not really comparable, because grep is designed
to search for a single search string in multiple texts, while glimpse is
designed to search a large amount of text over and over with different
search strings.  I believe it uses suffix tables to construct its index,
and Boyer-Moore only to locate specific matches, since the index lists
only files, not exact positions.  For anything other than fixed strings,
it reverts to agrep, but I assume (I haven't looked at the code) that if
the regexp has one or more fixed components, it uses those to narrow the
search space before running agrep.

DES
-- 
Dag-Erling Smørgrav - des at des.no