why GNU grep is fast
Dag-Erling Smørgrav
des at des.no
Sun Aug 22 19:53:40 UTC 2010
"Sean C. Farley" <scf at FreeBSD.org> writes:
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
Aho-Corasick is not really a search algorithm, but an algorithm for
constructing a table-driven finite state machine that will match either
of the search strings you fed it. I believe it is less efficient than
Boyer-Moore for small numbers of search terms, since it scans the entire
input. I don't see the point in using it in grep, because grep already
has an algorithm for constructing finite state machines: regcomp(3).
> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
It doesn't seem to compare favorably to the far older Aho-Corasick. It
uses slightly less memory, but memory is usually not an issue with grep.
> 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore
> variant)
Glimpse is a POS... and not really comparable, because grep is designed
to search for a single search string in multiple texts, while glimpse is
designed to search a large amount of text over and over with different
search strings. I believe it uses suffix tables to construct its index,
and Boyer-Moore only to locate specific matches, since the index lists
only files, not exact positions. For anything other than fixed strings,
it reverts to agrep, but I assume (I haven't looked at the code) that if
the regexp has one or more fixed components, it uses those to narrow the
search space before running agrep.
DES
--
Dag-Erling Smørgrav - des at des.no
More information about the freebsd-current
mailing list