why GNU grep is fast

Sun Aug 22 17:47:46 UTC 2010

On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
> On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
>> Mike Haertel <mike at ducky.net> writes:
>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks first for the final letter of the target string, and uses a lookup table to tell it how far ahead it can skip in the input whenever it finds a non-matching character.
>> 
>> Boyer-Moore is for fixed search strings.  I don't see how that optimization can work with a regexp search unless the regexp is so simple that you break it down into a small number of cases with known length and final character.
> 
> When I was working on making FreeGrep faster (years ago), I wrote down a few notes about possible algorithms, especially those that could be useful for fgrep functionality.  I am just passing these onto the list.
> 
> Some algorithms:
> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
> 3. GNU fgrep:  Commentz-Walter
> 4. GLIMPSE:  http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
> 
> Also, this may be a useful book:
> http://www.dcc.uchile.cl/~gnavarro/FPMbook/

And of course, Russ Cox' excellent series of articles starting at:

  http://swtch.com/~rsc/regexp/regexp1.html

Later on, he summarizes some of the existing implementations,
including comments about the Plan 9 implementation and his own RE2, both
of which efficiently handle international text (which seems to
be a major concern of Gabor's).

The key comment in Mike's GNU grep notes is the one about not breaking
into lines.  That's simply double-scanning the input; instead, run
the matcher over blocks of text and, when it finds a match, work
backwards from the match to find the appropriate line beginning.
This is efficient because most lines don't match.

Boyer-Moore is great for fixed strings (a very common use case
for grep) and for more complex patterns that contain long fixed
strings (helps to discard most lines early).  Sophisticated regex
matchers implement a number of strategies and choose different
ones depending on the pattern.

In the case of bsdgrep, it might make sense to use the regex library
for the general case but implement a hand-tuned search for
fixed strings that can be heavily optimized for that case.
Of course, international text support complicates the picture;
you have to consider the input character set (if you want to auto-detect
Unicode encodings by looking for leading BOMs, for example, you
either need to translate the fixed-string pattern to match the input
encoding or vice-versa).

Tim