why GNU grep is fast

Sun Aug 22 17:31:50 UTC 2010

On Aug 22, 2010, at 8:02 AM, Garrett Wollman wrote:
> In article <86k4nikglg.fsf at ds4.des.no> you write:
>> Mike Haertel <mike at ducky.net> writes:
>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks
>>> first for the final letter of the target string, and uses a lookup
>>> table to tell it how far ahead it can skip in the input whenever
>>> it finds a non-matching character.
>> 
>> Boyer-Moore is for fixed search strings.  I don't see how that
>> optimization can work with a regexp search unless the regexp is so
>> simple that you break it down into a small number of cases with known
>> length and final character.
> 
> The common case of regexps used in the "grep" utility (and, for
> obvious reasons, universal in the "fgrep" utility) is fixed-length
> search strings.  Even non-fixed-length regexps typically consist of
> one one or two variable-length parts.

This is an important point:  A good grep implementation will
use different strategies depending on the input regexp.
Fixed-string matching is a very important special case.

>  Matching a completely
> variable-length regexp is just hard, computationally, 

See Russ Cox' articles for why this is not true.  It does require
considerable sophistication to build an efficient DFA but the
actual matcher, once built, can run very fast indeed:

  http://swtch.com/~rsc/regexp/

Tim