why GNU grep is fast
Sean C. Farley
scf at FreeBSD.org
Mon Aug 23 01:37:15 UTC 2010
On Sun, 22 Aug 2010, Tim Kientzle wrote:
> On Aug 22, 2010, at 9:30 AM, Sean C. Farley wrote:
>> On Sun, 22 Aug 2010, Dag-Erling Smørgrav wrote:
>>> Mike Haertel <mike at ducky.net> writes:
>>>> GNU grep uses the well-known Boyer-Moore algorithm, which looks
>>>> first for the final letter of the target string, and uses a lookup
>>>> table to tell it how far ahead it can skip in the input whenever it
>>>> finds a non-matching character.
>>>
>>> Boyer-Moore is for fixed search strings. I don't see how that
>>> optimization can work with a regexp search unless the regexp is so
>>> simple that you break it down into a small number of cases with
>>> known length and final character.
>>
>> When I was working on making FreeGrep faster (years ago), I wrote
>> down a few notes about possible algorithms, especially those that
>> could be useful for fgrep functionality. I am just passing these
>> onto the list.
>>
>> Some algorithms:
>> 1. http://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
>> 2. http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm
>> 3. GNU fgrep: Commentz-Walter
>> 4. GLIMPSE: http://webglimpse.net/pubs/TR94-17.pdf (Boyer-Moore variant)
>>
>> Also, this may be a useful book:
>> http://www.dcc.uchile.cl/~gnavarro/FPMbook/
>
> And of course, Russ Cox' excellent series of articles starting at:
>
> http://swtch.com/~rsc/regexp/regexp1.html
I saved that link from an E-mail earlier because it looked very
interesting.
> Later on, he summarizes some of the existing implementations,
> including comments about the Plan 9 implementation and his own RE2,
> both of which efficiently handle international text (which seems to be
> a major concern of Gabor's).
I believe Gabor is considering TRE for a good replacement regex library.
> The key comment in Mike's GNU grep notes is the one about not breaking
> into lines. That's simply double-scanning the input; instead, run the
> matcher over blocks of text and, when it finds a match, work backwards
> from the match to find the appropriate line beginning. This is
> efficient because most lines don't match.
I do like the idea.
> Boyer-Moore is great for fixed strings (a very common use case for
> grep) and for more complex patterns that contain long fixed strings
> (helps to discard most lines early). Sophisticated regex matchers
> implement a number of strategies and choose different ones depending
> on the pattern.
That is what fastgrep (in bsdgrep) attempts to accomplish with very
simply regex lines (beginning of line, end of line and dot).
> In the case of bsdgrep, it might make sense to use the regex library
> for the general case but implement a hand-tuned search for fixed
> strings that can be heavily optimized for that case. Of course,
> international text support complicates the picture; you have to
> consider the input character set (if you want to auto-detect Unicode
> encodings by looking for leading BOMs, for example, you either need to
> translate the fixed-string pattern to match the input encoding or
> vice-versa).
BTW, the fastgrep portion of bsdgrep is my fault/contribution to do a
faster search bypassing the regex library. :) It certainly was not
written with any encodings in mind; it was purely ASCII. As I have not
kept up with it, I do not know if anyone improved it or not.
Sean
--
scf at FreeBSD.org
More information about the freebsd-current
mailing list