What to learn from the BSD grep case [Was: why GNU grep is fast]

Mon Aug 23 15:04:12 UTC 2010

  Hi all,

there are some consequences that we can see from the grep case. Here I'd 
like to add a summary, which raises some questions. All comments are 
welcome.

1, When grep entered -CURRENT and bugs were found I immediately got kind 
bug reports and sharp criticism, as well. According to my understanding, 
-CURRENT is for development and it's fine to expose new pieces of work 
there but now I'm in doubt about that because of complaining people. On 
the other hand, an earlier version of BSD grep has been in the ports 
tree for a very long time and users reported some problems, which have 
been fixed but still, there is a lot of bugs there which haven't been 
reported that time. If users don't volunteer to test new pieces of code 
on a volunteer basis, somehow we have to make them test it, so I think 
committing BSD grep to -CURRENT was a good decision in the first round.

2, This issue also brought up some bottlenecks and potential 
optimization points (like memchr() and mmap), which other softwre may 
benefit from. This is another reason to let such pieces of work in. But 
unfortunately, this means that noone profiled another utilities because 
these bottlenecks remained undiscovered. Neither did I. It's a lesson 
that we have to learn from this particular case.

3, Because of point 2, we need more content to developers-handbook to 
help development with such ideas and best practices. It has been also 
raised on another list that our end-user documentation isn't that shiny 
and cool that it used to be and actually, developers-handbook has never 
been "finished" to be more or less complete. If someone looks at it, it 
looks like a sketch, not a book. I'll see if I can write a section on 
profiling.

4, We really need a good regex library. From the comments, it seems 
there's no such in the open source world. GNU libregex isn't efficient 
because GNU grep uses those workarounds that Mike kindly pointed out. 
Oniguruma was extremely slow when I checked it. PCRE supports Perl-style 
syntax with a POSIX-like API but not POSIX regex. Google RE2 is the same 
with additional egrep syntax but doesn't have support for standard POSIX 
regexes. Plan 9 regex only supports egrep syntax. It seems that TRE is 
the best choice. It is BSD-licensed, supports wchar and POSIX(ish) 
regexes and it is quite fast. I don't know the theoretical background of 
regex engines but I'm wondering if it's possible top provide an 
alternative API with byte-counted buffers and use the heuristical 
speedup with fixed string matching. As Mike pointed out the POSIX API is 
quite limiting because it works on NUL-terminated strings and not on 
byte-counted buffers, so we couldn't just do it with a POSIX-conformant 
library but it would be nice if we could implement it in such a library 
with an alternative interface.

Gabor