What to learn from the BSD grep case [Was: why GNU grep is fast]
Gabor Kovesdan
gabor at FreeBSD.org
Mon Aug 23 15:04:12 UTC 2010
Hi all,
there are some consequences that we can see from the grep case. Here I'd
like to add a summary, which raises some questions. All comments are
welcome.
1, When grep entered -CURRENT and bugs were found I immediately got kind
bug reports and sharp criticism, as well. According to my understanding,
-CURRENT is for development and it's fine to expose new pieces of work
there but now I'm in doubt about that because of complaining people. On
the other hand, an earlier version of BSD grep has been in the ports
tree for a very long time and users reported some problems, which have
been fixed but still, there is a lot of bugs there which haven't been
reported that time. If users don't volunteer to test new pieces of code
on a volunteer basis, somehow we have to make them test it, so I think
committing BSD grep to -CURRENT was a good decision in the first round.
2, This issue also brought up some bottlenecks and potential
optimization points (like memchr() and mmap), which other softwre may
benefit from. This is another reason to let such pieces of work in. But
unfortunately, this means that noone profiled another utilities because
these bottlenecks remained undiscovered. Neither did I. It's a lesson
that we have to learn from this particular case.
3, Because of point 2, we need more content to developers-handbook to
help development with such ideas and best practices. It has been also
raised on another list that our end-user documentation isn't that shiny
and cool that it used to be and actually, developers-handbook has never
been "finished" to be more or less complete. If someone looks at it, it
looks like a sketch, not a book. I'll see if I can write a section on
profiling.
4, We really need a good regex library. From the comments, it seems
there's no such in the open source world. GNU libregex isn't efficient
because GNU grep uses those workarounds that Mike kindly pointed out.
Oniguruma was extremely slow when I checked it. PCRE supports Perl-style
syntax with a POSIX-like API but not POSIX regex. Google RE2 is the same
with additional egrep syntax but doesn't have support for standard POSIX
regexes. Plan 9 regex only supports egrep syntax. It seems that TRE is
the best choice. It is BSD-licensed, supports wchar and POSIX(ish)
regexes and it is quite fast. I don't know the theoretical background of
regex engines but I'm wondering if it's possible top provide an
alternative API with byte-counted buffers and use the heuristical
speedup with fixed string matching. As Mike pointed out the POSIX API is
quite limiting because it works on NUL-terminated strings and not on
byte-counted buffers, so we couldn't just do it with a POSIX-conformant
library but it would be nice if we could implement it in such a library
with an alternative interface.
Gabor
More information about the freebsd-current
mailing list