[RFC] Replacing our regex implementation
Gabor Kovesdan
gabor at kovesdan.org
Mon May 9 00:47:24 UTC 2011
Hi Folks,
I've been given the opportunity to work in GSoC 2011 to replace our base
regex library with a more modern one and given that the regex code is
something essential probably there are lots of interested parties so I
decided to open a thread here about my plans and the approach that I
propose. My wiki page is here, you can also take a look at it:
http://wiki.freebsd.org/G%C3%A1borSoC2011
So let's see what this project is about... First some background
information.
Why to replace?
- it is not efficient
- it does not support wide characters
- it comes from old and unmaintained vendor code
What to use instead?
There are some free implementations around but not all of the suit our
needs.
- PCRE is not POSIX regex, it only has a POSIX interface but actually
the regex syntax is not the same
- Oniguruma is slooooooooooooow
- The Plan9 implementation doesn't support full POSIX regexes
- TRE is BSD-licensed, fast, supports wchar, well-documented and still
actively maintained, so it is the best candidate
Tasks to complete (in priority order)
1, Replace the libc code with TRE code. Should be quite straightforward
given that TRE is quite mature and has good POSIX-compliance. At the
moment I'm already working on this. So far, I've discovered two
differences from our current implementation
- It accepts leaving at the minimum repetition count in curly brackets
and inferres 0 instead of them. E.g. {,8} is equivalent to {0,8} and {,}
is equivalent to {0,}. I think it won't be a problem because it is more
permissive than our current implementation so the supported expressions
are a superset of the current ones not a subset.
- It doesn't provide the REG_STARTEND macro, which is our non-POSIX
extension. Still, it is useful and easy to implement so it is not a
problem either.
Now I'm working on this little feature and on building a libc with TRE.
After that I'll publish a patch for testing and will also ask portmgr to
run it on the cluster.
2, Optimizations for matching with a fixed pattern heuristic
Fixed string matching is algorithmically much cheaper than regex. Some
time ago the GNU grep author was so kind to me that he provided lots of
valuable comments while GNU grep is fast and one of those was this
heuristical matching. You can see that mail here:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019358.html
Although TRE has some alternative interfaces to match with an explicitly
specified length instead of using NUL-terminated strings, the
performance has to be measured and if it's not efficient enough, I plan
to implement such optimizations. I don't want to put them in BSD grep
directly because I find such optiizations valuable for other regex
users. First, I was thinking of putting it into TRE but now I consider a
better solution building a small library, libregexutils or such. It
would decouple this optimization from the vendor code, making it easier
to deal with future updates and in case for some reason in the far
future we decide again to replace our regex library, this would still
work on top of another one.
3, Adding support for GNU-specific permissive regex syntax
GNU grep accepts regexes that are invalid in POSIX, like [a|]. This is
necessary for grep and diff in the base system. If we don't have them we
can never trow out the GNU regex implementation. However, we should not
make them default, so I'm thinking of implementing them somewhere else,
e.g. in the previously mentioned library.
So far, these are the plans, please comments if you have something in
your mind about it.
Gabor
More information about the freebsd-hackers
mailing list