[RFC] Replacing our regex implementation

Mon May 9 00:47:24 UTC 2011

Hi Folks,

I've been given the opportunity to work in GSoC 2011 to replace our base 
regex library with a more modern one and given that the regex code is 
something essential probably there are lots of interested parties so I 
decided to open a thread here about my plans and the approach that I 
propose. My wiki page is here, you can also take a look at it:
http://wiki.freebsd.org/G%C3%A1borSoC2011

So let's see what this project is about... First some background 
information.

Why to replace?
- it is not efficient
- it does not support wide characters
- it comes from old and unmaintained vendor code

What to use instead?
There are some free implementations around but not all of the suit our 
needs.
- PCRE is not POSIX regex, it only has a POSIX interface but actually 
the regex syntax is not the same
- Oniguruma is slooooooooooooow
- The Plan9 implementation doesn't support full POSIX regexes
- TRE is BSD-licensed, fast, supports wchar, well-documented and still 
actively maintained, so it is the best candidate

Tasks to complete (in priority order)
1, Replace the libc code with TRE code. Should be quite straightforward 
given that TRE is quite mature and has good POSIX-compliance. At the 
moment I'm already working on this. So far, I've discovered two 
differences from our current implementation
- It accepts leaving at the minimum repetition count in curly brackets 
and inferres 0 instead of them. E.g. {,8} is equivalent to {0,8} and {,} 
is equivalent to {0,}. I think it won't be a problem because it is more 
permissive than our current implementation so the supported expressions 
are a superset of the current ones not a subset.
- It doesn't provide the REG_STARTEND macro, which is our non-POSIX 
extension. Still, it is useful and easy to implement so it is not a 
problem either.
Now I'm working on this little feature and on building a libc with TRE. 
After that I'll publish a patch for testing and will also ask portmgr to 
run it on the cluster.

2, Optimizations for matching with a fixed pattern heuristic
Fixed string matching is algorithmically much cheaper than regex. Some 
time ago the GNU grep author was so kind to me that he provided lots of 
valuable comments while GNU grep is fast and one of those was this 
heuristical matching. You can see that mail here: 
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019358.html
Although TRE has some alternative interfaces to match with an explicitly 
specified length instead of using NUL-terminated strings, the 
performance has to be measured and if it's not efficient enough, I plan 
to implement such optimizations. I don't want to put them in BSD grep 
directly because I find such optiizations valuable for other regex 
users. First, I was thinking of putting it into TRE but now I consider a 
better solution building a small library, libregexutils or such. It 
would decouple this optimization from the vendor code, making it easier 
to deal with future updates and in case for some reason in the far 
future we decide again to replace our regex library, this would still 
work on top of another one.

3, Adding support for GNU-specific permissive regex syntax
GNU grep accepts regexes that are invalid in POSIX, like [a|]. This is 
necessary for grep and diff in the base system. If we don't have them we 
can never trow out the GNU regex implementation. However, we should not 
make them default, so I'm thinking of implementing them somewhere else, 
e.g. in the previously mentioned library.

So far, these are the plans, please comments if you have something in 
your mind about it.

Gabor