regex status report #6

Sun Jul 3 16:33:00 UTC 2011

Hi,

although the optimization code does not work yet, I think I've made good 
progress this week. As I said before, I took the quick search algorithm 
code from BSD grep and started to refactor it for TRE. It became much 
more complex because of the more general domain. It has to support more 
POSIX regex features and support single-byte, multi-byte and wide 
characters. When wide character support is enabled, the alphabet is much 
bigger than for single bytes so the bad character shift table cannot be 
an array any more because it may take 4 bytes x 1 bytes at least. 
Fortunately, there are only as many distinct values as the distinct 
characters in the pattern and this is not so many, so I wrote a 
hashtable to store and quickly look up these when running the quick 
search algorithm. I'll have to see how it performs with this overhead. 
Then the idea is to check Boyer-Moore algorithm and maybe 
Apostolico-Giancarlo that theoretically uses the less comparisons among 
all of the fixed string matching algorithms. This is quite an experiment 
and I don't know if this shortcut will practically beat the full regex 
engine for fixed string regexes because of the wchar-related overhead 
but still it will be used later as a heuristic to match more complex 
patterns and only apply the heavy algorithm on smaller contexts so this 
code will surely be useful.

To summarize, at the moment I'm close to finish the fixed string 
matching but some debugging is missing.

Gabor