regex status report #6
Gabor Kovesdan
gabor at FreeBSD.org
Sun Jul 3 16:33:00 UTC 2011
Hi,
although the optimization code does not work yet, I think I've made good
progress this week. As I said before, I took the quick search algorithm
code from BSD grep and started to refactor it for TRE. It became much
more complex because of the more general domain. It has to support more
POSIX regex features and support single-byte, multi-byte and wide
characters. When wide character support is enabled, the alphabet is much
bigger than for single bytes so the bad character shift table cannot be
an array any more because it may take 4 bytes x 1 bytes at least.
Fortunately, there are only as many distinct values as the distinct
characters in the pattern and this is not so many, so I wrote a
hashtable to store and quickly look up these when running the quick
search algorithm. I'll have to see how it performs with this overhead.
Then the idea is to check Boyer-Moore algorithm and maybe
Apostolico-Giancarlo that theoretically uses the less comparisons among
all of the fixed string matching algorithms. This is quite an experiment
and I don't know if this shortcut will practically beat the full regex
engine for fixed string regexes because of the wchar-related overhead
but still it will be used later as a heuristic to match more complex
patterns and only apply the heavy algorithm on smaller contexts so this
code will surely be useful.
To summarize, at the moment I'm close to finish the fixed string
matching but some debugging is missing.
Gabor
More information about the soc-status
mailing list