grep and Regular expression correctness

Mon Aug 23 13:42:59 UTC 2010

 On 23/08/2010 13:45, Bob Bishop wrote:
> On 23 Aug 2010, at 12:51, chrisk-freebsd at list.mightyreason.com wrote:
>> [...]The hardest part of POSIX regular expressions is in picking out the
>> correct captured subexpressions.  This makes programs like "sed"
>> vulnerable to the bad regex.h engine that comes with the operating system.
>>
>> Luckily grep does not need the captured subexpressions, and thus does
>> not need the complexity that comes from the ideas behind TRE. [etc]
> I think grep does potentially need the captured subexpressions, for eg:
>
> \([abc]\)99\1
>
> matching eg b99b

That would be "basic POSIX" instead of "extended POSIX" regular
expressions.  Making basic regular expressions correct is something I
have never attempted.

Correctness is what I would like to emphasize.  GNU grep defines the
regex.h header for POSIX regular expressions but does not try to deliver
the correct POSIX answer.  Getting the wrong answer quickly is not a
virtue as debugging prematurely optimized code is too hard.

http://www2.research.att.com/~gsf/testregex/re-interpretation.html has
the best explanation of basic and extended POSIX regular expressions. 
And last I checked the AT&T code was nearly correct but exponentially slow.

I hope that if "grep" in *BSD and thus OSX gets worked on that it gets
more correct, not less.

I hope that someday the regex.h engine in *BSD and thus OSX gets fixed,
but I am not holding my breath for that.

All I have written is a correct (and efficient) "extended POSIX"
matching engine in Haskell and OCaml.

Cheers,
  Chris Kuklewicz