Regular expression compilation fail in current

Fernando Apesteguía fernape at freebsd.org
Tue Apr 27 14:09:21 UTC 2021


On Tue, Apr 27, 2021 at 5:14 AM Mark Millard <marklmi at yahoo.com> wrote:
>
>
>
> On 2021-Apr-26, at 06:31, Fernando Apesteguía <fernape at freebsd.org> wrote:
>
> > Hi there,
> >
> > I'm working with this port PR
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255182
> >
> > and the problem seems to boil down to a regular expression that does
> > not compile on current but it does in 12.2.
> >
> > The minimum repro is this one:
> >
> > #include <regex.h>
> > #include <stdio.h>
> >
> > int
> > main()
> > {
> >        regex_t regexp;
> >        int ret = regcomp(&regexp, "\\s*", REG_EXTENDED | REG_ICASE |
> > REG_NOSUB);
>
> Here is my stab at notes for this . . .
>
> It is not all that uncommon for invalid input to be
> silently accepted at first and then rejected by later
> versions of the code. I suspect that is what is going
> on here. The details seem to be as follows.
>
> Using C++11's raw string literal notation to specify
> string content, "\\s*" is:
>
> R"%(\s*)%"
>
> In other words, the content of the string is just:
>
> \s*
>
> (3 characters, plus a terminating '\0').
> It is this latter string content that the regcomp
> 2nd parameter points to and that leads to the
> error report.
>
> The "s" is not valid after the backslash for Basic
> Regular Expressions or for Extended Regular Expressions.
> ( https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html )
>
> REG_EESCAPE is described at:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html
>
> as:
>
> QUOTE
> REG_EESCAPE
> Trailing <backslash> character in pattern.
> END QUOTE
>
> In other words: an extra backslash not paired
> with anything valid just after it, so it is
> trailing whatever was before it.
>
> If you meant the parameter received to point in
> memory to:
>
> \\s*
>
> ( 4 characters, plus a terminating '\0' after it,
> a.k.a. R"%(\\s*)%" ) you likely want the C-string:
>
> "\\\\s*"
>
> as the argument, shown below:
>
> regcomp(&regexp, "\\\\s*", REG_EXTENDED | REG_ICASE | REG_NOSUB)
>
> If you meant some other character sequence in memory, I'd
> have to know what it was to try to back-translate it to
> C-source that would produce the correct content in the
> memory pointed to.
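
To double-check the character counts described above, here is a minimal C sketch. The helper names pattern_length and show_pattern are hypothetical, made up for illustration; only strlen() and printf() are standard calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: number of characters in the in-memory string,
 * i.e. what regcomp() would see through its 2nd parameter.
 */
size_t
pattern_length(const char *pattern)
{
	assert(pattern != NULL);
	return strlen(pattern);
}

/*
 * Hypothetical helper: print the length of the in-memory string and
 * then each of its characters, one per quoted field.
 */
void
show_pattern(const char *pattern)
{
	printf("%zu chars:", pattern_length(pattern));
	for (const char *p = pattern; *p != '\0'; p++)
		printf(" '%c'", *p);
	printf("\n");
}
```

Calling show_pattern("\\s*") reports 3 characters (backslash, s, star), while show_pattern("\\\\s*") reports 4 (two backslashes, s, star).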
>
> >        if ( ret != 0) {
> >                printf("regexp compilation failed: %d\n", ret);
> >        }
> >
> >        return 0;
> > }
> >
> > This one works in 12.2
>
> It might not be rejected, but what does it do? And is that
> conformant with:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>
> ?
>
> > but fails to compile the regexp in FreeBSD
> > 14.0-CURRENT #11 main-n245984-15221c552b3c with error 5 REG_EESCAPE
> > `\' applied to unescapable character.
> >
> > Any help is appreciated.
>
> Note: While I used C++11's notation as one way of
> indicating string content, no C standard has the
> notation to my knowledge.

Thanks for the explanation, Mark.
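
If the pattern was meant to match optional whitespace (an assumption on my side, since "\s" has that meaning in Perl-style regexes but is not defined by POSIX), the bracket expression [[:space:]] is the portable ERE spelling. A minimal sketch follows; try_compile and matches are hypothetical helper names, while regcomp(), regexec(), regerror() and regfree() are the standard <regex.h> calls.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: compile `pattern` with the same flags as the
 * repro and print the regerror() text on failure.
 * Returns 0 on success, or the regcomp() error code.
 */
int
try_compile(const char *pattern)
{
	regex_t re;
	char msg[128];
	int ret;

	assert(pattern != NULL);
	ret = regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB);
	if (ret != 0) {
		regerror(ret, &re, msg, sizeof(msg));
		fprintf(stderr, "regcomp(\"%s\") failed: %s (%d)\n",
		    pattern, msg, ret);
		return ret;
	}
	regfree(&re);
	return 0;
}

/*
 * Hypothetical helper: 1 if `text` matches `pattern` (POSIX ERE),
 * 0 if it does not, -1 if the pattern does not compile.
 */
int
matches(const char *pattern, const char *text)
{
	regex_t re;
	int ret;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	ret = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return ret == 0;
}
```

With this, try_compile("[[:space:]]*") should return 0 on any POSIX-conforming libc, and the regerror() text makes failures such as the REG_EESCAPE one above easier to read than the bare error number.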

>
> ===
> Mark Millard
> marklmi at yahoo.com
> ( dsl-only.net went
> away in early 2018-Mar)
>

