Regular expression compilation fail in current

Mark Millard marklmi at yahoo.com
Tue Apr 27 03:14:41 UTC 2021



On 2021-Apr-26, at 06:31, Fernando Apesteguía <fernape at freebsd.org> wrote:

> Hi there,
> 
> I'm working with this port PR
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255182
> 
> and the problem seems to boil down to a regular expression that does
> not compile on current but it does in 12.2.
> 
> The minimum repro is this one:
> 
> #include <regex.h>
> #include <stdio.h>
> 
> int
> main()
> {
>        regex_t regexp;
>        int ret = regcomp(&regexp, "\\s*", REG_EXTENDED | REG_ICASE |
> REG_NOSUB);

Here is my stab at notes for this . . .

It is not all that uncommon for an invalid input to be
silently accepted (and mishandled) by older code and
then rejected outright by later versions. I suspect that
is what is going on here. The details seem to be as
follows.

Using C++11's raw string literal notation to specify
string content, "\\s*" is:

R"%(\s*)%"

In other words, the content of the string is just:

\s*

(3 characters, plus a terminating '\0').
It is this latter string content that the regcomp
2nd parameter points to and that leads to the
error report.
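
As a rough illustration of the point above, here is a small
sketch that prints the bytes a C string literal actually puts
in memory. The helper name dump_pattern is hypothetical, made
up for this note; it is not part of any library.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the original report): print each
 * byte of a pattern so its in-memory content is unambiguous.
 * Returns the length, excluding the terminating '\0'.
 */
static size_t
dump_pattern(const char *pattern)
{
	size_t len = strlen(pattern);

	for (size_t i = 0; i < len; i++)
		printf("[%zu] '%c' (0x%02x)\n", i, pattern[i],
		    (unsigned char)pattern[i]);
	return len;
}
```

Calling dump_pattern("\\s*") should report 3 bytes (backslash,
s, asterisk), while dump_pattern("\\\\s*") should report 4.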

The "s" is not valid after the backslash for Basic
Regular Expressions or for Extended Regular Expressions.
( https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html )

REG_EESCAPE is described at:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html

as:

QUOTE
REG_EESCAPE
Trailing <backslash> character in pattern.
END QUOTE

In other words: an extra backslash not paired
with anything valid just after it, so it is
trailing whatever was before it.

If you meant the parameter received to point in
memory to:

\\s*

( 4 characters, plus a terminating '\0' after it,
a.k.a. R"%(\\s*)%" ) you likely want the C-string:

"\\\\s*"

as the argument, shown below:

regcomp(&regexp, "\\\\s*", REG_EXTENDED | REG_ICASE | REG_NOSUB)

If you meant some other character sequence in memory, I'd
have to know what it was to try to back-translate it to
C-source that would produce the correct content in the
memory pointed to.
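
For what it is worth, a sketch of how candidate patterns could
be checked, using the same flags as the repro and regerror()
to turn the return code into text. The helper name
check_pattern is hypothetical. The "[[:space:]]*" pattern in
the note below the code is only my guess at one possible
intent (POSIX whitespace); nothing in the report confirms it.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: compile a pattern with the flags from the
 * repro and report any failure through regerror().
 * Returns regcomp()'s result (0 on success).
 */
static int
check_pattern(const char *pattern)
{
	regex_t re;
	char msg[128];
	int ret;

	ret = regcomp(&re, pattern,
	    REG_EXTENDED | REG_ICASE | REG_NOSUB);
	if (ret != 0) {
		/* Translate the numeric code into readable text. */
		regerror(ret, &re, msg, sizeof(msg));
		fprintf(stderr, "regcomp(\"%s\") failed: %d (%s)\n",
		    pattern, ret, msg);
		return ret;
	}
	regfree(&re);
	return 0;
}
```

check_pattern("\\\\s*") compiles a literal backslash followed
by zero or more "s" characters. If whitespace was what was
wanted, check_pattern("[[:space:]]*") uses the POSIX character
class instead. The result for check_pattern("\\s*") depends on
the implementation, which is the difference reported between
12.2 and current.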

>        if ( ret != 0) {
>                printf("regexp compilation failed: %d\n", ret);
>        }
> 
>        return 0;
> }
> 
> This one works in 12.2

It might not be rejected, but what does it do? And is that
conformant with:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

?

> but fails to compile the regexp in FreeBSD
> 14.0-CURRENT #11 main-n245984-15221c552b3c with error 5 REG_EESCAPE
> `\' applied to unescapable character.
> 
> Any help is appreciated.

Note: While I used C++11's notation as one way of
indicating string content, no C standard has the
notation to my knowledge.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)


