svn commit: r301461 - in head/lib/libc: gen locale regex

Sun Jun 5 19:49:13 UTC 2016

On 05.06.2016 22:12, Pedro F. Giffuni wrote:
> --- head/lib/libc/regex/regcomp.c	Sun Jun  5 18:16:33 2016	(r301460)
> +++ head/lib/libc/regex/regcomp.c	Sun Jun  5 19:12:52 2016	(r301461)
> @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs)
>  				(void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>  				CHaddrange(p, cs, start, finish);
>  			} else {
> -				(void)REQUIRE(__collate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
> +				(void)REQUIRE(__wcollate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>  				for (i = 0; i <= UCHAR_MAX; i++) {
> -					if (   __collate_range_cmp(table, start, i) <= 0
> -					    && __collate_range_cmp(table, i, finish) <= 0
> +					if (   __wcollate_range_cmp(table, start, i) <= 0
> +					    && __wcollate_range_cmp(table, i, finish) <= 0
>  					   )
>  						CHadd(p, cs, i);
>  				}
> 

As I already mention in PR, we have broken regcomp after someone adds
wchar_t support there. Now regcomp ranges works only for the first 256
wchars of the current locale, notice that loop upper limit:
for (i = 0; i <= UCHAR_MAX; i++) {
In general, ranges are either broken in regcomp now or are memory
eating. We have bitmask only for the first 256 wchars, all other added
to the range literally. Imagine what happens if someone specify full
Unicode range in regexp.

Proper fix will be adding bitmask for the whole Unicode range, and even
in that case regcomp attempting to use collation in ranges will be
_very_slow_ since needs to check all Unicode chars in its
for (i = 0; i <= Max_Unicode_wchar; i++) {
loop.

Better stop pretending that we are able to do collation support in the
ranges, since POSIX cares about its own locale only here:
"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched."

Until whole Unicode range bitmask will be implemented (if ever), better
stop pretending to honor collation order, we just can't do it with
wchars now and do what NetBSD/OpenBSD does (using wchar_t) instead. It
does not prevent memory eating on big ranges (bitmask is needed, see
above), but at least fix the thing that only first 256 wchars are
considered.