svn commit: r301461 - in head/lib/libc: gen locale regex

Pedro Giffuni pfg at FreeBSD.org
Mon Jun 6 13:43:27 UTC 2016



On 06/05/16 14:49, Andrey Chernov wrote:
> On 05.06.2016 22:12, Pedro F. Giffuni wrote:
>> --- head/lib/libc/regex/regcomp.c	Sun Jun  5 18:16:33 2016	(r301460)
>> +++ head/lib/libc/regex/regcomp.c	Sun Jun  5 19:12:52 2016	(r301461)
>> @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs)
>>  				(void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>>  				CHaddrange(p, cs, start, finish);
>>  			} else {
>> -				(void)REQUIRE(__collate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>> +				(void)REQUIRE(__wcollate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>>  				for (i = 0; i <= UCHAR_MAX; i++) {
>> -					if (   __collate_range_cmp(table, start, i) <= 0
>> -					    && __collate_range_cmp(table, i, finish) <= 0
>> +					if (   __wcollate_range_cmp(table, start, i) <= 0
>> +					    && __wcollate_range_cmp(table, i, finish) <= 0
>>  					   )
>>  						CHadd(p, cs, i);
>>  				}
>>
>
> As I already mentioned in the PR, regcomp has been broken since someone
> added wchar_t support there. regcomp ranges now work only for the first
> 256 wchars of the current locale; notice the loop's upper limit:
> for (i = 0; i <= UCHAR_MAX; i++) {
> In general, ranges in regcomp are now either broken or memory
> hungry. We have a bitmask only for the first 256 wchars; all others are
> added to the range literally. Imagine what happens if someone specifies
> the full Unicode range in a regexp.
>
> The proper fix would be to add a bitmask for the whole Unicode range, and
> even in that case regcomp attempting to use collation in ranges would be
> _very_slow_, since it needs to check all Unicode chars in its
> for (i = 0; i <= Max_Unicode_wchar; i++) {
> loop.
>
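
To make the cost of that loop concrete, here is a minimal sketch of a
collation-ordered range expansion. It is not the libc code: sk_range_cmp()
is a hypothetical stand-in for __wcollate_range_cmp() that assumes a plain
wcscoll() on one-character strings, and sk_count_in_range() and its
"limit" parameter are made up for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for __wcollate_range_cmp(): compare two wide
 * characters by collating them as one-character strings in the current
 * locale.
 */
static int
sk_range_cmp(wchar_t c1, wchar_t c2)
{
	wchar_t s1[2] = { c1, L'\0' };
	wchar_t s2[2] = { c2, L'\0' };

	return (wcscoll(s1, s2));
}

/*
 * Count the code points in [0, limit] that collate between start and
 * finish, inclusive. Every candidate has to be tested individually,
 * so the work grows with limit, not with the size of the range.
 * limit must be smaller than WCHAR_MAX so that i++ cannot overflow.
 */
static size_t
sk_count_in_range(wchar_t start, wchar_t finish, wchar_t limit)
{
	size_t n = 0;
	wchar_t i;

	for (i = 0; i <= limit; i++) {
		if (sk_range_cmp(start, i) <= 0 &&
		    sk_range_cmp(i, finish) <= 0)
			n++;
	}
	return (n);
}
```

With limit set to UCHAR_MAX this mirrors the shape of the loop in the diff.
With limit set to 0x10FFFF, every single range in a pattern costs more than
a million iterations, each making one or two wcscoll() calls.
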
> Better to stop pretending that we are able to support collation in
> ranges, since POSIX cares only about its own locale here:
> "In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified
> behavior: strictly conforming applications shall not rely on whether the
> range expression is valid, or on the set of collating elements matched."
>
> Until a bitmask for the whole Unicode range is implemented (if ever), we
> had better stop pretending to honor collation order, which we just can't
> do with wchars now, and do what NetBSD/OpenBSD do (using wchar_t)
> instead. That does not prevent the memory consumption on big ranges (a
> bitmask is needed, see above), but it at least fixes the problem that
> only the first 256 wchars are considered.
>
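
For reference, a minimal sketch of what a plain code-point-order range
could look like, assuming a bitmask for the first 256 values plus a list
of [min, max] pairs for everything above. All names here (sk_cset,
sk_addrange, sk_in, sk_wc_t) are hypothetical and are not the regcomp.c
or NetBSD/OpenBSD code; sk_wc_t is just a stand-in for wint_t.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define	SK_NC	((uint32_t)UCHAR_MAX + 1)	/* values covered by the bitmask */

typedef uint32_t sk_wc_t;			/* stand-in for wint_t */

struct sk_range {
	sk_wc_t	min, max;
};

struct sk_cset {
	unsigned char	 bmp[(UCHAR_MAX + 1) / 8];	/* one bit per value below SK_NC */
	struct sk_range	*ranges;		/* [min, max] pairs at or above SK_NC */
	size_t		 nranges;
};

/*
 * Add [start, finish] in code point order, with no collation lookup.
 * Returns 0 on success, -1 for a reversed range (REG_ERANGE in regcomp
 * terms) and -2 when out of memory.
 */
static int
sk_addrange(struct sk_cset *cs, sk_wc_t start, sk_wc_t finish)
{
	struct sk_range *nr;
	sk_wc_t c;

	assert(cs != NULL);
	if (start > finish)
		return (-1);
	/* The low part of the range goes into the bitmask. */
	for (c = start; c < SK_NC && c <= finish; c++)
		cs->bmp[c >> 3] |= (unsigned char)(1u << (c & 7));
	if (c > finish)
		return (0);
	/* The remainder is stored as a single pair, whatever its width. */
	nr = realloc(cs->ranges, (cs->nranges + 1) * sizeof(*nr));
	if (nr == NULL)
		return (-2);
	cs->ranges = nr;
	cs->ranges[cs->nranges].min = c;
	cs->ranges[cs->nranges].max = finish;
	cs->nranges++;
	return (0);
}

/* Membership test: bitmask below SK_NC, linear scan of the pairs above. */
static bool
sk_in(const struct sk_cset *cs, sk_wc_t ch)
{
	size_t i;

	assert(cs != NULL);
	if (ch < SK_NC)
		return ((cs->bmp[ch >> 3] & (1u << (ch & 7))) != 0);
	for (i = 0; i < cs->nranges; i++) {
		if (ch >= cs->ranges[i].min && ch <= cs->ranges[i].max)
			return (true);
	}
	return (false);
}
```

In this sketch the cost of adding a range is bounded by the 256 bitmask
entries plus one stored pair, so it does not depend on how wide the range
is. The trade-off is that the result follows code point order rather than
the locale's collation order.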

Sadly, regex is one part of the system that could use a maintainer :(
I have been forced to look at it more than I'd like, but I don't
really use the collation support at all.

Pedro.


More information about the svn-src-head mailing list